Threading, MPI, TBB and GPU in CmdStan

Hi,

To parallelize sampling I initially used MPI. As I understand it, as long as you are not using more than one node, threading is no less effective. Right?

There is a thread, "Benefits of parallelization with a threadpool of the Intel TBB", where it is shown that TBB is more efficient than MPI. Is there any way to invoke the different TBB methods demonstrated in that thread?

I have compiled cmdstan with the following flags in make/local:
STAN_OPENCL=true
OPENCL_DEVICE_ID=0
OPENCL_PLATFORM_ID=0

Does this mean that during sampling cmdstan uses the GPU only for the Cholesky decomposition and the CPU for everything else? Does it also use the GPU for other matrix operations?

Thanks

Hi Linas,

If you are using map_rect, which I presume you are since you are using MPI, the only thing you need to do in order to use TBB for threading is:
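A minimal sketch of what that setup usually looks like, assuming the standard CmdStan threading switch (`STAN_THREADS` in make/local) and the `STAN_NUM_THREADS` environment variable for the thread count. This is the general setup as I know it, not necessarily the exact snippet intended above:

```
# make/local
# Build with threading enabled so that map_rect can run in parallel.
STAN_THREADS=true
```

After changing make/local you would rebuild CmdStan and recompile the model, then set the thread count in the environment before running it, e.g. `export STAN_NUM_THREADS=4` (the 4 is only an example value; -1 uses all available cores).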

Regarding OpenCL support:

If you are using CmdStan 2.21 with OpenCL, the following functions will run on the GPU if the input size is large enough:

  • cholesky_decompose
  • matrix multiplication
  • mdivide_left_tri_low
  • mdivide_right_tri_low
  • gp_cov_exp_quad

The plan is to support most Stan functions on GPUs for the 2.22 release, but that doesn't help you here.

At the moment the TBB is used to parallelise map_rect, and I would expect MPI to give you the same performance. The thread you refer to doesn’t even compare against MPI, as far as I can see. From my experience, the TBB map_rect is now just as fast as MPI. Threading used to lag behind MPI in speed, but that gap is gone now that the TBB is used.

Thanks for the help. Should I expect a difference between threading and MPI (in favor of MPI)?

No