To parallelize sampling I initially used MPI. As I understand it, as long as you are not using more than one node, threading is no less effective. Right?
There is a thread, "Benefits of parallelization with a threadpool of the Intel TBB", where it is shown that TBB is more efficient than MPI. Is there any way to invoke the different TBB methods demonstrated in that thread?
I have compiled cmdstan with the following flags in make/local
Does this mean that CmdStan uses the GPU exclusively for Cholesky when sampling, and the CPU for everything else? Does it use the GPU for matrix operations?
If you are using map_rect, which I presume you are since you are using MPI, the only thing you need to do in order to use TBB for threading is:
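For reference, a minimal sketch of the usual threading setup, assuming a CmdStan 2.21 checkout. The thread count mentioned in the comments is a placeholder, and this is not necessarily the exact snippet from the original reply:

```make
# make/local (in the CmdStan root directory)
# Build models with threading support; in 2.21 map_rect then runs on the TBB.
STAN_THREADS=true

# After editing this file, rebuild CmdStan and recompile the model.
# At run time, set the environment variable STAN_NUM_THREADS before
# launching the model executable, e.g. STAN_NUM_THREADS=4 (placeholder),
# or -1 to use all available cores.
```

The Stan program itself does not need to change: the same map_rect call is dispatched to threads instead of MPI processes.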
Regarding OpenCL support:
If you are using CmdStan 2.21 with OpenCL, the following functions will run on the GPU if the input size is large enough:
- matrix multiplication
The plan is to support most Stan functions with GPUs for the 2.22 release, but that doesn't help you here.
At the moment the TBB is used to parallelise map_rect, and I would expect MPI to give you the same performance. As far as I can see, the thread you refer to doesn't even compare against MPI. In my experience, the TBB map_rect is now just as fast as MPI. Threading used to lag behind MPI in speed, but that slowness is gone with the use of the TBB.
Thanks for the help. Should I expect a difference between threading and MPI (in favor of MPI)?