Here is the short recipe I recommend to users. It assumes that your program is written with map_rect and that you know it benefits from it:
- Start with TBB threading, as it's much simpler to set up.
- Stay with TBB threading if you do not have a super-fast network (like InfiniBand) linking your machines.
- Only if you have a super-fast network and are willing to deal with the MPI setup (which involves configuring Stan-math and your computing environment) should you switch over to MPI and disable threading.
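To make the recipe concrete, here is a sketch of the two CmdStan build configurations. The `STAN_THREADS`/`STAN_MPI` make flags and the `STAN_NUM_THREADS` environment variable are the real CmdStan knobs; the model binary name and thread count are placeholders.

```shell
# Threading: add STAN_THREADS=true to make/local, rebuild the model,
# and choose the number of threads per chain at run time.
echo "STAN_THREADS=true"        # this line goes into make/local
export STAN_NUM_THREADS=4       # threads available to each chain

# MPI: instead put STAN_MPI=true into make/local (leave threading off)
# and launch the rebuilt model through your MPI runner, e.g.:
echo "STAN_MPI=true"            # this line goes into make/local
# mpirun -n 8 ./my_model sample data file=data.json
```

Note that the threading variant is purely a local build switch, while the MPI variant additionally requires a working MPI compiler and runtime on every machine involved.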
So basically, stay with the TBB in almost all circumstances. If one machine does not have enough power for all your chains, then exploit multiple machines by starting one chain per machine, and use the TBB on each machine.
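The one-chain-per-machine pattern can be sketched as below. The host names (`node1`..`node4`), the binary name `./my_model`, and the thread count are hypothetical; the script only assembles and prints the launch commands (you would actually run them over ssh or through your scheduler).

```shell
# Sketch: one chain per machine, each chain using 8 TBB threads locally.
# Commands are collected and printed rather than executed.
chains=""
for host in node1 node2 node3 node4; do
  cmd="ssh $host 'STAN_NUM_THREADS=8 ./my_model sample data file=data.json'"
  chains="$chains$cmd
"
done
printf '%s' "$chains"
```

This gives you between-machine parallelism for chains without any MPI setup at all; the TBB handles the within-chain parallelism on each node.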
The rationale is that
- Turning on threading is a lot easier: it's merely a compiler switch and that's it!
- No need to find out about your MPI compiler
- No need for a queuing system
- No need for a fast network
So MPI is there to rescue you when you have to go really big.
Now, there is one more really cool novelty since Stan 2.20: turning on threading used to cost you up to ~20% in performance, so a single-core run went ~20% slower just by enabling threading. That performance penalty is now basically gone. I would actually recommend just turning threading on and forgetting about it.
The TBB is at the moment only explicitly used in map_rect - more is to come. However, that is only half the truth: on macOS we also use the tbbmalloc_proxy library, which replaces the system malloc - this speeds up my Stan programs by ~15% even for single-core runs. It is turned on by default regardless of whether threading is enabled. This speedup hasn't been observed on Windows or Linux, which is why it is not enabled on those platforms.
I hope that clarifies matters.
In super short: just turn on threading and don't worry. Scale across machines by starting chains on different machines. If you really need more, then switch to the MPI story - but only as a last resort.