Threading, MPI, and TBB for users

I’ve seen a lot of discussion about Stan 2.21’s incorporation of the Intel TBB library, and I’m curious under what situations users should expect to see the effects. My understanding of the previous parallelization system is that you needed either threading or MPI; threading would only work on a single machine/node, while MPI worked across nodes and generally performed better, but was harder to set up. The two could also be combined.
How does TBB change this?

  • Does TBB have any direct effect on MPI?
  • Is it now recommended to use threading/TBB over MPI on single node/machine jobs?
  • Does TBB affect anything that isn’t part of a map_rect call?

Thanks for your time.


Here is the short recipe I recommend to users. It assumes that your program is written with map_rect and that you know it benefits from parallelisation. Then:

  1. Start with TBB threading, as it’s much simpler to set up
  2. Stay with TBB threading if you do not have a super-fast network like InfiniBand to link computers
  3. Only if you have a super-fast network and you are willing to deal with MPI setup (involving setting up Stan-math and your computing environment) then switch over to MPI and disable threading.

So basically stay with the TBB in almost all circumstances. In case one machine does not have enough power for all your chains, then exploit different machines by starting one chain per machine. Then on each machine you use the TBB.

The rationale is that

  • Turning on threading is a lot easier: It’s merely a compiler switch and that’s it!
  • No need to find out about your MPI compiler
  • No need for a queuing system
  • No need for a fast network

So MPI is there to rescue you in case you have to go really big.

Now, there is one more really cool novelty since 2.20 of Stan: Turning on threading used to cost you up to ~20% of performance. So a single-core run went ~20% slower just by turning on threading. This performance penalty of threading is basically gone. I would actually recommend to just turn threading on and forget about it.

The TBB is currently used explicitly only in map_rect, with more to come. However, that is only half the truth. On macOS we use the tbbmalloc_proxy library, which replaces the system malloc. This speeds up my Stan programs by ~15%, even for single-core runs, and it is turned on by default whether or not threading is enabled. The same speedup hasn’t been seen on Windows or Linux, which is why it is not turned on for those platforms.

I hope that clarifies matters.

In super short: Just turn on threading and don’t worry. Scale over different machines via starting chains on different machines. In case you really need more then switch to the MPI story - but only as a last resort.


One quite important point, in my opinion, is the cost of calling map_rect multiple times in the same model. With TBB, will there be a cost to calling map_rect ten times instead of packaging everything (super hard and bug-prone) into a single map_rect call?

Just call it ten times if that avoids bugs!

Provided each of these calls does a reasonable amount of work, enough to warrant the parallelisation overhead, I do think that multiple map_rect calls should be just fine.
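
To make the "just call it several times" advice concrete, here is a minimal sketch of a model with two separate map_rect calls, one per likelihood component. Everything in it (the function names lik_a and lik_b, the data y and counts, the priors) is hypothetical and only there for illustration. It uses the array syntax of the Stan 2.21 era and has not been benchmarked.

```stan
functions {
  // Hypothetical shard function 1: normal log likelihood for one group.
  vector lik_a(vector phi, vector theta, real[] x_r, int[] x_i) {
    return [normal_lpdf(x_r | theta[1], phi[1])]';
  }
  // Hypothetical shard function 2: Poisson log likelihood for one group.
  vector lik_b(vector phi, vector theta, real[] x_r, int[] x_i) {
    return [poisson_log_lpmf(x_i | theta[1])]';
  }
}
data {
  int<lower=1> J;             // number of shards (groups)
  int<lower=1> N;             // observations per shard
  real y[J, N];               // continuous outcomes, one row per shard
  int<lower=0> counts[J, N];  // count outcomes, one row per shard
}
transformed data {
  // Empty placeholders for the arguments a given call does not use.
  int no_int[J, 0];
  real no_real[J, 0];
  vector[0] no_phi;
}
parameters {
  vector[1] mu[J];        // shard-specific mean
  vector[1] log_rate[J];  // shard-specific log rate
  real<lower=0> sigma;    // shared scale
}
model {
  vector[1] phi = [sigma]';
  for (j in 1:J) {
    mu[j] ~ normal(0, 5);
    log_rate[j] ~ normal(0, 2);
  }
  sigma ~ normal(0, 1);
  // Two independent map_rect calls; each one is parallelised on its own.
  target += sum(map_rect(lik_a, phi, mu, y, no_int));
  target += sum(map_rect(lik_b, no_phi, log_rate, no_real, counts));
}
```

Each call is dispatched and completed before the next one starts, so the per-call overhead is paid twice here. That is fine as long as each shard does enough work.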

Thanks, so even with TBB we have to be careful about starting new map_rect calls. My understanding was that with TBB all calls go into one pile that gets shared across processes all together.

As a practical example, I have statements like

for(s in 1:150) simplex[s] ~ dirichlet()

Given that this is not vectorisable, would it be worth creating 150 shards and using map_rect, or perhaps packaging it into 5 shards and calling map_rect?

Things go on a pile, yes… but one pile after the next (unless you nest it).

For each map_rect job you create there is still some considerable overhead going along with it. So this loop is probably better to split into bigger chunks… but I doubt that 150 dirichlets are worthwhile to parallelise.
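
For illustration only, here is a rough sketch of what "bigger chunks" could look like for that loop: 150 simplexes packed into 5 shards of 30. It assumes the simplexes are parameters of common dimension K and share a single concentration vector alpha. None of that is stated in the question, so the names alpha, n_shards, per_shard, s and shard_lp are all made up. It uses Stan 2.21-era array syntax.

```stan
functions {
  // Summed Dirichlet log density for all simplexes packed into one shard.
  // phi:   shared concentration vector (assumed identical for all simplexes)
  // theta: the shard's simplexes laid end to end
  // x_i:   {number of simplexes in the shard, simplex dimension K}
  vector shard_lp(vector phi, vector theta, real[] x_r, int[] x_i) {
    int n = x_i[1];
    int K = x_i[2];
    real lp = 0;
    for (i in 1:n)
      lp += dirichlet_lpdf(segment(theta, (i - 1) * K + 1, K) | phi);
    return [lp]';
  }
}
data {
  int<lower=1> K;            // simplex dimension
  int<lower=1> n_shards;     // e.g. 5
  int<lower=1> per_shard;    // e.g. 30, so n_shards * per_shard = 150
  vector<lower=0>[K] alpha;  // hypothetical shared concentration
}
transformed data {
  real x_r[n_shards, 0];     // no real-valued data needed
  int x_i[n_shards, 2];
  for (j in 1:n_shards) {
    x_i[j, 1] = per_shard;
    x_i[j, 2] = K;
  }
}
parameters {
  simplex[K] s[n_shards * per_shard];
}
model {
  // Pack the simplexes of each shard into one long parameter vector.
  vector[per_shard * K] theta[n_shards];
  for (j in 1:n_shards)
    for (i in 1:per_shard)
      theta[j, ((i - 1) * K + 1):(i * K)] = s[(j - 1) * per_shard + i];
  target += sum(map_rect(shard_lp, alpha, theta, x_r, x_i));
}
```

The packing loop and the map_rect dispatch are themselves overhead, which is the reason to doubt that 150 cheap Dirichlet terms repay the effort. Timing this against the plain loop would be the only way to know for a given model.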

Hopefully we will have the AD system refactored in the way I envision it sometime soon. Once that is done, the overhead due to parallelism will be much lower… but changing the AD core requires great care and caution, of course.


Doing threads over multiple nodes in a cluster seems like the more complex hybrid parallelism. I think people could benefit from a case study of these more advanced types of parallelization (e.g. multiple machines with threads etc).