Behavior of threading with reduce_sum()

For reduce_sum, I’d like to configure threading so that all available threads are used: the chains start out balanced, and as the faster chains finish, the slower chains pick up the threads that the finished chains were using.

The reduce_sum tutorial, which uses the cmdstanr interface, says:

"Set the number of threads each chain will use with set_num_threads"

"As this computer has 8 cores and we intend to run the usual 4 chains, we will use 2 threads per chain to make full use of the processor (4 chains with 2 threads each can make use of the full eight cores)."

But in this approach, slower chains do not start using additional threads when faster chains finish.
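To put a rough number on the problem, here is a minimal Python sketch of the even split and of the core time that sits idle while waiting for the slowest chain. The helper names and the chain wall times are hypothetical and chosen only for illustration; nothing here calls Stan or cmdstanr.

```python
def static_plan(cores, chains):
    """Threads per chain for an even split with no over-subscription."""
    return max(1, cores // chains)


def idle_core_seconds(durations, threads_per_chain):
    """Core-seconds left unused between each chain finishing and the slowest one."""
    longest = max(durations)
    return sum(threads_per_chain * (longest - d) for d in durations)


# 8 cores and 4 chains, as in the tutorial quote
threads = static_plan(cores=8, chains=4)

# Hypothetical chain wall times in seconds (made up, not measured)
durations = [100, 110, 120, 200]

idle = idle_core_seconds(durations, threads)
print(threads, idle)
```

With these made-up durations the three faster chains leave their cores unused for a large share of the run, which is the time a slower chain could in principle reclaim.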


Yeah, this is what I would like to do as well. We have not explored this option yet, but it does make a lot of sense. Say you have 8 cores and want to run 4 chains. Then you can simply over-subscribe: give each chain 4 threads, for example, and start all 4 chains at once, so that in total you request 16 threads, which is more than the 8 cores on your machine.
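As a quick check of the arithmetic, a small Python sketch of the over-subscription in that example. The function name is hypothetical, and it only counts requested threads; it says nothing about how the scheduler actually behaves.

```python
def oversubscription(cores, chains, threads_per_chain):
    """Return total threads requested and the ratio of requested threads to cores."""
    requested = chains * threads_per_chain
    return requested, requested / cores


# 8 cores, 4 chains, 4 threads per chain, as in the example above
requested, ratio = oversubscription(cores=8, chains=4, threads_per_chain=4)
print(requested, ratio)
```

A ratio above 1 means more threads are requested than there are cores, so the operating system and the threading library have to share the cores among them.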

I think that the TBB will do the right thing and handle the situation efficiently, but I have not had the time to try this out yet and would be curious what you find. So go ahead!

(I was super busy with getting this into 2.23)


Bump. So what did you find?

Anecdotally (just one model), specifying all cores for each chain worked fine, and when one chain finished, all cores remained in use until the last chain finished, but this approach was significantly slower. I’d guess it has something to do with efficiency in whatever handles scheduling of threads, but my intuition isn’t based on specific understanding. Also, my test wasn’t very scientific as I was trying to get the model finished, but I can try more systematic tests with specific models, and it would also help to understand how thread scheduling works to know what behavior we might expect.


Thanks. We use the Intel TBB library, if you want to read up on the details.