Behavior of threading with reduce_sum()

ssp3nc3r · May 1, 2020, 5:28pm

For reduce_sum, I’d like to set up threading so that all available threads are used: chains start out balanced, but as the faster chains finish, the slower chains can pick up the threads those chains were using.

The reduce_sum tutorial, which uses the cmdstanr interface, says:

https://mc-stan.org/users/documentation/case-studies/reduce_sum_tutorial.html

Set the number of threads each chain will use with set_num_threads

set_num_threads(2)

explaining,

As this computer has 8 cores and we intend to run the usual 4 chains, we will use 2 threads per chain to make full use of the processor (4 chains with 2 threads each can make use of the full eight cores).

But in this approach, slower chains do not start using additional threads when faster chains finish.
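To make the concern concrete, here is a rough toy model in Python, not Stan or cmdstanr code. It assumes each chain has a fixed amount of work (in core-seconds) and that work scales perfectly across threads, which real reduce_sum runs will not achieve. The function names and the workload numbers are made up for illustration.

```python
def static_makespan(work, threads_per_chain):
    """Finish time when every chain keeps a fixed number of threads.

    The run ends when the slowest chain ends; threads freed by
    faster chains sit idle.
    """
    return max(w / threads_per_chain for w in work)


def shared_makespan(work, cores):
    """Idealized finish time when freed threads are reused.

    Assumes perfect rebalancing and zero scheduling overhead, so all
    cores stay busy until the total work is done. This is a lower
    bound, not a prediction.
    """
    return sum(work) / cores


# Hypothetical workloads: three fast chains and one slow chain.
work = [8.0, 8.0, 8.0, 16.0]

print(static_makespan(work, threads_per_chain=2))  # 8.0
print(shared_makespan(work, cores=8))              # 5.0
```

The gap between the two numbers is the most that reusing idle threads could gain in this toy setting; scheduling overhead in a real run can shrink or reverse it.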

wds15 · May 1, 2020, 9:58pm

Yeah, this is what I would like to do as well. We have not explored this option yet, but it does make a lot of sense. Say you have 8 cores and want to run 4 chains. Then you can simply over-subscribe each chain. That is, give each chain 4 cores, for example, and still start 4 chains, so that in total you subscribe 16 cores, which is more than you have on your machine.
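The arithmetic behind this can be sketched in a few lines of Python. This is only bookkeeping, not a Stan or TBB call, and the helper name `subscription` is made up for illustration.

```python
def subscription(cores, chains, threads_per_chain):
    """Return (threads requested in total, ratio of requested threads to cores).

    A ratio above 1.0 means the machine is over-subscribed.
    """
    requested = chains * threads_per_chain
    return requested, requested / cores


# Tutorial setup: 4 chains with 2 threads each on 8 cores.
print(subscription(cores=8, chains=4, threads_per_chain=2))  # (8, 1.0)

# Over-subscribed setup from this post: 4 chains with 4 threads each.
print(subscription(cores=8, chains=4, threads_per_chain=4))  # (16, 2.0)
```

With a ratio of 2.0, twice as many threads are requested as there are cores, so how well this works depends on how the scheduler shares the cores among them.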

I think that the TBB will do the right thing and handle the situation efficiently, but I did not have the time to try this out yet and would be curious what you find. So go ahead!

(I was super busy with getting this into 2.23)

wds15 · May 4, 2020, 7:04pm

Bump. So what did you find?

ssp3nc3r · May 4, 2020, 9:00pm

Anecdotally (just one model), specifying all cores for each chain worked fine: when one chain finished, all cores remained in use until the last chain finished. However, this approach was significantly slower. I’d guess it has something to do with the efficiency of whatever handles the scheduling of threads, but that intuition isn’t based on any specific understanding. Also, my test wasn’t very scientific, as I was trying to get the model finished. I can try more systematic tests with specific models, and it would also help to understand how thread scheduling works so we know what behavior to expect.

wds15 · May 4, 2020, 9:26pm

Thanks. We use the Intel TBB library, in case you want to read up on the details.
