Reduce_sum cores, chains, threads

I'm following the case study "Reduce Sum: A Minimal Example" and wondering how to use a cluster of cores for speed. Should I set:

  1. num_chains * num_threads = # physical cores ?

  2. num_cores = num_chains ?
    num_cores = num_chains * num_threads ?

To parallelize Stan over different chains (not within chains), I think this discourse post recommended:

  1. num_chains = # physical cores ?
  2. num_cores = num_chains ?

thank you !!

I would go with the first option 1, but if you find out something different, then let us know.

Edit: Considering only physical cores works best.

Thank you !

What do you mean by the first option ? To be more clear, I have two separate questions:

  1. Do I choose num_chains (set in sample()) and num_threads (set in set_num_threads()) such that num_chains * num_threads = # physical cores ?

  2. Do I choose num_cores (set in sample())
    (a) such that num_cores = num_chains ?
    (b) such that num_cores = num_chains * num_threads ?

I meant 1. Feel free to explore other options, but this is what is known to work well. If you gain valuable experience, please share.


thanks ! so you meant to say “yes” to question 1, i.e. we should take num_chains * num_threads = # physical cores.

Do you happen to know the answer to question 2, i.e. what should num_cores be set to ? whether we want (a) or (b) ? thanks again !


Equal to the number of physical cores.

At least that’s my experience. Stan needs a lot of CPU cache to work well, so hyperthreading does not help.

Hi @wds15,

I’m a little confused, so sorry if this is a stupid question ;).
In “Reduce Sum: A Minimal Example” it also says to set chains * threads (4 * 2 in the example) equal to the number of physical cores (8 in the example), but the number of cores is left equal to the number of chains (4 in the example). Should the number of cores have been set to 8? Or is there something special going on in that example?


It would not matter. Cores sets the number of chains running in parallel, i.e. how many chains run concurrently. With 4 chains, at most 4 can run at once, so setting cores to 8 changes nothing. Threads controls within-chain CPU use.
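
To make the arithmetic concrete, here is a minimal Python sketch of the rule of thumb discussed above. The helper `plan_parallelism` and its argument names are hypothetical (they are not part of Stan or cmdstanr); it only encodes the assumptions that chains * threads should not exceed the number of physical cores and that no more chains can run concurrently than exist.

```python
def plan_parallelism(physical_cores: int, chains: int) -> dict:
    """Hypothetical helper: split physical cores across chains.

    Assumes the rule of thumb from this thread:
    parallel chains * threads per chain <= physical cores.
    """
    if physical_cores < 1 or chains < 1:
        raise ValueError("physical_cores and chains must be positive")
    # No more chains can run at once than exist, or than there are cores.
    parallel_chains = min(chains, physical_cores)
    # Whatever capacity is left per chain goes to within-chain threads.
    threads_per_chain = max(1, physical_cores // parallel_chains)
    return {
        "chains": chains,
        "parallel_chains": parallel_chains,
        "threads_per_chain": threads_per_chain,
        "cores_used": parallel_chains * threads_per_chain,
    }


# The case-study setup: 4 chains on 8 physical cores.
print(plan_parallelism(physical_cores=8, chains=4))
# A single chain on a 4-core machine.
print(plan_parallelism(physical_cores=4, chains=1))
```

For the case-study numbers this gives 4 parallel chains with 2 threads each, which is why raising cores from 4 to 8 would make no difference there.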

I get no speed-up at all, following Reduce Sum: A Minimal Example. I tried my local machine and a cluster of cores, different values for chains, cores, threads, but I rarely get any speed-up at all, and never anywhere close to the 2.7 speed-up in the case study.

I am now attempting this in regular CmdStan to see if any speed-up is possible there (see Cmdstanr reduce sum case study), but I get: unused argument (threads = TRUE).

Got it! :). Thanks again @wds15

are others able to replicate the case study ? thank you !

I’m able to replicate the case study, but my local machine only has 4 cores, so I updated to cores=1, chains=1, and set_num_threads(4). I had a speedup of ~2.5 with that setup.

thanks @djgustafson ! unfortunately this didn’t work for me either, still unable to replicate the speed-up in the case study (or any speed-up at all !).

I was able to replicate the case study on a home-assembled PC built around an Intel i9-7960X CPU (16 cores, 32 threads). I followed the case study exactly, except that I set the number of threads to 4 per chain. The timings were as follows. Unthreaded base model:

[1] 291.7603

  chain_id   warmup sampling    total
1        1 143.0066 147.9823 291.7224
2        2 139.7483 129.8698 270.4000
3        3 138.5047 123.3652 262.6477
4        4 140.6788 127.5929 269.0466

with reduce_sum:

[1] 81.71897

  chain_id   warmup sampling    total
1        1 41.84857 36.96113 79.13284
2        2 39.39713 38.37734 78.19834
3        3 39.60059 41.78128 81.69504
4        4 40.41441 38.59576 79.42406

So that is about a factor of 3.6 improvement in total execution time. I haven’t experimented much with different chain/thread combinations, except to note that there appears to be no incremental improvement when chains * threads > number of physical cores.
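
As a quick check of that factor, here is a small Python sketch that recomputes the speed-up from the total times posted above. The variable and function names are illustrative only, and the "parallel efficiency" figure (speed-up divided by threads per chain) is the usual textbook definition rather than anything reported by Stan or cmdstanr.

```python
# Total times copied from the two runs posted above.
base_total = 291.7603      # unthreaded base model
threaded_total = 81.71897  # reduce_sum with 4 threads per chain
threads_per_chain = 4


def speedup(baseline: float, improved: float) -> float:
    """Ratio of baseline run time to improved run time."""
    if improved <= 0:
        raise ValueError("improved time must be positive")
    return baseline / improved


total_speedup = speedup(base_total, threaded_total)
# Fraction of the ideal (linear) speed-up actually achieved per thread.
efficiency = total_speedup / threads_per_chain

print(f"speed-up: {total_speedup:.2f}x")
print(f"parallel efficiency: {efficiency:.0%}")
```

The ratio comes out a little under 3.6, consistent with the figure quoted above, so each of the 4 threads delivers somewhat less than a full core's worth of gain.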

I’m running Windows 10 with R 4.0.0, Rtools 4.0, cmdstan 2.23.0, cmdstanr