Reduce_sum cores, chains, threads

shira · May 13, 2020, 11:56pm

I am following the “Reduce Sum: A Minimal Example” case study and wondering how to use a cluster of cores for speed.

num_chains * num_threads = # physical cores ?
num_cores = num_chains ?
or
num_cores = num_chains * num_threads ?

To parallelize Stan over different chains (not within chains), I think this discourse post recommended:

num_chains = # physical cores ?
num_cores = num_chains ?

thank you !!

wds15 · May 14, 2020, 11:54am

I would go with option 1, but if you find out something different, then let us know.

Edit: Considering only physical cores works best.

shira · May 14, 2020, 2:14pm

Thank you !

What do you mean by the first option? To be clearer, I have two separate questions:

1. Do I choose num_chains (set in sample()) and num_threads (set in set_num_threads()) such that num_chains * num_threads = # physical cores?
2. Do I choose num_cores (set in sample())
(a) such that num_cores = num_chains?
or
(b) such that num_cores = num_chains * num_threads?

wds15 · May 14, 2020, 8:04pm

I meant 1. Feel free to explore other options, but this is what is known to work well. If you gain useful experience, please share it.

shira · May 14, 2020, 8:41pm

Thanks! So you meant to say “yes” to question 1, i.e. we should take num_chains * num_threads = # physical cores.

Do you happen to know the answer to question 2, i.e. what num_cores should be set to: (a) or (b)? Thanks again!

wds15 · May 14, 2020, 8:52pm

Correct

Equal to the number of physical cores.

At least that’s my experience. Stan needs a lot of CPU cache to work well, so hyper-threading does not help.
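
The rule of thumb in this exchange (chains × threads per chain = physical cores) can be turned into a small calculation. The sketch below is plain Python arithmetic, not a CmdStan or CmdStanR call; `threads_per_chain` and its parameters are hypothetical names, and the physical core count is passed in by hand because it is machine-specific.

```python
def threads_per_chain(physical_cores: int, chains: int) -> int:
    """Largest thread count per chain such that chains * threads <= physical cores."""
    if physical_cores < 1 or chains < 1:
        raise ValueError("physical_cores and chains must be positive")
    # Never go below one thread, even if there are more chains than cores.
    return max(1, physical_cores // chains)


# 4 chains on a machine with 8 physical cores.
print(threads_per_chain(8, 4))  # 2
# A single chain on a machine with 4 physical cores.
print(threads_per_chain(4, 1))  # 4
```

Integer division rounds down, so any leftover cores simply stay idle rather than being oversubscribed.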

Joran · May 16, 2020, 10:02am

Hi @wds15,

I’m a little confused, so sorry if this is a stupid question ;).
In “Reduce Sum: A Minimal Example” it also says to set chains * threads (4 * 2 in the example) equal to the number of physical cores (8 in the example), but the number of cores is left equal to the number of chains (4 in the example). Should the number of cores have been set to 8? Or is there something special going on in that example?

Best,
Joran

wds15 · May 16, 2020, 10:58am

It would not matter. Cores sets the number of chains run in parallel, i.e. how many chains run concurrently. Threads controls within-chain CPU use.
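
To make the distinction concrete: if cores only caps how many chains run at once, then the number of busy threads is min(chains, cores) × threads, and raising cores above chains changes nothing. A minimal Python sketch of that arithmetic follows; the function name `busy_threads` is hypothetical, and it models only the description given here, not the actual scheduler.

```python
def busy_threads(chains: int, cores: int, threads: int) -> int:
    """Threads in use at once, assuming `cores` caps the number of concurrent chains."""
    concurrent_chains = min(chains, cores)
    return concurrent_chains * threads


# 4 chains with 2 threads each: cores=4 and cores=8 give the same load.
print(busy_threads(chains=4, cores=4, threads=2))  # 8
print(busy_threads(chains=4, cores=8, threads=2))  # 8
# With cores=1 the chains run one after another, so only 2 threads are busy.
print(busy_threads(chains=4, cores=1, threads=2))  # 2
```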

shira · May 16, 2020, 3:04pm

I get essentially no speed-up when following “Reduce Sum: A Minimal Example”. I tried my local machine and a cluster of cores, with different values for chains, cores, and threads, but I rarely get any speed-up at all, and never anything close to the 2.7 speed-up in the case study.

I am now attempting this in regular CmdStan to see if any speed-up is possible there (see “Cmdstanr reduce sum case study”), but I get the error: unused argument (threads = TRUE).

Joran · May 16, 2020, 3:52pm

Got it! :). Thanks again @wds15

shira · May 20, 2020, 6:49pm

Are others able to replicate the case study? Thank you!

djgustafson · May 21, 2020, 4:25pm

I’m able to replicate the case study, but my local machine only has 4 cores, so I updated to cores=1, chains=1, and set_num_threads(4). I had a speedup of ~2.5 with that setup.

shira · May 27, 2020, 8:59pm

Thanks @djgustafson! Unfortunately this didn’t work for me either. I am still unable to replicate the speed-up in the case study (or get any speed-up at all!).

Michael_Peck · May 28, 2020, 2:22pm

I was able to replicate the case study on a home-assembled PC built around an Intel i9-7960X CPU (16 cores, 32 threads). I followed the case study exactly, except that I set the number of threads to 4 per chain. The timings were as follows. Unthreaded base model:

fit0$time()
$total
[1] 291.7603

$chains
  chain_id   warmup sampling    total
1        1 143.0066 147.9823 291.7224
2        2 139.7483 129.8698 270.4000
3        3 138.5047 123.3652 262.6477
4        4 140.6788 127.5929 269.0466

With reduce_sum:

fit1$time()
$total
[1] 81.71897

$chains
  chain_id   warmup sampling    total
1        1 41.84857 36.96113 79.13284
2        2 39.39713 38.37734 78.19834
3        3 39.60059 41.78128 81.69504
4        4 40.41441 38.59576 79.42406

So about a factor of 3.6 improvement in total execution time. I haven’t experimented much with different chain/thread combinations, except to note that there appears to be no incremental improvement when chains * threads exceeds the number of physical cores.
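
The factor quoted above follows directly from the two `$total` values. Here is a short Python check using only the numbers reported in this post; dividing by the 4 threads per chain to get a parallel efficiency is an assumption of this sketch, namely that the baseline ran one thread per chain.

```python
# Wall-clock totals in seconds, copied from the fit0$time() and fit1$time() output.
total_unthreaded = 291.7603
total_reduce_sum = 81.71897
threads_per_chain = 4

speedup = total_unthreaded / total_reduce_sum
# Efficiency: fraction of the ideal 4x speed-up actually achieved
# (assumes the baseline ran one thread per chain).
efficiency = speedup / threads_per_chain

print(f"speed-up:   {speedup:.2f}x")    # speed-up:   3.57x
print(f"efficiency: {efficiency:.0%}")  # efficiency: 89%
```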

I’m running Windows 10 with R 4.0.0, Rtools 4.0, cmdstan 2.23.0, cmdstanr 0.0.0.9000.
