I’m a little confused, so sorry if this is a stupid question ;).
In “Reduce Sum: A Minimal Example” it also says to set chains * threads (4 * 2 in the example) equal to the number of physical cores (8 in the example), but the number of cores is left equal to the number of chains (4 in the example). Should the number of cores have been set to 8? Or is there something special going on in that example?
I get no speed-up at all when following Reduce Sum: A Minimal Example. I tried my local machine and a cluster, with different values for chains, cores, and threads, but I rarely get any speed-up, and never anything close to the 2.7x speed-up in the case study.
I’m able to replicate the case study, but my local machine only has 4 cores, so I updated to cores=1, chains=1, and set_num_threads(4). I had a speedup of ~2.5 with that setup.
Thanks @djgustafson! Unfortunately this didn’t work for me either. I’m still unable to replicate the speed-up in the case study (or get any speed-up at all!).
I was able to replicate the case study on a home-assembled PC built around an Intel i9-7960X CPU (16 cores, 32 threads). I followed the case study exactly, except that I set the number of threads to 4 per chain. The timings were - unthreaded base model:
so about a factor of 3.6 improvement in total execution time. I haven’t experimented much with different chain/thread combinations, but there appears to be no incremental improvement once chains * threads exceeds the number of physical cores.
I’m running Windows 10 with R 4.0.0, Rtools 4.0, cmdstan 2.23.0, cmdstanr 0.0.0.9000.
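For anyone trying different chain/thread combinations, here is a rough sketch of the rule of thumb that comes up in this thread: keep chains * threads at or below the number of physical cores. It is plain Python arithmetic, not cmdstanr code, and the helper name `threads_per_chain` is made up for illustration. The rule itself is just the observation above, not something Stan guarantees.

```python
def threads_per_chain(physical_cores: int, parallel_chains: int) -> int:
    """Largest per-chain thread count with chains * threads <= physical cores.

    Hypothetical helper for illustration only; not part of cmdstanr or Stan.
    Falls back to 1 thread per chain when there are more chains than cores.
    """
    if physical_cores < 1 or parallel_chains < 1:
        raise ValueError("physical_cores and parallel_chains must be >= 1")
    return max(1, physical_cores // parallel_chains)


# Setups mentioned in this thread, as (physical cores, chains run in parallel).
for cores, chains in [(8, 4), (4, 1), (16, 4)]:
    threads = threads_per_chain(cores, chains)
    print(f"{cores} cores, {chains} chains: {threads} threads per chain")
```

With 8 cores and 4 chains this gives 2 threads per chain, with 4 cores and 1 chain it gives 4, and with 16 cores and 4 chains it gives 4, which matches the setups people reported above.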