Optimal STAN_NUM_THREADS when using multiple chains



Hey all,

I finally played around with map_rect to parallelize a random-effects location-scale model.
This machine has 6 physical cores, with hyperthreading (so 12 show up in htop/top).

I set STAN_NUM_THREADS to -1 (use all available cores), and it beats the serial version of the same model (when there is enough data to warrant parallelization, anyway).
There are J groups, and I split the data into J shards. I’m playing around with J=75 or so.

I noticed that it typically outperforms the serial model when only one chain/one core is specified in rstan, but does worse when 4 chains/4 cores are specified. I assume this is because it's creating 4 × 12 = 48 threads when only 12 logical (hyperthreaded) cores are available. CPU usage on each core doesn't hit 100% in this scenario, so I assume the thread management is imposing too high a cost.

Are there any guidelines on the optimal number of threads, shards, cores, etc.? Should you choose the number of threads so that threads per chain times the number of parallel chains (rstan's cores) equals the number of CPUs?
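To make the arithmetic behind that question concrete, here is a minimal sketch of the "threads per chain times chains should not exceed the CPU count" rule of thumb. It is only the hypothesis from this post written out, not an official Stan guideline. The function name `thread_budget` and its fields are made up for illustration, and it assumes that -1 really does give every chain 12 threads on this machine.

```python
def thread_budget(logical_cpus: int, chains: int, threads_per_chain: int) -> dict:
    """Compare the threads a run would start against the CPUs that exist."""
    total = chains * threads_per_chain
    return {
        "total_threads": total,
        # Values above 1.0 mean more runnable threads than logical CPUs.
        "oversubscription": total / logical_cpus,
        # Largest per-chain count that keeps chains * threads <= CPUs.
        "suggested_threads_per_chain": max(1, logical_cpus // chains),
    }


# Scenario from the post: 12 logical CPUs, 4 chains, each chain using all 12.
print(thread_budget(logical_cpus=12, chains=4, threads_per_chain=12))
# {'total_threads': 48, 'oversubscription': 4.0, 'suggested_threads_per_chain': 3}
```

Under this rule the 4-chain run would get 3 threads per chain on the 12 logical cores, or just 1 per chain if only the 6 physical cores are counted. Which of those is actually faster is something to benchmark rather than assume.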

Side note: I also noticed that rstan reports VERY inflated time estimates when parallelized map_rect is used. Both the gradient evaluation time and the total estimation time come out much higher than they truly are. E.g., one model took 120 seconds of real time (by a stopwatch next to me), but ~1000 seconds according to rstan. Not a big deal, but there should probably be a big fat warning that the estimated time is probably way overestimated, or the time estimation method should be altered.
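One possible explanation, which is an assumption and not something confirmed here, is that the reported figure is CPU time summed over all threads rather than wall-clock time. The sketch below shows the difference between the two clocks using Python's standard `time` module. The helper names `time_both` and `implied_busy_threads` are hypothetical.

```python
import time


def time_both(fn):
    """Run fn() once and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()   # wall clock: what a stopwatch would show
    cpu_start = time.process_time()    # CPU time of the process, all threads added up
    fn()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def implied_busy_threads(reported_seconds, wall_seconds):
    """If the reported time were summed CPU time, this is the average number of busy threads."""
    return reported_seconds / wall_seconds


# Sleeping uses wall time but almost no CPU time, so the two clocks disagree.
wall, cpu = time_both(lambda: time.sleep(0.05))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")

# Numbers from the post: ~1000 s reported versus 120 s on the stopwatch.
print(round(implied_busy_threads(1000, 120), 1))  # 8.3
```

A ratio of about 8.3 on a machine with 12 logical cores would at least be consistent with summed CPU time, but that is a guess. Timing the fit yourself with a wall clock is the safer number to trust.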