Optimal num_stan_threads when using multiple chains

Hey all,

I finally played around with map_rect to parallelize a random effects location scale model.
This machine has 6 physical cores, with hyperthreading (so 12 show up in htop/top).

I set num_threads to -1, and it beats the serial version of the same model (when there is enough data to warrant parallelization, anyway).
There are J groups, and I split the data into J shards. I’m playing around with J=75 or so.

I noticed that it typically outperforms the serial model when only one chain/one core is specified in rstan, but does worse when 4 chains/4 cores is specified. I assume this is because it’s creating 4*12 threads, when only 12 HT cores are available. CPU usage on each core doesn’t hit 100% in this scenario, so I assume the thread management is imposing too high a cost.

Are there any guidelines on the optimal number of threads, shards, cores, etc? Should you only choose the number of threads such that num_threads * cores = number of CPUs?


Side note: I also noticed that rstan reports VERY inflated time estimates when parallelized map_rect is used - It says the gradient eval time is much higher, and the total estimation time is much higher, than it truly is. E.g., I had a model have a true time of 120 seconds (by a stopwatch next to me), but ~1000 seconds estimated by rstan. Not a big deal, but there should probably be a big fat warning that the estimated time is probably way overestimated, or the time estimation method should be altered.

2 Likes

Thought i was having the same issue. Running Stan in Docker (DigitalOcean) with 6 vCPU (Intel Xeon Gold 6140 @ 2300Hz, 25MB cache). Was hopping to have 16 vCPUs, but even 6 doesnt get more than 66% CPU utilization.

I made sure i had the correct Makevars file:
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -flto -ffat-lto-objects -Wno-unused-local-typedefs -Wno-ignored-attributes -Wno-deprecated-declarations in $HOME/.R/Makevars
and
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = ‘-march=native’)

I get 4 chains/4 cpu at 100%, but 2 cpu’s totally unused.

This seems to be a limitation of Stan as noticed by @Mike_Terrell :

It would save a lot of time if we could split 1 chain over 2 cpu’s (or even 3) to hopefully split the sampling time. But as @Bob_Carpenter points out:

How are you hoping things would work? There is a function that lets you parallelize arbitrary parts of your Stan code

I have seen recommendations, in parallel computing literature it is suggested that the numbers used ought to be powers of 2.

Splitting the code seems to be possible:

If we used more, future would automatically dispatch different chunks of this code across different computers and then reassemble them locally. Anything you can do with furrr or future can now be done remotely.

Here’s a real-life example of using Stan to estimate a bunch of models on a cluster of two 4-CPU, 8 GB RAM machines. It uses the andrewheiss/docker-donors-ngo-restrictions Docker image because it already has Stan and family pre-installed.

He seems to be using a similar approach as this:

Stan is creating two parallel evaluations (the value and the gradient). The gradient is a tree structure that can’t be parallelized in a simple fashion so the technical part, while being worked on, does not have an obvious solution.