Optimal num_stan_threads when using multiple chains

Stephen_Martin · December 5, 2018, 8:31am

Hey all,

I finally played around with map_rect to parallelize a random effects location scale model.
This machine has 6 physical cores, with hyperthreading (so 12 show up in htop/top).

I set num_threads to -1, and it beats the serial version of the same model (when there is enough data to warrant parallelization, anyway).
There are J groups, and I split the data into J shards. I’m playing around with J=75 or so.

I noticed that it typically outperforms the serial model when only one chain/one core is specified in rstan, but does worse when 4 chains/4 cores is specified. I assume this is because it’s creating 4*12 threads, when only 12 HT cores are available. CPU usage on each core doesn’t hit 100% in this scenario, so I assume the thread management is imposing too high a cost.

Are there any guidelines on the optimal number of threads, shards, cores, etc? Should you only choose the number of threads such that num_threads * cores = number of CPUs?

Side note: I also noticed that rstan reports VERY inflated time estimates when parallelized map_rect is used - It says the gradient eval time is much higher, and the total estimation time is much higher, than it truly is. E.g., I had a model have a true time of 120 seconds (by a stopwatch next to me), but ~1000 seconds estimated by rstan. Not a big deal, but there should probably be a big fat warning that the estimated time is probably way overestimated, or the time estimation method should be altered.

eee · May 30, 2019, 12:06pm

Thought i was having the same issue. Running Stan in Docker (DigitalOcean) with 6 vCPU (Intel Xeon Gold 6140 @ 2300Hz, 25MB cache). Was hopping to have 16 vCPUs, but even 6 doesnt get more than 66% CPU utilization.

I made sure i had the correct Makevars file:
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -flto -ffat-lto-objects -Wno-unused-local-typedefs -Wno-ignored-attributes -Wno-deprecated-declarations in $HOME/.R/Makevars
and
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = ‘-march=native’)

I get 4 chains/4 cpu at 100%, but 2 cpu’s totally unused.

This seems to be a limitation of Stan as noticed by @Mike_Terrell :

It would save a lot of time if we could split 1 chain over 2 cpu’s (or even 3) to hopefully split the sampling time. But as @Bob_Carpenter points out:

sakrejda · May 30, 2019, 12:52pm

How are you hoping things would work? There is a function that lets you parallelize arbitrary parts of your Stan code

increasechief · May 30, 2019, 1:00pm

I have seen recommendations, in parallel computing literature it is suggested that the numbers used ought to be powers of 2.

eee · May 30, 2019, 1:10pm

Splitting the code seems to be possible:

If we used more, future would automatically dispatch different chunks of this code across different computers and then reassemble them locally. Anything you can do with furrr or future can now be done remotely.

Here’s a real-life example of using Stan to estimate a bunch of models on a cluster of two 4-CPU, 8 GB RAM machines. It uses the andrewheiss/docker-donors-ngo-restrictions Docker image because it already has Stan and family pre-installed.

He seems to be using a similar approach as this:

sakrejda · May 30, 2019, 1:21pm

Stan is creating two parallel evaluations (the value and the gradient). The gradient is a tree structure that can’t be parallelized in a simple fashion so the technical part, while being worked on, does not have an obvious solution.

Topic		Replies	Views
Map_rect, rstan, and multiple chains RStan rstan	3	895	January 22, 2019
Map_rect spawns too many threads than requested Modeling rstan , performance	13	804	January 25, 2021
Map_rect concurrent about to land Developers math	34	2118	July 23, 2018
Map-Reduce examples? Modeling	5	1270	March 2, 2019
Performance of Parallel Computation with map_rec() General	3	514	September 20, 2019

Optimal num_stan_threads when using multiple chains

Related topics