Comparing threading and MPI


Here is a comparison of speed gains through MPI and threading in Stan:

I compared the approaches by fitting beta-binomial regression models with K = 10 predictors and N = 1000, 5000, 10000, and 20000 rows. The basic model used one core; the threading and MPI runs used 4, 8, or 16 cores, each time with the same number of shards as cores. I did not further investigate the optimal number of shards!

The basic model took 63, 320, 661, and 1267 seconds for N = 1000, 5000, 10000, and 20000, respectively (1000 warmup and 1000 post-warmup samples). The table shows the run time of the threading and MPI analyses as a fraction of the basic model's run time for the same N.

analysis                N=1000   N=5000   N=10000   N=20000
4 shards, MPI             0.46     0.42      0.41      0.39
4 shards, Threading       0.40     0.45      0.65      0.52
8 shards, MPI             0.35     0.22      0.20      0.22
8 shards, Threading       0.53     0.43      0.41      0.30
16 shards, MPI            0.22     0.12      0.12      0.12
16 shards, Threading      0.22     0.15      0.14      0.13
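To make the fractions in the table easier to compare across core counts, they can be converted into a speedup (1 / fraction) and a parallel efficiency (speedup / cores). The following is a minimal sketch; the function names and the dictionary layout are my own choices for illustration, and only the numbers come from the table.

```python
# Fractions of the single-core run time, copied from the table above.
# Keys are (method, cores); values are for N = 1000, 5000, 10000, 20000.
N_VALUES = [1000, 5000, 10000, 20000]
FRACTIONS = {
    ("MPI", 4): [0.46, 0.42, 0.41, 0.39],
    ("Threading", 4): [0.40, 0.45, 0.65, 0.52],
    ("MPI", 8): [0.35, 0.22, 0.20, 0.22],
    ("Threading", 8): [0.53, 0.43, 0.41, 0.30],
    ("MPI", 16): [0.22, 0.12, 0.12, 0.12],
    ("Threading", 16): [0.22, 0.15, 0.14, 0.13],
}


def speedup(fraction):
    """Speedup relative to the single-core run: T_1 / T_p."""
    return 1.0 / fraction


def efficiency(fraction, cores):
    """Parallel efficiency: speedup divided by the number of cores."""
    return speedup(fraction) / cores


if __name__ == "__main__":
    for (method, cores), fractions in FRACTIONS.items():
        for n, frac in zip(N_VALUES, fractions):
            print(
                f"{cores:2d} cores, {method:9s}, N={n:5d}: "
                f"speedup {speedup(frac):5.2f}x, "
                f"efficiency {efficiency(frac, cores):4.0%}"
            )
```

For example, a fraction of 0.12 on 16 cores corresponds to a speedup of roughly 8.3x, which is about half of the ideal 16x.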

I appreciate any feedback.


@wds15 I remember you had mentioned a reason that threads were slower than MPI but I forget - did it have to do with the way the MPI implementation does a more efficient autodiff node construction? Is it worth making an issue to port that to threads?


I think that the MPI version is faster because it does not rely on a thread_local AD stack. The problem with thread_local is that we rely on the compiler's implementation to make it work. For example, g++ and clang++ do not implement a thread pool for the async facility which we use. So I would hope that a better implementation (using a thread pool) would shrink the performance gap between MPI and threading. To test this, I wanted to make a few runs using Intel's icpc compiler, as it is supposed to have a good async implementation. However, right now the Intel compilers cannot compile Stan programs, as we trigger some weird compiler bugs.

The reason as to why I think that a thread pool will help us is that MPI is effectively a thread pool when executed on a given machine.
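As a rough illustration of the thread pool argument, here is a small Python analogy; it is not Stan's C++ implementation, and whether thread setup is really what costs the time is only the hypothesis stated above. The sketch counts how often a per-thread "AD stack" (a plain list here) has to be created when every task gets a fresh thread, versus when tasks are handed to a fixed pool of workers. All names in it are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()      # per-thread storage, analogous to thread_local
_init_count = 0                 # how many per-thread stacks were created
_count_lock = threading.Lock()  # protects _init_count


def get_stack():
    """Return this thread's private stack, creating it on first use."""
    global _init_count
    if not hasattr(_local, "stack"):
        _local.stack = []
        with _count_lock:
            _init_count += 1
    return _local.stack


def evaluate_shard(shard_id):
    """Stand-in for evaluating one shard; records work on the thread's stack."""
    get_stack().append(shard_id)
    return shard_id


def run_with_fresh_threads(n_tasks):
    """One new thread per task: every task pays for a new stack."""
    global _init_count
    _init_count = 0
    threads = [
        threading.Thread(target=evaluate_shard, args=(i,)) for i in range(n_tasks)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return _init_count


def run_with_pool(n_tasks, n_workers):
    """Fixed pool of workers: stacks are created at most once per worker."""
    global _init_count
    _init_count = 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(evaluate_shard, range(n_tasks)))
    return _init_count


if __name__ == "__main__":
    print("stacks created, fresh threads:", run_with_fresh_threads(18))
    print("stacks created, pool of 4:    ", run_with_pool(18, 4))
```

With fresh threads the number of stack initializations equals the number of tasks, while with a pool it is bounded by the number of workers, which is the sense in which long-lived MPI processes behave like a pool.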


I have been having trouble getting a meaningful performance improvement from MPI compared to threading on CmdStan 2.18.1. My model is an ODE model fitted to hierarchical data (18 subjects as a test).

The puzzling thing is that there is a significant improvement during the gradient evaluation stage: 6.04 seconds (for threading) compared to 0.25 seconds (for MPI). However, the total run time (100 warmup and 10 sampling iterations as a test) is shorter for threading on a single core (2.8GHz) on my local system (~11hrs) than for MPI using 160 cores (320 cores with hyperthreading; 2.4GHz per core), spread across 4 nodes (~13hrs). I checked the core usage and confirmed that all 320 cores were in use at 100% capacity.

I’m wondering if there is something wrong with the MPI configuration on my cluster system, but I find it strange that my program runs fine (except for the speed) in the MPI setup as well. Since my experience of working with servers is limited, I got assistance from a system admin to set up MPI after reading the instructions, as follows:

# Load the modules
module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git

# Clone the developer version
git clone --recursive

cd cmdstan

 # Modify make/local
 echo "CXXFLAGS+=-pthread" > ./make/local
 echo "LDLIBS+=-lpthread" >> ./make/local
 echo "STAN_MPI=true" >> ./make/local
 echo "CXX=mpicxx" >> ./make/local

 # build stan
 make build -j20

 # compile my_model
 make my_model_dir/my_model

The cluster uses Slurm for scheduling and my jobscript looks as follows

 #SBATCH --nodes=4
 #SBATCH --ntasks=320
 #SBATCH --time=15:45:00
 #SBATCH --job-name stan
 #SBATCH --output=run_report_%j.txt

 module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git

 cd my_model_dir

 mpirun --oversubscribe ./my_model method=sample num_warmup=100 num_samples=10 data file=my_input.R output file=$SLURM_SUBMIT_DIR/output.csv

I would be very grateful if you could give me some pointers on this

Thank you,


Not sure what "--oversubscribe" does, but that does not sound useful.

I would

  • avoid the use of hyper threading to start with
  • only run this thing on a single computer, but with multiple processes
  • 320 cores does sound like a lot to me. I once used 80 cores for a 3 day run and got things down to 1h runtime… but 320 cores sounds like way too much, I think
  • Also… how do you split your data into shards? If you have 18 subjects and split by subject, then map_rect can only take advantage of 18 cores. Not more.
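The shard limit in the last point can be made concrete with a small back-of-the-envelope sketch. It assumes that all shards cost the same and that there is no communication overhead, neither of which holds exactly in practice; the function names are hypothetical and not part of Stan.

```python
import math


def useful_workers(n_shards, n_workers):
    """Workers that can actually receive a shard; the rest sit idle."""
    return min(n_shards, n_workers)


def best_case_speedup(n_shards, n_workers):
    """Upper bound on speedup for equal-cost shards and zero overhead.

    The shards are processed in ceil(n_shards / n_workers) rounds, and
    each round takes as long as one shard.
    """
    rounds = math.ceil(n_shards / n_workers)
    return n_shards / rounds


if __name__ == "__main__":
    n_shards = 18  # one shard per subject, as in the question
    for n_workers in (1, 4, 9, 18, 80, 320):
        print(
            f"{n_workers:3d} workers: "
            f"{useful_workers(n_shards, n_workers):2d} busy, "
            f"best-case speedup {best_case_speedup(n_shards, n_workers):.1f}x"
        )
```

Under these assumptions, 18 shards cannot be sped up by more than 18x, so 320 workers offer no advantage over 18.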

I hope that helps.


You gave me a lot of food for thought.

I split my data using map_rect such that each shard computes the likelihood for one subject. I think I was overlooking the fact that it is pointless to add cores beyond the number of shards.

Also, I’m wondering if my observation is a testament to how well threading performs as long as the number of shards is not too high.

I’m still a little curious why I see that gradient evaluation is much faster on MPI than threading, however.

Thank you for your input.


MPI (right now) is faster than threading by about 20% for internal reasons (all else being equal).

The speed penalty for threading may go away eventually.