Here is a comparison of speed gains through MPI and threading in Stan:
Summary:
I compared the approaches by fitting beta-binomial regression models with K = 10 predictors and N = 1000, 5000, 10000, and 20000 rows. The basic model used one core; threading and MPI used 4, 8, or 16 cores, each time with the same number of shards as cores. I did not investigate the optimal number of shards any further!
The basic model took 63, 320, 661, and 1267 seconds for N = 1000, 5000, 10000, and 20000, respectively (1000 warmup and 1000 post-warmup samples). The table shows the run time of the threading and MPI analyses as a proportion of the basic model's run time.
@wds15 I remember you mentioned a reason that threads were slower than MPI, but I forget what it was: did it have to do with the MPI implementation doing a more efficient autodiff node construction? Is it worth making an issue to port that to threads?
I think that the MPI version is faster because it does not rely on a thread_local AD stack. The problem with the thread_local approach is that we depend on the compiler implementation to make it perform well. For example, g++ and clang++ do not implement a thread-pool approach for the async facility which we use. So I would hope that a better implementation (using a thread pool) should make the performance gap between MPI and threading smaller. To test this I wanted to make a few runs using Intel's icpc compiler, as it is supposed to have a good async implementation. However, right now the Intel compilers are not able to compile Stan programs, as we trigger some weird compiler bugs.
The reason I think a thread pool will help us is that MPI effectively behaves like a thread pool when executed on a single machine.
I have been having an issue getting a meaningful performance improvement from MPI compared to threading on CmdStan 2.18.1. My model is an ODE model fitted to hierarchical data (18 subjects as a test).
The puzzling thing is that there is a significant improvement during the gradient evaluation stage: 6.04 seconds (for threading) compared to 0.25 seconds (for MPI). However, the total run time (100 warmup and 10 sampling iterations, as a test) is faster for threading on a single core (2.8 GHz) on my local system (~11 hrs) than for MPI using 160 cores (320 cores with hyperthreading; 2.4 GHz per core) spread across 4 nodes (~13 hrs). I checked the core usage and confirmed that all 320 cores were in use at 100% capacity.
I'm wondering if there is something wrong with the MPI configuration on my cluster system, but I find it strange that my program runs fine (except for the speed) in the MPI setup as well. Since my experience of working with servers is limited, I got assistance from a system admin to set up MPI after reading the instructions (https://github.com/stan-dev/math/wiki/MPI-Parallelism) as follows:
# Load the modules
module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git
# Clone the developer version
git clone https://github.com/stan-dev/cmdstan.git --recursive
cd cmdstan
# Modify make/local
echo "CXXFLAGS+=-pthread" > ./make/local
echo "LDLIBS+=-lpthread" >> ./make/local
echo "STAN_MPI=true" >> ./make/local
echo "CXX=mpicxx" >> ./make/local
# build stan
make build -j20
# compile my_model
make my_model_dir/my_model
The cluster uses Slurm for scheduling and my jobscript looks as follows
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=320
#SBATCH --time=15:45:00
#SBATCH --job-name stan
#SBATCH --output=run_report_%j.txt
module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git
cd my_model_dir
mpirun --oversubscribe ./my_model method=sample num_warmup=100 num_samples=10 data file=my_input.R output file=$SLURM_SUBMIT_DIR/output.csv
I would be very grateful if you could give me some pointers on this
Not sure what "--oversubscribe" does, but that does not sound useful.
I would
- avoid the use of hyperthreading to start with
- only run this thing on a single computer, but with multiple processes
320 cores does sound like a lot to me. I once used 80 cores for a 3-day run and got things down to a 1 h runtime… but 320 cores sounds way too much, I think.
Also… how do you split your data into shards? If you have 18 subjects and split by subject, then map_rect can only take advantage of 18 cores, not more.
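To make that concrete, a jobscript along the lines below would match this advice. It is only a sketch: the 18 tasks assume one shard per subject, and the module names and CmdStan call are copied from your script, so adjust everything to your cluster.
#!/bin/bash
# Sketch only: one node, one MPI rank per shard (18 subjects), no hyperthreading.
# The task count and --hint=nomultithread are illustrative assumptions.
#SBATCH --nodes=1
#SBATCH --ntasks=18
#SBATCH --hint=nomultithread
#SBATCH --time=15:45:00
#SBATCH --job-name stan
#SBATCH --output=run_report_%j.txt
module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git
cd my_model_dir
mpirun -n 18 ./my_model method=sample num_warmup=100 num_samples=10 data file=my_input.R output file=$SLURM_SUBMIT_DIR/output.csv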
I split my data using map_rect such that each shard computes the likelihood for one subject. I think I was overlooking the point that it is pointless to add cores beyond the number of shards.
Also, I’m wondering if my observation is a testament to how well threading performs as long as the number of shards is not too high.
I’m still a little curious why I see that gradient evaluation is much faster on MPI than threading, however.
I am pretty happy with threading: I use 10x3 threads on a 46-core computer (I have a hierarchical model with 60K elements, each of them having 100 replicates, already vectorised).
Since we have an HPC cluster, I would like to heavily increase the number of threads across machines. To do this I guess I need to use MPI and NOT threading.
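For reference, this is my understanding of how the two build setups differ in make/local (a sketch only; the exact flags can depend on the CmdStan version, and the MPI lines are simply the ones from the wiki quoted above):
# Threading: shared memory, single machine
echo "CXXFLAGS += -DSTAN_THREADS -pthread" > ./make/local
echo "LDLIBS += -lpthread" >> ./make/local
# the number of worker threads is set per process at run time, e.g.
# STAN_NUM_THREADS=4 ./my_model sample ...

# MPI: one process per shard, can span machines
echo "STAN_MPI=true" > ./make/local
echo "CXX=mpicxx" >> ./make/local
# the model is then launched through mpirun, e.g.
# mpirun -n <number of shards> ./my_model sample ...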
I am really not sure what @bgoodri's plans are in this regard, but I doubt that MPI can be used under R in a straightforward way. However, you can always run chains with CmdStan and read them into R using read_stan_csv. As a little motivation: you can expect that your program will run faster with MPI parallelism with the 2.18 version of Stan. The speedup depends on your program's specifics and can range between 10% and 65%!
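For example, something along these lines (chain count, seed, data file name, and the -n value are placeholders; -n should be at most the number of shards):
# run a few chains of the MPI-enabled CmdStan model and collect the CSV output
for i in 1 2 3 4; do
  mpirun -n 8 ./my_model sample num_warmup=1000 num_samples=1000 \
    id=$i random seed=1234 \
    data file=my_input.R output file=output_$i.csv
done
# in R the chains can then be combined with
# fit <- rstan::read_stan_csv(paste0("output_", 1:4, ".csv"))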
Awesome!
map_rect is a killer feature in PK/PD modelling as well… inference times of hours instead of days make a difference in the life of a modeller.