Here is a comparison of speed gains through MPI and threading in Stan:
I compared the models by fitting beta-binomial regression models with
K = 10 predictors and
N = 1000, 5000, 10000, and 20000 rows. The basic model used one core, threading and MPI used 4, 8 or 16 cores, each time the same number of shards as cores. I did not further investigate the optimal number of shards!
The basic model took 63, 320, 661, 1267 seconds for
N = 1000, 5000, 10000, and 20000, respectively (1000 warmup and 1000 post warmup samples). The table shows the proportion of the time of the basic model the threading and MPI analyses took. 1
|4 shards, MPI
|4 shards, Threading
|8 shards, MPI
|8 shards, Threading
|16 shards, MPI
|16 shards, Threading
I appreciate any feedback.
@wds15 I remember you had mentioned a reason that threads were slower than MPI but I forget - did it have to do with the way the MPI implementation does a more efficient autodiff node construction? Is it worth making an issue to port that to threads?
I think that the MPI version is faster as it does not rely on a
thread_local ad stack. The problem with the
thread_local thing is that we rely on the compiler implementation to make this thing happen. For example,
clang++ do not implement a thread pool approach for the
async facility which we use. So I would hope that a better implementation of this (using a thread pool) should make the performance gap of MPI and threading smaller. To test this I wanted to make a few runs using Intels
icpc compiler as this one is supposed to have a good
async implementation. However, right now the Intel compilers are not up to compile stan programs as we trigger some weird compiler bugs.
The reason as to why I think that a thread pool will help us is that MPI is effectively a thread pool when executed on a given machine.
I have been having an issue getting a meaningful performance improvement from MPI, compared to threading on Cmdstan 2.18.1. My model is an ODE model fitted to hierarchiral data (18 subjects as a test).
The puzzling thing is that there is a significant improvement during the gradient evalution stage: 6.04 seconds (for threading) comapred to 0.25 seoconds (for MPI). However, the total run time (100 warmup and 10 sampling as a test) is faster for threading on a single core (2.8GHz) on my local system (~11hrs) than for MPI using 160 cores (320 cores with hyperthreading; 2.4GHz per core), spread across 4 nodes (~13hrs). I checked for the core usage and I have confirmed that all 320 cores were in use at 100% capacity.
I’m wondering if there is something wrong with the MPI configuration on my cluster system, but I find it strange that my program runs fine (except for the speed) in the MPI setup as well. Since my experience of working with servers is limited, I got assitance from a system admin to set up MPI after reading the instruction (https://github.com/stan-dev/math/wiki/MPI-Parallelism) as follows:
# Load the modules
module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git
# Clone the developer version
git clone https://github.com/stan-dev/cmdstan.git —recursive
# Modify make/local
echo "CXXFLAGS+=-pthread" > ./make/local
echo "LDLIBS+=-lpthread" >> ./make/local
echo "STAN_MPI=true" >> ./make/local
echo "CXX=mpicxx" >> ./make/local
# build stan
make build -j20
# compile my_model
The cluster uses Slurm for scheduling and my jobscript looks as follows
#SBATCH --job-name stan
module load gcc/7.3.0 openmpi/3.1.1 python/2.7.15 git
mpirun --oversubscribe ./my_model method=sample num_warmup=100 num_samples=10 data file=my_input.R output file=$SLURM_SUBMIT_DIR/output.csv
I would be very grateful if you could give me some pointers on this
Not sure what “–oversubscribe” does, but that does not sound useful.
- avoid the use of hyper threading to start with
- only run this thing on a single computer, but with multiple processes
- 320 cores does sound a lot to me. I used once 80 cores for a 3 day run and got things down to 1h runtime… but 320 cores sounds way to much, I think
- Also… how do you split your data into shards? If you have 18 subjects then split by subject then map_rect can only take advantage of 18 cores. Not more.
I hope that helps.
You gave me a lot of food for thought.
I split my data using map_rect such that each shard computes the likelihood per subject. I thinking I was overlooking the point that it is pointless to add cores beyond the number of shards.
Also, I’m wondering if my observation is a testament to how well threading performs as long as the number of shards is not too high.
I’m still a little curious why I see that gradient evaluation is much faster on MPI than threading, however.
Thank you for your input.
MPI (right now) is faster than threading by about 20% for internal reasons (when all things are being equal).
The speed penalty for threading may go away eventually.
I am pretty happy with threading, I use 10x3 threads on a 46 cores computer (I have a hierarchical model with 60K elements, each of them having 100 replicates already vectorised).
Since we have a HPC, I would like to heavily increase the amount of threads across machines. For doing this I guess I need to use MPI and NOT threading.
If for my threading I define
CXX14FLAGS += -O3
CXX14FLAGS += -DSTAN_THREADS
CXX14FLAGS += -pthread
How can I specify labels to use MPI instead (I tried to read some doc/posts but I would like to avoid naive errors as much as possible)?
First: You need CmdStan for getting MPI to work.
Unfortunately there is a bug in the released MPI makefiles. You may grab the file here:
models.txt (1.2 KB)
and overwrite the
make/models file with this one. This should fix the makefile issue.
make/local will most likely look like this for you:
And that should be it! The makefiles are fixed in the current develop version (and as such will be fine with the next 2.19).
I am glad to hear
map_rect with parallelism helps in your applications.
Will I be able to use rstan in the future for MPI (i.e., ETA; not threading)?
It is a killer in genetics!
I am really not sure what plans are from @bgoodri in this regard, but I doubt that MPI can be used under R in a straightforward way. However, you can always run chains with cmdstan and read them into R using
read_stan_csv. As a little motivation: You can expect that your program will run faster with MPI parallelism with the 2.18 version of Stan. The speedup depends on your program specifics and can range between 10%-65%!
map_rect is a killer in PK/PD modelling as well… hours instead of days inference times makes a difference in the life of a modeller.