I am trying to test MPI parallelism as described here on a (slurm) cluster, but the test ends in an error.
This is the procedure I used:
Download and untar cmdstan 2.18.0 and load gcc/6.3.0 and openmpi.gnu/2.1.0.
Then:
cd cmdstan-2.18.0
make build -j4
cd stan/lib/stan_math/
# add lines `STAN_MPI=true` and `CXX=mpicxx` to make/local in the stan_math folder
make clean-all
./runTests.py test/unit/math/prim/mat/functor/map_rect_test.cpp
At the end of the test I get the error:
make: clang++: Command not found
make: *** [bin/math/prim/arr/functor/mpi_cluster_inst.o] Error 127
make -j1 test/unit/math/prim/mat/functor/map_rect_test failed
The changes you made to stan/lib/stan_math/make/local need to be made to the make/local that sits in the cmdstan root. Sorry if that was not clear from the instructions. To be doubly safe, please also set CC=mpicxx (the new makefiles in develop solve all this).
… but wait, for the stan-math tests you do need to edit stan/lib/stan_math/make/local as you did (and, as I say, maybe add CC=mpicxx). Each module (cmdstan and stan-math) has its own make/local, and both should align.
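Concretely, both make/local files (the one in the cmdstan root and the one in stan/lib/stan_math) could look like this (a sketch; the compiler wrapper name depends on your MPI installation):

# contents of make/local, identical in cmdstan/make/local and stan/lib/stan_math/make/local
STAN_MPI=true
CXX=mpicxx
CC=mpicxx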
If you are ok with using the develop version, then you may want to try out the cmdline develop version. I am afraid that there is another makefile bug in the released 2.18; that one is fixed in develop.
In a first attempt I had made the changes in make/local; I never tried to change both make files …
The reason I ended up changing stan/lib/stan_math/make/local is that runTests.py is located in stan/lib/stan_math/.
I think it would be useful to add to the wiki in which folder one needs to be when doing make clean-all (I now assume make/local, but I am not sure) and when using runTests.py.
I am happy to use the develop version and assume you refer to cmdstan when you write “cmdline develop version”.
Oh… no. make clean-all is always called in the top-level directory of whatever module you are in. The make sub-folder is only for configuring make.
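For example (a sketch, using the directory layout from the first post):

cd cmdstan-2.18.0 && make clean-all        # clean from the cmdstan root
cd stan/lib/stan_math && make clean-all    # or from the stan-math root when running its unit tests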
Yes, please use the develop version of cmdstan if that is an option. The makefile issues are all sorted there. In case you want to use the released 2.18 one, I think I wrote on Discourse somewhere how to make it work, but the current develop works out of the box to my knowledge.
Still, after installing the develop version of cmdstan I get the following error:
Traceback (most recent call last):
  File "./runTests.py", line 177, in <module>
    main()
  File "./runTests.py", line 173, in main
    runTest(t, inputs.run_all, mpi = stan_mpi, j = inputs.j)
  File "./runTests.py", line 123, in runTest
    command = "mpirun -np {} {}".format(j, command)
ValueError: zero length field name in format
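That ValueError is what Python 2.6 raises for empty {} placeholders in str.format (auto-numbering of fields only arrived in Python 2.7), so runTests.py is most likely being executed with an older Python. A workaround sketch, based on the make target and mpirun call the script would have issued, is to build and run the test by hand from stan/lib/stan_math:

make test/unit/math/prim/mat/functor/map_rect_test
mpirun -np 4 test/unit/math/prim/mat/functor/map_rect_test    # pick -np to match your allocation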
The messages also indicate that boost was successfully built:
The Boost C++ Libraries were successfully built!
The following directory should be added to compiler include paths:
/cluster/home/guidopb/software/cmdstan-develop/stan/lib/stan_math/lib/boost_1.66.0
The following directory should be added to linker library paths:
/cluster/home/guidopb/software/cmdstan-develop/stan/lib/stan_math/lib/boost_1.66.0/stage/lib
Download zip files of the develop versions of cmdstan, stan, and the math library from GitHub; unzip, put stan in cmdstan/stan and math in cmdstan/stan/lib/stan_math
Put lines STAN_MPI=true and CXX=mpicxx into make/local and cmdstan/stan/lib/stan_math/make/local
In the top cmdstan directory:
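A sketch of that step, assuming the same build targets used for 2.18.0 earlier in the thread:

make clean-all
make build -j4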
Not sure about the python version… can you build the bernoulli example in cmdstan?
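For reference, a rough check of the bernoulli example and of the Python version (a sketch of the standard cmdstan commands, run from the cmdstan root):

python --version                    # the version runTests.py is picking up
make examples/bernoulli/bernoulli
./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R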
This thread, Linear, parallell regression, has a lot of example code in it which runs Stan with MPI (ok, we used threading there, but that does not really matter).
I got threading to work (using Richard McElreath’s tutorial). But the performance hit per core seemed severe, so I wanted to try MPI.
I am now able to compile the models; the problem was indeed with the python script. I haven’t solved this, but when I compile and run the first math test manually, it turns out OK.
I can also compile the binomial model and will try to compare threading and MPI later this week.
Here are some results using model and data from Richard McElreath’s tutorial on threading. The model is a logistic regression with N = 125,000 and K = 2.
I did the tests on a slurm-cluster with dual Intel E5-2670 (Sandy Bridge) processors running at 2.6 GHz, organised in computing nodes with 16 cores. As compiler I used gcc/6.3.0; for MPI I used openmpi/2.1.0.
For a reliable comparison I would run each model multiple times, but for now I am reporting times from one run only:
Standard logistic regression:
( time ./logistic0 sample num_warmup=500 num_samples=500 data file=redcard_input.R )
real 2m21.740s
user 2m21.695s
sys 0m0.025s
Threading with 19 shards (STAN_NUM_THREADS=-1 uses all available cores):
export STAN_NUM_THREADS=-1
time ./logistic2 sample num_warmup=500 num_samples=500 data file=redcard_input.R
real 1m43.959s
user 14m33.019s
sys 2m18.842s
MPI with 19 shards:
( time mpirun -np 16 ./logistic2mpi sample num_warmup=500 num_samples=500 data file=redcard_input.R )
real 1m26.552s
user 0m0.027s
sys 0m0.075s
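In case it is useful, a run like the MPI one above could be submitted through slurm with a batch script along these lines (a sketch only: the time limit is a placeholder, partition/account options are omitted, and the module names are the ones mentioned above):

#!/bin/bash
#SBATCH --nodes=1              # one of the 16-core nodes described above
#SBATCH --ntasks=16            # one MPI rank per core, matching -np 16
#SBATCH --time=01:00:00        # placeholder time limit
module load gcc/6.3.0 openmpi/2.1.0
mpirun -np 16 ./logistic2mpi sample num_warmup=500 num_samples=500 data file=redcard_input.R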
Threading resulted in 30% faster sampling (at the cost of using 16 cores instead of one).
MPI resulted in 40% faster sampling (at the cost of using 16 cores instead of one).
MPI is, as predicted by Sebastian, better. But it also seems clear that fitting a logistic regression of this size with MPI or threading does not help a lot, presumably because vectorisation already makes bernoulli_lpmf pretty efficient, and one loses some of the benefits of vectorisation when using MPI or threading (as Sebastian also mentioned somewhere on Discourse).
The vectorization in Stan is really efficient! If you want to see speedups for this case, then recode the model without vectorization - just for the fun of it.
Generally, I would never recommend MPI/threading for models which run only a few minutes (unless the structure of the model is really ideal for the approach, like ODE things). Once you get to >15 min running time (it really depends on the case), then you can start to throw in map_rect. The real killer application (from my view) is that map_rect gives us a very important property, which is scalability of the performance. So your data grows to large sizes - no problem, just throw in more hardware. Of course, at some point even a map_rect approach won’t scale enough any more (but then you can hopefully switch to a GPU).
I totally agree.
My motivation here is that I have regression models that take days to fit. Now I wanted to get some experience with map_rect, threading and MPI before trying the harder problems.
One situation where map_rect should be useful for models that do not take long to fit is when the vectorized ..._lpdf cannot be much faster than the non-vectorized version. I think this is, for example, the case for the beta-binomial.
I’ll try such models later and will report back here.