Feedback on installing and using MPI on a computer cluster

I wanted to give feedback on my use of cmdstan 2.18.0 with MPI on a relatively big computer cluster. I also detail the solutions to some problems I encountered during installation, which might be of use to other people like me (not that comfortable with this kind of stuff). The computer cluster I have access to belongs to the University of Bern and runs CentOS 7.5.1804.


Installation was a bit difficult, even with the help of the cluster admins. It’s a bit over my head, but at first there were problems with conflicting versions of Boost, and then a missing header file (pyconfig.h), which was eventually solved by installing the python-devel package. So overall, what worked for me was:

  • loading the module with the right version of Boost
  • installing python-devel
  • creating a file named cmdstan-2.18.0/make/local with the following lines:
  • running make build -j24

Compiling models

The installed cmdstan version was then able to compile the example model in “examples/bernoulli”. I then wanted to run some tests using Daniel Lee’s ODE examples. At first I could not compile models that include map_rect(), because of an undefined reference to 'pthread_create' error at link time. After a bit of googling, I managed to find a solution by adding LDFLAGS= -pthread -lpthread to make/local.
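For anyone who wants to script this step, here is a minimal sketch of how make/local could be written from the shell. Only the LDFLAGS line comes from this thread. The STAN_MPI and CXX lines are assumptions based on what the Stan MPI wiki describes (the thread also mentions CC being used in place of CXX on older instructions), and CMDSTAN_DIR is a hypothetical variable for the install location.

```shell
# Hypothetical install location; adjust to wherever cmdstan was unpacked.
CMDSTAN_DIR="${CMDSTAN_DIR:-cmdstan-2.18.0}"
mkdir -p "$CMDSTAN_DIR/make"

# Write make/local (note: this overwrites any existing file).
cat > "$CMDSTAN_DIR/make/local" <<'EOF'
# Assumption: MPI switch and compiler wrapper as described on the Stan MPI wiki
STAN_MPI=true
CXX=mpicxx
# From this thread: fixes "undefined reference to 'pthread_create'"
LDFLAGS += -pthread -lpthread
EOF

# Show the result so it can be checked before rebuilding.
cat "$CMDSTAN_DIR/make/local"
# Then rebuild from inside $CMDSTAN_DIR, e.g.: make build -j24
```

Whether the pthread flags are needed seems to depend on the system, so it may be worth trying the build without them first.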

Time gains

I sampled one chain of 2000 iterations from Daniel’s SHO model on a single node (24 CPUs, 64 GB RAM), running it five times each with and without parallelization. Sampling took {405,422,424,446,427} seconds without MPI (sho_fit_multiple.stan), and with MPI (sho_fit_multiple_parallel.stan) that time was reduced to {322,330,309,328,335} seconds. So on average a reduction of 24%.
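The arithmetic behind that figure can be reproduced with a short shell sketch. The timings are the ones quoted above; the helper name mean and the variable names are made up for illustration.

```shell
# Wall-clock sampling times in seconds, copied from the post
serial="405 422 424 446 427"      # sho_fit_multiple.stan (no MPI)
parallel="322 330 309 328 335"    # sho_fit_multiple_parallel.stan (MPI)

# Average the whitespace-separated numbers passed as arguments
mean() {
  echo "$@" | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.1f", s / NF }'
}

m_serial=$(mean $serial)
m_parallel=$(mean $parallel)

# Relative reduction in run time (percent) and the equivalent speedup factor
reduction=$(awk -v a="$m_serial" -v b="$m_parallel" 'BEGIN { printf "%.1f", 100 * (a - b) / a }')
speedup=$(awk -v a="$m_serial" -v b="$m_parallel" 'BEGIN { printf "%.2f", a / b }')

echo "mean without MPI: ${m_serial}s, with MPI: ${m_parallel}s"
echo "reduction: ${reduction}%, speedup: ${speedup}x"
```

This gives means of 424.8 s and 324.8 s, a reduction of 23.5% (24% after rounding) and a speedup of about 1.31x.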

It is a bit less than I expected, but I’m not completely sure whether I’m getting the most out of MPI here, or whether there is a way to monitor CPU usage. My next step is to implement map_rect() in my own model, which has more data, more complex ODEs and a hierarchical structure. Maybe the difference will be larger in that case. If there is any interest I will continue to post the results here.

Anyway I’d like to thank the Stan developers for implementing such cutting-edge methods!


Thanks very much for sharing this. I had been trying to get this running on an up-to-date Ubuntu 18.04 system and encountering the following error (pasted here in case it helps anyone else with the problem find this thread):

/usr/bin/ld: stan/lib/stan_math/lib/gtest_1.7.0/src/gtest-all.o: undefined reference to symbol 'pthread_key_delete@@GLIBC_2.2.5'

The pthread-related LDFLAGS were the missing link. For completeness, my working make/local looks like the following:

LDFLAGS += -pthread -lpthread 

Same as yours, except CXX instead of CC. I saw this was recently changed on the MPI wiki page, and this issue suggests that (some? all?) bits of Stan now use CXX instead (although neither the history on that issue nor a Discourse search makes it clear that this is universally true, so beware… I’m just a C++-fearing, cargo-cult monkey, bashing my keyboard until things mysteriously work). I am also not sure whether the MPI wiki page should be updated with these LDFLAGS, or whether this is just an issue specific to some systems.

Anyway, thanks again for sharing your solution. And to the ever-amazing Stan dev team for bringing this feature. Ever since I heard @Bob_Carpenter talk about MPI-powered map_rect in Helsinki, I’ve been dying to get this working. It is a real game-changer!

I’m glad I could be useful!