Cmdstan 2.18 MPI

Hi,

I wonder whether MPI speeds things up on Windows 10, or should I move to Linux?

After reworking a Bayesian neural network model to use MPI, I observed a severe slowdown on Windows 10.

It would be great if somebody could comment on how to improve the model and give tips on how to run it on Linux. I have 126000 observations, so I wonder how many shards will make a difference.

NNsigma.2.18.mpi.stan (3.2 KB)

I don’t think MPI on Windows is supported:

The CmdStan manual gives some syntax for running models on Linux.

Thanks a lot. I think I used your example as a basis when coding the BNN model. I wonder whether multiple append_row calls are efficient enough or whether there is a better way? The model is included in my previous post. I am also very curious how to choose the number of shards based on the number of nodes in the cluster.

Thanks for sharing! I’m planning to do something similar to the append_row part. But I think @wds15 could better answer regarding performance.

I am specifically interested in how to use shards on an academic cluster made up of dual 10-core Xeon nodes running CentOS 7.

For some reason, asking PBS for 5 nodes with 20 cores each didn’t give any improvement over 1 node with 20 cores, since the CmdStan implementation used only the first node and the number of cores specified by STAN_NUM_THREADS.

Thus I have two questions.

  1. How do I make CmdStan utilize all available cores?

  2. Which is more efficient:
    phi[1] = bias_first_h;
    phi[2] = beta_first_h;
    phi[3] = beta_output_h;
    phi[4] = sigma;
    phi = append_row(bias_first,
                     append_row(bias_output,
                                append_row(to_vector(beta_first),
                                           append_row(beta_output, phi))));

OR

phi[1:num_nodes] = bias_first;
phi[num_nodes + 1] = bias_output;
phi[start_beta_first:end_beta_first] = to_vector(beta_first);
phi[start_beta_output:end_beta_output] = beta_output;
phi[end_beta_output + 1] = bias_first_h;
phi[end_beta_output + 2] = beta_first_h;
phi[end_beta_output + 3] = beta_output_h;
phi[end_beta_output + 4] = sigma;

OR something else?

STAN_NUM_THREADS is irrelevant for MPI.

You should probably first figure out how you are supposed to call MPI programs on your cluster. Your admin should be able to tell you. This usually involves calling your Stan program via mpirun or mpiexec (depending on the MPI details).

Don’t worry so much about these options above.

There’s a map primitive called map_rect that can be used with MPI or with C++11 threads; all the thread options refer to the latter implementation (which runs on a single machine). MPI requires that you compile your program in a special way (the manual should tell you how) and then run it on your cluster using one of the tools Sebastian mentioned, depending on which flavor of the MPI runtime/library is installed on the cluster. For example, the Columbia University cluster has some instructions here, and hopefully your cluster has similar instructions or a help desk you can reach. It should be a common ask.
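As a rough sketch, the MPI workflow might look like the following (the model path, the data file name, and the mpicxx wrapper are placeholders; your cluster's compiler wrapper and launcher may differ):

cd cmdstan
echo "STAN_MPI=true" >> make/local     # enable MPI support in the math library
echo "CC=mpicxx" >> make/local         # MPI compiler wrapper (name may differ on your cluster)
make clean-all
make build
make path/to/my_model                  # builds the executable from path/to/my_model.stan
mpirun -n 20 ./path/to/my_model sample data file=my_data.R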

Thanks for the suggestions.

I still have some questions.

  1. Is the right way to compile CmdStan to put CXXFLAGS += -DSTAN_THREADS -pthread in make/local?

I probably didn’t find the right manual, because most of my knowledge is based on the Linear, parallell regression thread.

I compiled CmdStan with CXXFLAGS += -DSTAN_THREADS -pthread in make/local, as was suggested in that thread.

Yes, I used map_rect and it works. I was able to run 10, 15, and 20 shards, and I checked with htop that the corresponding number of threads was actually running. I didn’t use mpirun, though. The command line I was using was:
export STAN_NUM_THREADS=15
time ./NNsigma.2.18.mpi sample …

However, when I tried 100 shards I found that only the first node was active. All the others were idle.

Obviously, I also tried mpirun with 20 shards. There the progress output was repeated 20 times.

Please use either threading OR MPI. Using both at the same time may work, but that depends on the MPI and threading details of the platform.

The STAN_THREADS define is needed when you want to use threading; set STAN_MPI=true in make/local when you want to use MPI.

What you describe when running under MPI sounds as if MPI was not enabled correctly… which can easily be a consequence of the buggy makefiles in 2.18 (sorry for that). These are fixed by now on the develop branch of CmdStan (or just grab make/models from develop).
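A minimal sketch of such an MPI-only rebuild (the model path is a placeholder, and mpicxx stands in for whatever MPI compiler wrapper your cluster provides):

cd cmdstan
git fetch
git checkout develop -- make/models                # grab the fixed makefile from develop
printf 'STAN_MPI=true\nCC=mpicxx\n' > make/local   # MPI only, no STAN_THREADS
make clean-all
make build
make path/to/your_model                            # rebuild the model from scratch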

Now, when I use shards = 100 and start my job with mpiexec ./foo, I see 20 instances of foo on each node. Is this the expected behavior?

My goal is to cut the 14 h on 1 node with 20 cores down to 4 h on 5 nodes with 20 cores each. Since I have 12600 observations (240 inputs, 1 output), should I use MPI with map_rect and 100 shards? Does threading help in this case?

I recommend you start with threading… then you do not need to bother with the MPI stuff. Once that works with 20 cores, move to MPI. And yes, the point of MPI is that when you use 100 shards and have 20-core machines, you should end up using 5 machines with all 20 CPUs on each in use at the same time.
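Roughly, the progression would be (the data file name is a placeholder; ./foo is the compiled model as in your posts):

# step 1: threading on one 20-core node (binary built with STAN_THREADS)
export STAN_NUM_THREADS=20
./foo sample data file=foo.data.R

# step 2: MPI across 5 nodes x 20 cores, one process per shard (binary built with STAN_MPI=true)
mpiexec -n 100 ./foo sample data file=foo.data.R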

Yes, I started with threading. Since I am not an expert, I compiled the following notes (I have gcc version 4.8.5). Could you please glance at them in case I made a mistake?

I did the following steps to recompile cmdstan:
make clean-all
make/local contains CXXFLAGS += -DSTAN_THREADS -pthread
cd cmdstan/make
git fetch
git checkout develop models
cd …
make build
make mpi10/foo
cd mpi10

In the job submission script I used:
export STAN_NUM_THREADS=10
time ./foo sample …
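For completeness, a sketch of what such a PBS script could look like (the directives, walltime, and data file name here are illustrative, not my exact script):

#!/bin/bash
#PBS -l nodes=1:ppn=20
#PBS -l walltime=04:00:00
cd $PBS_O_WORKDIR
export STAN_NUM_THREADS=10           # one thread per shard in this test
time ./foo sample data file=foo.data.R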

I have set shards to 10.

On the node assigned to me by PBS I found only one foo process, and only 10 cores were used. Is this expected?

I ran the test from the Linear, parallell regression thread and the timing scales almost linearly (I have omitted the seconds):
2 shards 34m
4 shards 16m
5 shards 14m
10 shards 6m

I also added a separate folder for the MPI version of CmdStan (eventually I have to use multiple nodes). Since I was getting
MPI auto-detection failed: unknown wrapper compiler mpic++
I assembled the following notes to build CmdStan:
git clone https://github.com/stan-dev/cmdstan.git --recursive
make clean-all

for threading support
make/local contains
CXXFLAGS += -DSTAN_THREADS -pthread

for MPI support
make/local contains
STAN_MPI=true
CC=mpicxx
stan/lib/stan_math/lib/boostxxx/user-config.jam contains
using mpi : /apps/cent7/intel/impi/2017.1.132/bin64/mpicxx ;
instead of just using mpi ; (see the quick check sketched after these notes)

cd cmdstan/make
git fetch
git checkout develop models
cd …
make build
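(Before editing user-config.jam it can help to check which MPI wrapper the cluster actually provides, since the mpic++ vs. mpicxx naming differs between MPI distributions; this is just a sanity check, not an official build step:)

which mpic++ mpicxx        # see which wrapper names exist on the PATH
mpicxx --version           # confirm the wrapper invokes the expected compiler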

Now I have submitted the job with:
time mpiexec -n 100 ./foo sample…

On each node I had 20 foo instances. Is this as expected?

Yes, that is expected, of course.

What you could probably do, before submitting a 100-core job to the queue, is to manually call in a shell

mpiexec -n 5 …
mpiexec -n 10 …

your program and see if that has the desired effect (maybe with the linear example which seems to work for you with threading).
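Concretely, something like this (with a hypothetical compiled model ./linear and data file linear.data.R):

mpiexec -n 5 ./linear sample data file=linear.data.R output file=out_n5.csv
mpiexec -n 10 ./linear sample data file=linear.data.R output file=out_n10.csv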

Looks to me it is working for you, no?

It looks like it is working. The MPI results were interesting:
5 nodes 20 cores each 100m
10 nodes 20 cores each 122m
15 nodes 20 cores each 132m

It means more is not better. I’ll keep trying.

I have a question. When I compiled CmdStan, the Boost library build step gave:
MPI auto-detection failed: unknown wrapper compiler mpic++

so I had to do some tweaking:
stan/lib/stan_math/lib/boostxxx/user-config.jam contains
using mpi : /apps/cent7/intel/impi/2017.1.132/bin64/mpicxx ;
instead of just using mpi ;
and recompiled CmdStan. This time MPI was not ignored, but Boost told me:
The following directory should be added to compiler include paths:
/home/lmockus/mpi/cmdstan/stan/lib/stan_math/lib/boost_1.66.0
The following directory should be added to linker library paths:
/home/lmockus/mpi/cmdstan/stan/lib/stan_math/lib/boost_1.66.0/stage/lib

I forgot to do it and ran my experiments with 5, 10, and 15 nodes. Did I make a mistake? Or perhaps there are already Boost libraries on the system?

No, it should work. We hard-code the paths to the dynamic libraries during compilation (at least we try to).
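If the binary ever does complain about missing libboost_* shared libraries at run time, a possible fallback (using the stage/lib path Boost printed above) would be:

export LD_LIBRARY_PATH=/home/lmockus/mpi/cmdstan/stan/lib/stan_math/lib/boost_1.66.0/stage/lib:$LD_LIBRARY_PATH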

What about the tweaking I did re mpic++? Is there a more elegant approach?

Does it make sense that:
3 nodes 20 cores each - 148m
5 nodes 20 cores each - 100m
10 nodes 20 cores each - 122m
15 nodes 20 cores each - 132m

Maybe… I don’t know. What you did is what Boost recommends when your system is not easily auto-detected. That can mean many things.

It depends on your model and the data you are fitting… and the hardware this is running on. We have InfiniBand on our cluster, which should give very good scaling. On a standard Ethernet fabric, scaling will be much worse.

… but what is your 1-core running time without MPI?

I have a timing on Windows 10 of 92345 s without map_rect. Unfortunately the Windows 10 machine is faster, so I still have to do a proper comparison, but my guess is that the improvement is obvious. I will reformulate my PBPK problem for MPI to see whether it becomes solvable in less than a day. Fortunately I have 19 subjects, so one node with 20 cores will be enough. Maybe in that case just threading will be sufficient?

@linas thanks for sharing this model, as well as the original.

I’m trying to understand this. The original model is pretty straightforward; I’m just using this as an example to figure out the MPI stuff.

You’re introducing two new things, not included in the first model.

Can you please explain the bias corrections you included, mathematically?
Also, what do you mean by shards? Is this you chopping up the parameters to allocate on different cores, or what? I get nodes and layers, but now I’m thinking of broken glass or something.

thanks!

Threading will already get you very far. If you want the last bit of performance, then go with MPI… which doesn’t work on Windows, though. So yes, MPI is more efficient than threading.
