Understanding MPI and hierarchical models

Vanessa_Brown · December 13, 2019, 3:57pm

Hi all,

I’ve been using Stan for a while to fit hierarchical models, with good success (so thank you!). I recently got access to a computing cluster and have been looking into using MPI and map_rect to run my models on the cluster. The models I use generally take hours to days to run so the resulting speedup seems very useful. However, as someone new to MPI, I’m not sure I understand how the HMC estimation works with it.

My models generally estimate hierarchical data that have relatively little data at lower levels (I look at peoples’ performance on decision-making tasks, where the first level of data is choices on individual trials and the second level is subjects, where trials are nested within subjects), and so benefit greatly from partial pooling. If I run with only a few higher-level observations (i.e., only a few subjects’ data), models don’t converge or have other estimation issues. This makes me worry about splitting data into small shards with MPI because it seems that the model will not converge if it is only using a shard’s worth of data, but from my reading about MPI it seems that the information is shared among nodes during estimation so that this is not an issue. Or, do I need to make sure the amount of data in each shard is large enough to estimate the model on its own, or that enough shards are running at once to be able to share enough data across the nodes that are running? Any clarification, or direction to (relatively non-technical) resources I could use to learn about this, would be appreciated. Thank you!

wds15 · December 13, 2019, 4:04pm

The likelihood you code using MPI will be the very same without it. So nothing changes in this regard, but maybe I missed something.

Vanessa_Brown · December 13, 2019, 4:13pm

thank you! so even if the model is run on parts of the data asynchronously, it ends up affecting the log probability the same.

ETA: how does that work with warmup then?

wds15 · December 13, 2019, 4:20pm

There is no difference at all. I mean, you can write your program using map_rect and it will code exactly the same likelihood. Also: You can run the very same program with map_rect using

no parallelisation
parallelisation with threads
parallelisation with MPI

all 3 things will give you the same model being fit.

(for MPI the static data will get shipped to each node and the parameters are shipped for each leapfrog transition to the nodes while the result comes back from the nodes)

Vanessa_Brown · December 13, 2019, 5:09pm

I see, thank you for explaining. I think where part of my confusion is coming from is understanding when information is getting passed among nodes with MPI. If this is occurring at each leapfrog transition, does that mean that all the shards need to be able to run simultaneously? E.g., if I have 28 nodes, I should run a max of 28 shards (versus running more, smaller shards, but not at the same time). Or I am I still missing something here?

Bob_Carpenter · December 14, 2019, 7:42pm

Right. When map_rect is parallelized, the evaluation of the log density and its gradient are what gets parallelized. So that’s every leapfrog step.

With map_rect, you will get the same log density and gradients running in parallel using MPI, in parallel using multi-threading, or sequentially. The only thing MPI or multi-threading do is run map_rect in parallel. Stan can run map_rect in parallel safely because Stan programs have no user-facing side effects and it can synchronize the autodiff behind the scenes.

If there are N shards in a call to map_rect, then we can at most N-way parallelism. The most efficient thing to do is usually to set N to roughly the number of cores. I say “roughly” because of issues like hyperthreading and memory contention.

Vanessa_Brown · December 15, 2019, 10:22pm

That helps, thank you! I’m looking forward to trying it out.

Topic		Replies	Views
Running MPI General performance	9	1950	August 24, 2018
Map_rect/MPI & Performance with Many Shards Modeling	6	860	September 11, 2019
Map_rect causing substantial slowdown; trying to understand how to fix Modeling cmdstanr , paralellization	13	882	July 24, 2020
Efficient use of map_rect for parallelizing ODE systems CmdStan	10	776	April 28, 2020
Measure HMC Time Usage General	3	467	October 23, 2019

Understanding MPI and hierarchical models

Related topics