Understanding MPI and hierarchical models

Hi all,

I’ve been using Stan for a while to fit hierarchical models, with good success (so thank you!). I recently got access to a computing cluster and have been looking into using MPI and map_rect to run my models there. The models I use generally take hours to days to run, so the potential speedup would be very useful. However, as someone new to MPI, I’m not sure I understand how the HMC estimation works with it.

My models are generally fit to hierarchical data with relatively little data at the lower levels (I study people’s performance on decision-making tasks, where the first level of data is choices on individual trials and the second level is subjects, with trials nested within subjects), so they benefit greatly from partial pooling. If I run them with only a few higher-level observations (i.e., only a few subjects’ data), they don’t converge or have other estimation issues. This makes me worry about splitting the data into small shards with MPI, because it seems the model would not converge if it only used a shard’s worth of data. From my reading about MPI, though, it seems that information is shared among nodes during estimation, so this shouldn’t be an issue. Or do I need to make sure that each shard contains enough data to estimate the model on its own, or that enough shards are running at once to share enough data across the active nodes? Any clarification, or direction to (relatively non-technical) resources I could use to learn about this, would be appreciated. Thank you!

The likelihood you code using MPI will be exactly the same as without it. So nothing changes in this regard, but maybe I have missed something.
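To make that concrete, here is a minimal sketch of the kind of model you describe (a hierarchical logistic model of per-trial choices) written with one shard per subject. The names (subject_ll) and the data layout are just illustrative assumptions, not your actual model; the point is that the map_rect call adds exactly the same term to the target as the plain loop shown in the comment:

  functions {
    // Log likelihood for one shard (here, one subject).
    // phi carries the population-level parameters [mu, sigma],
    // theta the subject's standardised effect, x_i the subject's choices.
    vector subject_ll(vector phi, vector theta, array[] real x_r, array[] int x_i) {
      real alpha = phi[1] + phi[2] * theta[1];   // subject-level intercept
      return [bernoulli_logit_lpmf(x_i | alpha)]';
    }
  }
  data {
    int<lower=1> S;                       // subjects (one shard each)
    int<lower=1> T;                       // trials per subject
    array[S, T] int<lower=0, upper=1> y;  // choices
  }
  transformed data {
    array[S, 0] real x_r;                 // no real-valued shard data needed here
  }
  parameters {
    real mu;
    real<lower=0> sigma;
    vector[S] z;                          // standardised subject effects
  }
  model {
    array[S] vector[1] theta;
    for (s in 1:S)
      theta[s][1] = z[s];

    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);
    z ~ std_normal();

    // Identical to the serial version:
    //   for (s in 1:S)
    //     target += bernoulli_logit_lpmf(y[s] | mu + sigma * z[s]);
    target += sum(map_rect(subject_ll, [mu, sigma]', theta, x_r, y));
  }

Whether that map_rect call runs serially, with threads, or over MPI, it contributes the same log-probability term, so the posterior (and the partial pooling through mu and sigma) is unchanged.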

Thank you! So even if the model is run on parts of the data asynchronously, it ends up contributing to the log probability in the same way.

ETA: how does that work with warmup then?

There is no difference at all. I mean, you can write your program using map_rect and it will encode exactly the same likelihood. Also: you can run the very same map_rect program using

  • no parallelisation
  • parallelisation with threads
  • parallelisation with MPI

and all three will fit exactly the same model.

(For MPI, the static data are shipped to each node up front; the parameters are then shipped to the nodes at each leapfrog transition, and the results come back from the nodes.)

I see, thank you for explaining. I think part of my confusion comes from understanding when information gets passed among nodes with MPI. If this occurs at each leapfrog transition, does that mean all the shards need to be able to run simultaneously? E.g., if I have 28 nodes, should I run at most 28 shards (versus running more, smaller shards that don’t all run at the same time)? Or am I still missing something here?

Right. When map_rect is parallelized, the evaluation of the log density and its gradient is what gets parallelized. So that happens at every leapfrog step.

With map_rect, you will get the same log density and gradients running in parallel using MPI, in parallel using multi-threading, or sequentially. The only thing MPI or multi-threading does is run map_rect in parallel. Stan can run map_rect in parallel safely because Stan programs have no user-facing side effects and it can synchronize the autodiff behind the scenes.

If there are N shards in a call to map_rect, then we can get at most N-way parallelism. The most efficient thing to do is usually to set N to roughly the number of cores. I say “roughly” because of issues like hyperthreading and memory contention.
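The shard layout is entirely up to how you pack the data, so you can match N to your hardware. Here is a sketch along the lines of the toy model above (same hypothetical names, and it assumes for simplicity that the number of shards divides the number of subjects evenly), where each shard bundles several subjects so the number of shards can be set to roughly the core count:

  functions {
    // One shard now covers several subjects: theta stacks their
    // standardised effects, x_i stacks their choices (T per subject).
    vector shard_ll(vector phi, vector theta, array[] real x_r, array[] int x_i) {
      int J = rows(theta);                // subjects in this shard
      int T = size(x_i) %/% J;            // trials per subject
      real lp = 0;
      for (j in 1:J)
        lp += bernoulli_logit_lpmf(x_i[((j - 1) * T + 1):(j * T)]
                                   | phi[1] + phi[2] * theta[j]);
      return [lp]';
    }
  }
  data {
    int<lower=1> S;                       // subjects
    int<lower=1> T;                       // trials per subject
    int<lower=1> N;                       // shards; set to roughly the core count
    array[S, T] int<lower=0, upper=1> y;  // choices
  }
  transformed data {
    int J = S %/% N;                      // subjects per shard (sketch assumes N divides S)
    array[N, 0] real x_r;
    array[N, J * T] int x_i;
    for (n in 1:N)
      for (j in 1:J)
        x_i[n, ((j - 1) * T + 1):(j * T)] = y[(n - 1) * J + j];
  }
  parameters {
    real mu;
    real<lower=0> sigma;
    vector[S] z;
  }
  model {
    array[N] vector[J] theta;
    for (n in 1:N)
      theta[n] = z[((n - 1) * J + 1):(n * J)];

    mu ~ normal(0, 1);
    sigma ~ normal(0, 1);
    z ~ std_normal();
    target += sum(map_rect(shard_ll, [mu, sigma]', theta, x_r, x_i));
  }

The log density is still the sum over all subjects, so the amount of data in any one shard has no bearing on convergence or pooling; it only affects how evenly the work gets spread across the workers.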

That helps, thank you! I’m looking forward to trying it out.