I’ve been using Stan for a while to fit hierarchical models, with good success (so thank you!). I recently got access to a computing cluster and have been looking into using MPI and map_rect to run my models on the cluster. The models I use generally take hours to days to run so the resulting speedup seems very useful. However, as someone new to MPI, I’m not sure I understand how the HMC estimation works with it.
My models are generally fit to hierarchical data with relatively little data at the lower levels (I look at people's performance on decision-making tasks, where the first level of data is choices on individual trials and the second level is subjects, with trials nested within subjects), and so benefit greatly from partial pooling. If I fit only a few higher-level observations (i.e., only a few subjects' data), models don't converge or have other estimation issues. This makes me worry about splitting data into small shards with MPI: it seems the model would not converge if it were only using a shard's worth of data. From my reading about MPI, though, it sounds like information is shared among nodes during estimation so that this is not an issue. Or do I need to make sure the amount of data in each shard is large enough to estimate the model on its own, or that enough shards are running at once to share enough data across the running nodes? Any clarification, or direction to (relatively non-technical) resources I could use to learn about this, would be appreciated. Thank you!
I see, thank you for explaining. I think part of my confusion comes from not understanding when information gets passed among nodes with MPI. If this occurs at each leapfrog transition, does that mean all the shards need to run simultaneously? E.g., if I have 28 nodes, should I run a maximum of 28 shards (versus running more, smaller shards, but not all at the same time)? Or am I still missing something here?
Right. When map_rect is parallelized, the evaluation of the log density and its gradient are what gets parallelized. So that’s every leapfrog step.
With map_rect, you will get the same log density and gradients running in parallel using MPI, in parallel using multi-threading, or sequentially. The only thing MPI or multi-threading do is run map_rect in parallel. Stan can run map_rect in parallel safely because Stan programs have no user-facing side effects and it can synchronize the autodiff behind the scenes.
If there are N shards in a call to map_rect, then we get at most N-way parallelism. The most efficient thing to do is usually to set N to roughly the number of cores. I say "roughly" because of issues like hyperthreading and memory contention.
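To make the shape of this concrete, here is a minimal sketch of a map_rect-based likelihood, not your model: the function name, the parameter packing, and the per-shard-equals-per-subject layout are all placeholder assumptions. The key point is that the shared parameters phi are passed to every shard, so partial pooling works exactly as it does in the unsharded model; sharding only changes how the log density is computed, not what it is.

```stan
functions {
  // Per-shard log likelihood. phi holds the shared (population-level)
  // parameters, theta holds this shard's subject-level parameters, and
  // x_r / x_i hold this shard's real- and integer-valued data.
  vector shard_lp(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    real mu = phi[1];              // population mean (placeholder)
    real sigma = exp(phi[2]);      // population scale, log-parameterized
    // Hypothetical layout: each shard holds one subject's observations,
    // and theta[1] is that subject's offset from the population mean.
    return [normal_lpdf(to_vector(x_r) | mu + theta[1], sigma)]';
  }
}
model {
  // ... priors on phi and theta ...
  // map_rect evaluates shard_lp once per shard (in parallel when MPI or
  // threading is enabled) and concatenates the results; summing them
  // gives the same total log likelihood as the sequential version.
  target += sum(map_rect(shard_lp, phi, theta, x_r, x_i));
}
```

Because the result is identical whether the shards run in parallel or sequentially, convergence depends only on the total data and the model, never on how the data are split across shards.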