Parallelization within chain and between nodes?

Hi everyone!

I have a complex model that is going to become even more complex next year! So I’m trying to optimize my code as well as my use of the cluster, and any help will be more than welcome!

I have implemented within-chain parallelization: I run four chains with 20 threads per chain (using reduce_sum), requesting 80 cores on one node.

However, I would prefer to split those resources across several nodes (to avoid a long pending time on the cluster), like this: 4 nodes, with one chain running on each node, using 20 threads per chain. Globally I would then be requesting 4*20 cores, which is faster to obtain than 80 cores on a single node.

To do so, I think that I need to switch from reduce_sum to map_rect and MPI (see below).

Is that indeed the case? Do you think it is worth switching to map_rect and MPI? Is there documentation on how to do this switch (both for map_rect and for MPI, knowing that I don’t have a computer-science background)?

Or is there a way to stay with reduce_sum and still benefit from multiple nodes? Is there another alternative?

Thank you in advance for any help!

-S

MPI is needed when parallelizing a single chain across multiple nodes. But according to your example above, you don’t actually need this. You could just submit four jobs, each for one chain, and combine the chains after the fact. That is, on each of four nodes, submit a 20-core single-chain job parallelizing within the node using reduce_sum.

You’d need to make sure that the seeds are distinct on the different chains. Otherwise there are no complications to this approach.


Thank you for your quick reply and guidance!

As my model will in fact require more than 80 cores (for example 200 cores, or even more), and as it’s faster to obtain resources from the cluster when the request is broken up, I’m wondering if I could mix your approach (option 2) with MPI?

Initial option (1): classic reduce_sum:

  • 1 job requested, which lands on one node.
  • On that node, 4 chains run using 200 cores (so 50 threads per chain).
  • (=> globally 200 cores)

Option (2): one reduce_sum job per chain:

  • 4 independent jobs requested, each launched independently on its own node.
  • On each node, 1 chain runs using 50 cores (so 50 threads per chain).
  • (=> globally 200 cores)

Option (3): one reduce_sum job per chain, plus MPI:

  • 4 independent jobs requested. Each job would use 5 nodes.
  • In each job (so across 5 “dependent” nodes), 1 chain would run using 10 cores per node (so 10 threads per node, i.e. 50 threads per chain).
  • (=> globally 200 cores)

Is it possible to implement this option 3 with Stan, please?

Assuming that your cluster has nodes with N cores, the easiest way to scale your current model would be to submit one job per chain; each job requests a single node and N cores and runs the chain with N threads. After all four jobs are done, have another job combine the results; an initial setup job may also be useful to format all of the inputs.
Doing it this way shouldn’t require modifications to the Stan code.

Side note: if you’re using cmdstanr, you can keep sample()'s seed argument the same for all runs, but increment the chain_ids argument for each job.

If you want to run a single chain across multiple nodes, you’ll need to consider using map_rect and MPI, which will require changing the model. These changes are usually not conceptually difficult, but they do require a lot of fiddly packing and unpacking of variables, which can be an easy place for bugs to enter the program. I’d try seeing how much performance you can get out of a single node per chain before switching.
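For concreteness, here is a minimal sketch of what that packing and unpacking typically looks like. All names here (shard_lp, phi, theta, the padding convention) are illustrative, not taken from the poster's model:

```stan
functions {
  // One shard's contribution to the log likelihood.
  // phi   : parameters shared by every shard
  // theta : this shard's parameters, packed into a fixed-width vector
  // x_r   : this shard's real data (padded so all shards have equal width)
  // x_i   : this shard's integer data; here x_i[1] stores the true length
  vector shard_lp(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];                           // unpack: observations in shard
    real mu = phi[1] + theta[1];              // shared intercept + local effect
    real sigma = exp(phi[2]);                 // shared log-scale parameter
    return [normal_lpdf(to_vector(x_r[1:n]) | mu, sigma)]';
  }
}
model {
  // theta, x_r, x_i are arrays with one entry per shard
  target += sum(map_rect(shard_lp, phi, theta, x_r, x_i));
}
```

Note that the same map_rect code runs either threaded or over MPI: with CmdStan, MPI support has to be enabled at compile time (the STAN_MPI makefile flag) and the executable launched through mpirun/mpiexec; otherwise map_rect falls back to threading.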


Hi everyone,

Thank you very much for all the replies, they were really helpful!
I feel like I’m making progress with map_rect and MPI, using cmdstanr, but I would like to double-check my understanding and clarify a few points.

@wds15, if you have time, I’d be very grateful for your insight as well :)

Using my cluster, I am now able to request, in a single job, 4 nodes with 40 cores each (so 160 cores in total). I am running 4 chains, and my understanding is that this corresponds to 1 chain per node, each chain using 40 cores. My goal is to leverage both between-node parallelization (via MPI) and within-chain parallelization (via map_rect) at the same time.

When I run the model, Stan prints:

Starting sampling with sample_mpi…
Running MCMC with 4 chains using MPI with 4 processes…

and then I see output like:

Chain 1 Iteration: 1 / 1000 (Warmup)
Chain 1 Iteration: 2 / 1000 (Warmup) …

So I have two questions:


1) MPI + map_rect: between-node and within-chain parallelization

Does using MPI + map_rect indeed allow between-node parallelization and within-chain parallelization at the same time?

In other words:

  • Is it possible for each chain to run on a separate node and
  • for each chain to internally parallelize computations using multiple cores?

Or is there a fundamental limitation that prevents combining these two levels of parallelism?
Sorry if this is obvious — I just want to be absolutely sure that I understand what is happening under the hood.


2) Debugging map_rect functions with print?

To debug my code and check that variables are correctly packed and unpacked, I would like to use print statements in my Stan model.

When I print inside the transformed data, transformed parameters, or model blocks, I do get output in the output file, for example:

Chain 1 in transformed data block
Chain 1 lesion 1 n_obs=2 max_obs=3 sn=1 en=1 sp=2 ep=2
Chain 1   t1=[-29,50,0] y1=[111,75,0] cens1=0
...
Chain 1 Iteration: 1 / 1000 (Warmup)

However, when I try to use print inside the function block, and more specifically inside the function mu_fun used by map_rect:

map_rect(mu_fun, shared, job, x_r, x_i);

I do not see any output at all. This happens regardless of what I try to print (a fixed number, a fixed string, a local variable, or data passed to the function).

So my questions are:

  • Is this behavior expected when using map_rect (with or without MPI)?
  • Is this a known limitation rather than a bug?
  • Is there a recommended workaround?

More generally, what is the recommended way to debug functions used inside map_rect, especially when running with MPI?
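One workaround worth trying (sketched below using the thread's mu_fun and its arguments, whose exact dimensions are assumed here) is to call the shard function directly, outside map_rect, on a single shard. A direct call runs in the main process, so its print() output is not lost in the worker processes:

```stan
transformed data {
  // Debug only: hypothetical fixed values standing in for the shared
  // parameter vector (transformed data cannot read real parameters).
  vector[2] phi_test = [0.0, 1.0]';

  // Direct call on the first shard: no map_rect, no MPI workers, so any
  // print() inside mu_fun appears in the main chain's output.
  print("mu_fun on shard 1: ", mu_fun(phi_test, job[1], x_r[1], x_i[1]));
}
```

It can also help to temporarily compile without MPI and run with a single thread, so the whole model (map_rect included) executes serially in one process while you check the packing and unpacking.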

Thank you very much in advance for any help or clarification!

Best,

-S

I’m curious about this myself. It’s one of the sticking points for Stan vs. embedded PPLs—we don’t have a good debugging environment. The best we can do is look at MCMC or optimization output, or instrument the code with print statements. As always, breaking things down into the simplest possible situation to debug a function is usually helpful.

Can I just double check, for your particular use case:

  1. You are able to request more total cores only by requesting more nodes, and also
  2. You have reason to believe that the wall time continues to decrease meaningfully as you increase the number of cores per chain beyond 40, and also
  3. Your model is long-running enough, and the warmup is costly enough, that it’s important to bring down the total wall-time devoted to warmup, rather than just choosing to run more chains through warmup and then into relatively shorter sampling stages?

All three of these things need to be true in order for your goal to be potentially worthwhile. Even then, my honest recommendation would be to experiment with shortening warmup and running more chains, rather than jumping straight to tinkering with MPI.

When I say shortening warmup, I don’t mean blindly reducing the length of the warmup phase and hoping it works ok. I mean finding ways to shorten warmup without sacrificing the quality of adaptation, i.e. by using better inits, a better initial guess for the inverse metric, and a longer first window length.


Thank you very much for the reply!!!
Any help is more than welcome, so please don’t hesitate to ask for additional information or clarification if needed, and feel free to double- or triple-check my reasoning :)
I’m a pharmacist by training, so HPC and computer-science concepts are still something I’m actively learning.

1) Cluster constraints and job configuration

Per task, I have a hard limit of 90 cores.

One possible approach would be to run four separate jobs, each using 90 cores, with one chain per job. In that case, each chain could leverage reduce_sum, and I would merge the chains afterward. While this is technically feasible, I am somewhat concerned about managing many output files across the large number of model variants I plan to run. More importantly, I rely heavily on the generated quantities block for simulations and predictions, and I feel uneasy about running chains independently and recombining outputs afterward, although I’m not sure whether this concern is actually justified.

Another approach would be to run a single job with multiple tasks, for example four tasks (one per chain), each task using around 90 cores. This is the configuration that initially led me to consider MPI together with map_rect, as it feels conceptually cleaner to me: one job, one model, one set of outputs.

In practice, although the hard limit is 90 cores, I currently request only around 40 cores, mainly because queue times for 90 cores are very long, and because I am not yet convinced that increasing from 40 to 90 cores would lead to a meaningful reduction in wall time.

2) Expected scaling with additional cores

At this stage, I do not have strong empirical evidence that wall time will continue to decrease substantially as the number of cores increases, but given the size of the model, I suspect that additional parallelism might still be beneficial.

In a previous project (see poster: A_MERLAUD_Characterizing_the_organ-specific_association_between_tumor_dynamics_and_overall_survival_across_cancer_types_and_studies.pdf, 1.6 MB), I analyzed trials separately. In one such trial, I had approximately 1,400 tumors measured longitudinally, nested within about 600 patients, and I fitted a multilevel joint model combining tumor dynamics and survival, with random effects at the lesion and patient levels. Using reduce_sum led to a substantial speed-up, but I still reached a wall time of around 8 hours.

I now plan to analyze all studies simultaneously, with roughly 30,000 tumors measured over time across 22 studies, three levels of random effects (tumor, individual, and study), and several fixed effects to assess covariates. Given the much larger number of observations, I initially assumed that it might be possible to make profitable use of more cores than before. In addition, in another thread (see Code optimization & reduce_sum), you suggested slicing at the tumor level rather than by study, which also seems to point in the direction of increased parallelism.
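As a rough illustration of what tumor-level slicing looks like with reduce_sum (the names tumor_lp, patient_of_tumor, mu_patient, etc. are hypothetical, just to show the shape of the partial-sum function, not the poster's actual likelihood):

```stan
functions {
  // Partial sum over a slice of tumors; reduce_sum chooses the slices.
  // The first argument is the sliced one; start/end give its position in
  // the full array, but the slice itself is indexed from 1.
  real tumor_lp(array[] int tumor_slice, int start, int end,
                vector y, array[] int patient_of_tumor,
                vector mu_patient, real sigma) {
    real lp = 0;
    for (i in 1:num_elements(tumor_slice)) {
      int t = tumor_slice[i];
      lp += normal_lpdf(y[t] | mu_patient[patient_of_tumor[t]], sigma);
    }
    return lp;
  }
}
model {
  // tumor_idx = 1:N_tumors. Slicing over tumors yields many small,
  // evenly sized work units, which tends to load-balance better than
  // slicing over a handful of unequally sized studies.
  target += reduce_sum(tumor_lp, tumor_idx, grainsize,
                       y, patient_of_tumor, mu_patient, sigma);
}
```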

At this point, perhaps somewhat naively, I feel that I should either fully commit to a reduce_sum-based implementation or switch to MPI with map_rect.

3) Current runtimes

With a simplified version of the model (only the tumor dynamics component, no survival model, random effects only, and no fixed effects), runtime was approximately three days for 1,000 iterations (including 750 warmup iterations), with 2 chains, 80 cores, and a grainsize of 20.

Of course, I do plan to reduce the number of iterations, improve the priors and initial values, and continue optimizing the model structure.

Given all this, would you still recommend avoiding MPI and map_rect in my case, or do you think there could indeed be a benefit despite the additional complexity?

Thank you again for your time and advice!!!

-S


If I were in your position I would stick with reduce_sum and informative naming of the output files. Also, if you need to run a bunch of model variants (and you know in advance what several of the variants are, rather than choosing each one after seeing the previous results), then you’re better off parallelizing over those variant runs rather than parallelizing within chains. For example, if you have 4 nodes with 80 cores each and four variants to run, I would run one variant on each node, with 4 chains and 20 threads per chain.
