Parallelization within chain and between nodes?

Hi everyone!

I have a complex model that is going to become even more complex next year! So I’m trying to optimize my code as well as my use of the cluster, and any help would be more than welcome!

I have implemented within-chain parallelization: I have four chains with 20 threads per chain (using reduce_sum), running on one node with 80 cores requested.

However, I would prefer to split those resources across several nodes (to avoid a long pending time on the cluster), like this: 4 nodes, with one chain running on each node using 20 threads per chain. Globally I would then be requesting 4*20 cores, which is faster to obtain than 80 cores on a single node.

To do so, I think I need to switch from reduce_sum to map_rect and MPI (see below).

Is that indeed the case? Do you think it would be worth switching to map_rect and MPI? Is there any documentation on how to make this switch (covering both map_rect and MPI), given that I don’t have a computer-science background?

Or is there a way to stay with reduce_sum and still benefit from multiple nodes? Is there another alternative?

Thank you in advance for any help!

-S

MPI is needed when parallelizing a single chain across multiple nodes. But according to your example above, you don’t actually need this. You could just submit four jobs, each for one chain, and combine the chains after the fact. That is, on each of four nodes, submit a 20-core single-chain job, parallelizing within the node using reduce_sum.

You’d need to make sure that the seeds are distinct across the different chains. Otherwise there are no complications to this approach.
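
For example, each of the four jobs could run a script along these lines (a cmdstanr sketch; the model file, the data file, and the SLURM_ARRAY_TASK_ID environment variable are assumptions about your setup, not from your post):

```r
library(cmdstanr)

# Each scheduler job runs exactly one chain; the array index (assumed
# to be SLURM's SLURM_ARRAY_TASK_ID here) tells the script which of
# the four jobs it is: 1, 2, 3, or 4.
job_id <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))

# Compile with threading enabled so reduce_sum can use multiple threads.
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

fit <- mod$sample(
  data = "data.json",
  chains = 1,
  threads_per_chain = 20,  # one 20-core node per job
  seed = 1000 + job_id     # a distinct seed for each chain
)

# Save the CSV output under a predictable, job-specific name so a
# later job can merge the four chains.
fit$save_output_files(dir = "results", basename = paste0("chain-", job_id))
```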


Thank you for your quick reply and guidance!

Since my model will in fact need more than 80 cores (for example 200 cores, possibly even more), and since smaller resource requests get through the cluster queue faster, I’m wondering if I could mix your approach (option 2) with MPI?

Initial option (1): classic reduce_sum:

  • 1 job requested, which lands on one node.
  • On that node, 4 chains run using 200 cores (so 50 threads per chain).
  • (=> globally 200 cores)

Option (2): one reduce_sum job per chain:

  • 4 independent jobs requested, each launched on its own node.
  • On each node, 1 chain runs using 50 cores (so 50 threads per chain).
  • (=> globally 200 cores)

Option (3): one reduce_sum job per chain, plus MPI:

  • 4 independent jobs requested; each job would use 5 nodes.
  • In each job (so across 5 cooperating nodes), 1 chain would run using 10 cores per node (so 10 threads per node, 50 threads per chain).
  • (=> globally 200 cores)

Is it possible to implement option 3 with Stan?

Assuming that your cluster has nodes with N cores, the easiest way to scale your current model would be to submit one job per chain; each job requests a single node with N cores and runs its chain with N threads. After all four jobs are done, have another job combine the results; an initial setup job may also be useful to format all of the inputs.
Doing it this way shouldn’t require modifications to the Stan code.
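
For example, with cmdstanr (see the side note below) the combine step can be as simple as reading the per-chain CSV files back into a single fit object. A sketch, assuming the job scripts saved their output as results/chain-<id>-*.csv as in the earlier sketch:

```r
library(cmdstanr)

# Collect the CmdStan CSV files written by the four single-chain jobs.
csv_files <- list.files("results", pattern = "^chain-.*\\.csv$",
                        full.names = TRUE)

# Build one fit object containing all four chains, then proceed as usual.
fit <- as_cmdstan_fit(csv_files)
fit$summary()
```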

Side note: if you’re using cmdstanr, you can keep sample()'s seed argument the same for all runs, but increment the chain_ids for each job.
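
In code, each job’s sampling call would then look something like this (a fragment building on the sketch in the earlier reply, where mod and job_id are defined):

```r
# Same seed for every job, but a distinct chain ID; CmdStan uses the
# chain ID to offset the RNG stream, so the chains still produce
# independent draws.
fit <- mod$sample(
  data = "data.json",
  chains = 1,
  chain_ids = job_id,      # 1, 2, 3, or 4 depending on the job
  threads_per_chain = 20,
  seed = 1234              # identical across all four jobs
)
```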

If you want to run a single chain across multiple nodes, you’ll need to consider using map_rect and MPI, which will require changing the model. These changes are usually not conceptually difficult, but they do require a lot of fiddly packing and unpacking of variables, which can be an easy place for bugs to enter the program. I’d try seeing how much performance you can get out of a single node per chain before switching.
