Correct way to use MPI with cmdstanpy

Hello all,
I am trying to launch a cmdstanpy sampling using MPI with reduce_sum and I have a few questions about the correct implementation.
I modified the make/local file and build the binaries, but still have a few questions:

  • Should I compile the model with STAN_THREADS: TRUE and define os.environ["STAN_NUM_THREADS"] = "n" (as if I was multithreading on a single node) or not?
  • Should I submit only the sampling process as a cluster MPI job or can it be the entire process including model compilation?
  • For some reason, the progress bar goes away when I run the sampling with MPI. Is there a way to bring it back? (I already use show_progress=True).

Thank you!

damn - we don’t yet have the correct invocation needed to run MPI - cf discussion in this issue:

the implementation should be really pretty simple - PR’s welcome!

if you’re running reduce_sum without MPI, there’s some documentation here:

it can be the entire process including model compilation - or you can compile the model and pass in the exe file location - here’s an example script to run a single chain per node on a cluster:

# Use CmdStanPy to run one chain
# Required args:
# - cmdstanpath
# - model_exe
# - seed
# - chain_id
# - output_dir
# - data_file

import os
import sys

from cmdstanpy import CmdStanModel, set_cmdstan_path, cmdstan_path

useage = """\ <cmdstan_path> <model_exe> <seed> <chain_id> <output_dir> (<data_path>)\

def main():
    if (len(sys.argv) < 5):
        print("missing arguments")
    a_cmdstan_path = sys.argv[1]
    a_model_exe = sys.argv[2]
    a_seed = int(sys.argv[3])
    a_chain_id = int(sys.argv[4])
    a_output_dir = sys.argv[5]
    a_data_file = None
    if (len(sys.argv) > 5):
        a_data_file = sys.argv[6]

    mod = CmdStanModel(exe_file=a_model_exe)
    fit = mod.sample(chains=1, chain_ids=[a_chain_id], seed=a_seed, output_dir=a_output_dir, data=a_data_file)

if __name__ == "__main__":

I am so sorry, but now I am even more confused. Based on the thread I understand that MPI + reduce_sum won’t work? Is it really so?

Thanks! I just tried it with 120 cores and many cores failed to compile the model at the same time. So perhaps a better idea is to pass a compiled model :)

Yes, we do not have a mpi backend for reduce sum. You can only split work with map rect over mpi and then nest reduce sum into that.

Is it something that might me implemented anytime soon? My big hope was to run MPI and reduce_sum.

I don’t think so.


Not sure if it makes sense, but is it possible to use MPI for parallelization instead of reduce_sum then? How would this work? If within chain parallelization with reduce_sum cannot be used with MPI, is the idea of MPI to run dozens of chains in parallel? Or using MPI makes sense only with map_rect?

using MPI only makes sense with map_rect, as that’s what’s needed to spread the computation across processes.

if you can use reduce_sum to compute what you want to compute, that amount of parallelism might be good enough.

1 Like

Thanks! It is true in most cases for reduce_sum, but without MPI, reduce_sum is limited to the number of cores in a single node, which might not be enough. I guess it is going back to map_rect for me.