CmdStanPy, MPI speedup

Hi all,

I have connected a Stan program to MPI in hopes of a speedup. I have 10 nodes running (verified by seeing their CPUs max out), but I am not seeing a 10x speedup. I do have some network latency, so I'm not sure if that's throttling me a bit. Here is the output of the running program:

Starting mpi process
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 500
    save_warmup = false (Default)
    thin = 1 (Default)
    adapt
      engaged = true (Default)
      gamma = 0.05 (Default)
      delta = 0.8 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
      save_metric = false (Default)
    algorithm = hmc (Default)
      hmc
        engine = nuts (Default)
          nuts
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file = (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
    num_chains = 1 (Default)
id = 1 (Default)
data
  file = stan_data.json
init = 2 (Default)
random
  seed = 3950355111 (Default)
output
  file = blah.csv
  diagnostic_file = (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  save_cmdstan_config = false (Default)
num_threads = 1 (Default)
mpi_enabled = 1

Iteration: 1 / 1500 [ 0%] (Warmup)

Is there a specific config change I need to explore to benefit further from an MPI speedup? Thank you,

Did some reading… it looks like MPI is designed for parallelizing across multiple chains, not for speeding up a single chain. Is this correct?

If so, is there a way to speed up a single chain?

Yes! See Parallelization

Thank you. The piece I may have missed is this:

In order to use multi-processing with MPI in a Stan model, the models must be rewritten to use the map_rect function. By using MPI, the model can be parallelized across multiple cores or a cluster.

I will do this next
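A side note on the rewrite: map_rect takes rectangular arguments, so the data has to be packed into equal-length shards first. Here is a minimal Python sketch of that packing step (shard_data is a hypothetical helper of my own, not anything from cmdstanpy; zero-padding the last shard is just one possible choice):

```python
# Hypothetical helper: pack a flat 1-D array into equal-length shards so that
# map_rect receives rectangular inputs. Zero-padding the tail is one
# illustrative choice; the model then needs to know the true lengths.
import numpy as np

def shard_data(y, n_shards):
    y = np.asarray(y, dtype=float)
    per_shard = int(np.ceil(len(y) / n_shards))   # items per shard
    padded = np.zeros(n_shards * per_shard)       # zero padding at the tail
    padded[:len(y)] = y
    return padded.reshape(n_shards, per_shard), per_shard

shards, m = shard_data(np.arange(10.0), n_shards=4)
# 10 observations become a 4 x 3 array with 2 padded zeros at the end
```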

Hello, I rewrote my function to use the map_rect() function. MPI is configured.

When it comes to this step:
fit = model.sample(data=stan_data, **rc.sample_kwargs)

I now replace it with:
fit = cmdstanpy.from_csv('output.csv') (the output.csv is the output from the MPI process, and seems to be populated with all the outputs)

I get this error:

Invalid or corrupt Stan CSV output file.

How do I get a MPI output file to use for fitting? thanks,

it appears that from_csv() is a bit brittle: the subprocess.run() generation of csv_output.csv creates a CSV file, but it may not be exactly what cmdstanpy expects (for example, there are true/false values instead of the 1/0 values that from_csv() expects)
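One workaround I considered (a hypothetical sketch, not something cmdstanpy provides): rewrite the true/false tokens in the CSV header comments to 1/0 before handing the file to from_csv(). I have not verified this against every CmdStan output, so treat it as an illustration of the mismatch rather than a fix:

```python
import re

def normalize_header_line(line):
    """Rewrite 'key = true'/'key = false' comment lines to 'key = 1'/'key = 0'."""
    line = re.sub(r"(=\s*)true\b", r"\g<1>1", line)
    return re.sub(r"(=\s*)false\b", r"\g<1>0", line)

print(normalize_header_line("# save_warmup = false (Default)"))
# -> # save_warmup = 0 (Default)
```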

so I stumbled on this:

here I found I can do

cpp_options = {
    'STAN_MPI': True,
    'CXX': 'mpicxx',
    'TBB_CXX_TYPE': 'gcc'
}

sm = CmdStanModel(stan_file='./stan_files/BNN.stan', cpp_options=cpp_options, compile='force')

but I don't have control over the number of nodes for MPI to run on; using the cpp_options above doesn't run the other nodes in my MPI network.

The subprocess.run() approach worked as intended, but I can't seem to get cmdstanpy to work with its output to create the fit. Has anyone been able to get MPI working with this?

Hi, @e32432423 (shouldn't the last two digits be "32"?).

What platform are you running on? I believe this is much more challenging with Windows.

You can use multiple threads within a single machine, or MPI across machines. Usually it’s better to scale up on one machine until you run out of room and only then scale out using MPI, which can be much slower due to network latency. MPI is only going to pay off across machines if you have more compute to do than network latency, which will depend on the Stan program and the hardware setup.

It generally helps us debug if you provide a reproducible example. It looks like you're running at least something through Python?

I chose “23” at the end to throw you off Bob ;).

This is Linux.

For the cmdstanpy equivalent, it looks like I need to stick with cmdstanpy, so I found the following:

cpp_options = {
    'STAN_MPI': True,
    'CXX': 'mpicxx',
    'TBB_CXX_TYPE': 'gcc'
}
model = cmdstanpy.CmdStanModel(stan_file=stan_file_path, cpp_options=cpp_options, force_compile=True)

in the cmdstanr world, the next step would be:
fit = model.sample_mpi(data=stan_data, mpi_cmd="mpiexec", mpi_args="-n 4")

for running
mpiexec -n 4 model_executable

but in Python, any attempt at that gives
AttributeError: 'CmdStanModel' object has no attribute 'sample_mpi'

as there doesn’t seem to be any matching documentation for cmdstanpy equivalents
https://mc-stan.org/cmdstanpy/

Instead of running sample() from Python, I tried to run the command manually:

subprocess.run([
    'mpirun', '-np', '4', '-f', 'node_config_file', model.exe_file,
    'sample', 'num_samples=1000', 'num_warmup=500',
    'data', 'file={}'.format(stan_data_file),
    'output', 'file=output.csv'
])

and then
fit = cmdstanpy.from_csv('output.csv'), but from_csv() is brittle, and the CSV output by the subprocess.run() call doesn't exactly match what from_csv() expects (e.g., true/false instead of 1/0, among other things)

I think this is a general cmdstanpy + MPI issue. I cannot change the title. There are 2 possible solutions:

  1. ensure the community can run the command:
    mpiexec -n 4 -f node_config_file model.exe_file
    within the cmdstanpy syntax, where node_config_file specifies how many threads per node to run, or
  2. cmdstanpy.from_csv() needs to read the CSV output by the subprocess.run() command in the above post.

I would imagine addressing either of these would solve this general problem for anyone trying to use MPI + cmdstanpy. There is no obvious cmdstanpy equivalent of .sample_mpi() (with its mpi_args argument) from the cmdstanr world.
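To make solution 1 concrete, here's a sketch of a thin wrapper that just assembles the MPI launch command (assuming an MPICH-style mpirun with -np and -f flags; mpi_sample_cmd is my own hypothetical helper, not a cmdstanpy API):

```python
def mpi_sample_cmd(exe, data_file, out_file, n_procs=4, hostfile=None):
    """Build the mpirun + CmdStan argument list (hypothetical helper)."""
    cmd = ["mpirun", "-np", str(n_procs)]
    if hostfile:
        cmd += ["-f", hostfile]  # MPICH-style hostfile flag
    cmd += [exe, "sample",
            "num_samples=1000", "num_warmup=500",
            "data", "file={}".format(data_file),
            "output", "file={}".format(out_file)]
    return cmd

# The result can then be passed to subprocess.run(), as in the earlier post.
print(mpi_sample_cmd("./model", "stan_data.json", "output.csv",
                     hostfile="node_config_file"))
```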

Are you using the latest cmdstanpy?

Have you been able to run any non-MPI sampling successfully?

My cmdstanpy version is 1.2.2. I notice there is a 1.2.4. I will try it. But:

  1. What is the syntax such that I can run a multi-node MPI command like this in cmdstanpy, the way cmdstanr allows?
    mpiexec -n 4 -f node_config_file model.exe_file
    I tried:
    cpp_options = {
        'STAN_MPI': True,
        'CXX': 'mpicxx',
        'TBB_CXX_TYPE': 'gcc',
        'mpiexec': '-n 4 -f node_config_file'
    }
    model = cmdstanpy.CmdStanModel(stan_file=stan_file_path, cpp_options=cpp_options, force_compile=True)
    fit = model.sample(data=stan_data)
    and it doesn't execute the MPI run as expected.
  2. If that's not possible, will 1.2.4 resolve cmdstanpy.from_csv() being brittle? The subprocess.run() command of the executable (see the above post) runs the MPI job as expected, but the resulting output.csv is not readable by cmdstanpy.from_csv().

yes, non-MPI sampling works fine.

Cmdstanpy does not have this feature built in, in the way that cmdstanr apparently does

The primary reason for the 1.2.4 release was to fix issues with reading the CSV files produced by recent CmdStan versions

  1. May I ask why?
  2. I am hoping 1.2.4 has from_csv() fixed; trying now,

I don't think there is any reason other than that nobody wrote the code. sample_mpi was added to cmdstanr before my involvement with the project

cmdstanpy 1.2.4 seemed to read in the csv file using cmdstanpy.from_csv() method. this is excellent, and considered closed at this time.

Here's a long-shot question.

The above has discussed running the .exe file using an MPI command, parallelizing within a single chain (num_chains=1) via n_shard and map_rect. This helps with the gradient evaluation time.

  1. Is there a way to parallelize to decrease iteration completion time? Does an iteration include many serial gradient evaluations?
  2. Is there also a way to parallelize across chains, where num_chains > 1? Currently it runs serially: chain 1's iterations, then chain 2's iterations… I would imagine that is a parallelizable opportunity…

If you're already using map_rect, there aren't really any additional opportunities for parallelism within one iteration. There are some (potentially wasteful) options discussed here, but they are not implemented.
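For intuition on point 1: each HMC/NUTS iteration performs many leapfrog steps, and each step needs a gradient evaluated at the position the previous step produced, so those evaluations cannot overlap. A toy illustration (a 1-D standard-normal target, not Stan's actual implementation):

```python
# Toy leapfrog integrator for a 1-D standard normal target, illustrating that
# gradient evaluations inside one iteration are inherently sequential.
import numpy as np

def grad_U(q):
    return q  # gradient of -log p(q) for a standard normal

def leapfrog_trajectory(q, p, eps=0.1, n_steps=10):
    n_grad = 0
    g = grad_U(q); n_grad += 1
    for _ in range(n_steps):
        p = p - 0.5 * eps * g       # half-step momentum
        q = q + eps * p             # full-step position
        g = grad_U(q); n_grad += 1  # gradient depends on the NEW position
        p = p - 0.5 * eps * g       # second half-step momentum
    return q, p, n_grad

q, p, n_grad = leapfrog_trajectory(np.array([1.0]), np.array([0.0]))
# n_grad == n_steps + 1: one serial gradient evaluation per leapfrog step
```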

Yes, definitely. Parallelization. You can also do what cmdstanpy and friends used to do and wrap your subprocess call in something from python’s multiprocessing module
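As a sketch of the multiprocessing idea: launch one subprocess per chain from a small pool. A stand-in command is used below in place of the real mpirun/CmdStan invocation, and run_chain is my own illustrative name:

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool  # threads suffice: subprocesses do the work

def run_chain(chain_id):
    # In practice this would be the mpirun + CmdStan command from the earlier
    # post, with a per-chain id=... and output file=output_{chain_id}.csv.
    cmd = [sys.executable, "-c", "print('chain {} done')".format(chain_id)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

with ThreadPool(4) as pool:
    print(pool.map(run_chain, [1, 2, 3, 4]))
# -> ['chain 1 done', 'chain 2 done', 'chain 3 done', 'chain 4 done']
```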

Is it true that the PyMC devs found a way to parallelize via CPU/GPU? If so, how can Stan do the same?

It seems like the sampling can be parallelized?