I have connected a Stan program to MPI in hopes of a speedup. I have 10 nodes running (verified by watching their CPUs max out), but I am not seeing a 10x speedup. I do have a small amount of network latency, so I'm not sure if that's throttling me a bit. Here is the output of the running program:
In order to use multi-processing with MPI in a Stan model, the model must be rewritten to use the map_rect function. With MPI, the model can be parallelized across multiple cores or a cluster.
Hello, I rewrote my function to use map_rect(), and MPI is configured.
When it comes to this step:
fit = model.sample(data=stan_data, **rc.sample_kwargs)
I now replace it with:
fit = cmdstanpy.from_csv('output.csv') (the output.csv is the output from the MPI process and seems to be populated with all the outputs)
I get this error:
Invalid or corrupt Stan CSV output file.
How do I get an MPI output file I can use for fitting? Thanks,
It appears that from_csv() is a bit brittle: the subprocess.run() generation of csv_output.csv creates a CSV file, but it may not be exactly what cmdstanpy expects (for example, there are true/false values instead of the 1/0 values that from_csv() wants).
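One possible workaround for the true/false issue mentioned above is to normalize the CSV before handing it to from_csv(). This is a sketch, not part of cmdstanpy; the helper name is made up, and the substitution covers only the true/false mismatch from the post above:

```python
import re

def sanitize_stan_csv(text: str) -> str:
    """Replace bare true/false tokens with 1/0 so a Stan-CSV reader that
    expects numeric values can parse the file.
    (Hypothetical helper; not part of cmdstanpy.)"""
    text = re.sub(r"\btrue\b", "1", text)
    return re.sub(r"\bfalse\b", "0", text)

# Usage sketch (assumes cmdstanpy is installed and output.csv exists):
# with open("output.csv") as f:
#     cleaned = sanitize_stan_csv(f.read())
# with open("output_clean.csv", "w") as f:
#     f.write(cleaned)
# fit = cmdstanpy.from_csv("output_clean.csv")
```

This only addresses the token mismatch the post identifies; if the MPI-generated CSV differs from CmdStan's format in other ways, those would need separate fixes.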
Using cpp_options as seen above doesn't run the other nodes in my MPI network.
The subprocess.run() call worked as intended, but I can't seem to get cmdstanpy to create the fit from its output. Has anyone been able to get MPI working with this?
Hi, @e32432423.
What platform are you running on? I believe this is much more challenging with Windows.
You can use multiple threads within a single machine, or MPI across machines. Usually it’s better to scale up on one machine until you run out of room and only then scale out using MPI, which can be much slower due to network latency. MPI is only going to pay off across machines if you have more compute to do than network latency, which will depend on the Stan program and the hardware setup.
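For the scale-up-on-one-machine route, a minimal sketch (assuming cmdstanpy is installed; the model and data names below are hypothetical placeholders):

```python
# STAN_THREADS is a real CmdStan compile-time make flag that enables
# within-chain threading; the model/data paths below are placeholders.
cpp_options = {"STAN_THREADS": True}

# import cmdstanpy
# model = cmdstanpy.CmdStanModel(stan_file="my_model.stan",
#                                cpp_options=cpp_options,
#                                force_compile=True)
# # threads_per_chain is a real cmdstanpy sample() argument:
# fit = model.sample(data=stan_data, chains=4, threads_per_chain=4)
```

With a map_rect model this lets each chain spread its shards over threads on one machine, with no network latency involved.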
It generally helps us debug if you provide a reproducible example. It looks like you're running at least something through Python?
and then
fit = cmdstanpy.from_csv('output.csv'), but from_csv() is brittle and the CSV output by subprocess.run() doesn't exactly match what from_csv() is looking for (e.g., true/false instead of 1/0, and other things)
I think this is a general cmdstanpy + MPI issue. I cannot change the title. There are two possible solutions:
ensure the community can run the command:
mpiexec -n 4 -f node_config_file model.exe_file
within the cmdstanpy syntax, where node_config_file tells how many processes to run per node, or
have cmdstanpy.from_csv() read the CSV output by the subprocess.run() command in the post above.
I would imagine addressing either of these would solve this general problem for anyone trying to use MPI + cmdstanpy. There is no obvious cmdstanpy equivalent of sample_mpi() (with its mpi_args argument) from the cmdstanr world.
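In the meantime, one way to approximate a sample_mpi()-style interface is to assemble the mpiexec command yourself and launch it outside cmdstanpy. The helper below is a hypothetical sketch (the function and argument names are made up, not cmdstanpy API), with the actual launch left as a commented subprocess.run() call since it needs a working MPI install:

```python
def build_mpi_cmd(exe_path, n_procs, node_config=None, sample_args=()):
    """Assemble an mpiexec argv list of the form
        mpiexec -n 4 -f node_config_file model.exe_file ...
    (Hypothetical helper, not part of cmdstanpy.)"""
    cmd = ["mpiexec", "-n", str(n_procs)]
    if node_config is not None:
        cmd += ["-f", node_config]
    cmd.append(exe_path)
    cmd += list(sample_args)
    return cmd

# Launch sketch (requires MPI and a compiled STAN_MPI model):
# import subprocess
# subprocess.run(build_mpi_cmd("./model", 4, "node_config_file"),
#                check=True)
```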
My cmdstanpy version is 1.2.2. I notice there is a 1.2.4; I will try it. But
what is the syntax to run a multi-node MPI command such as the following in cmdstanpy, like cmdstanr allows?
mpiexec -n 4 -f node_config_file model.exe_file
I tried:
cpp_options = {
    'STAN_MPI': True,
    'CXX': 'mpicxx',
    'TBB_CXX_TYPE': 'gcc',
    'mpiexec': '-n 4 -f node_config_file'
}
model = cmdstanpy.CmdStanModel(stan_file=stan_file_path, cpp_options=cpp_options, force_compile=True)
fit = model.sample(data=stan_data)
and it doesn't launch MPI as expected.
If that's not possible, then will 1.2.4 resolve cmdstanpy.from_csv() being brittle? The subprocess.run() invocation of the executable (see the post above) runs the MPI as expected, but the resulting 'output.csv' is not readable by cmdstanpy.from_csv().
The discussion above has covered running the .exe file with an mpi command, parallelizing within num_chains=1 via n_shard and map_rect. This helps with gradient evaluation time.
Is there a way to parallelize to decrease the time per iteration? Does an iteration include many serial gradient evaluations?
Is there also a way to parallelize across chains when num_chains > 1? Currently it runs serially: chain 1's iterations, then chain 2's iterations... I would imagine that is a parallelizable opportunity.
If you're already using map_rect, there aren't really any additional opportunities for parallelism within one iteration. There are some (potentially wasteful) options discussed here, but they are not implemented.
Yes, chains are definitely parallelizable. You can also do what cmdstanpy and friends used to do and wrap your subprocess calls in something from Python's multiprocessing module.
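A runnable sketch of that idea, supervising several external chain processes at once. A thread pool (rather than multiprocessing) is enough here, since the actual work happens in the child processes; the placeholder commands just print, because the real invocation would be your compiled model (or an mpiexec command), one output file per chain:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_chain(cmd):
    # Each chain is an independent OS process; the thread only waits on it.
    return subprocess.run(cmd).returncode

# Placeholder commands; in practice each would be a full CmdStan (or
# mpiexec) invocation writing to a distinct output CSV per chain.
cmds = [[sys.executable, "-c", f"print('chain {i} done')"] for i in range(2)]

with ThreadPoolExecutor(max_workers=2) as pool:
    return_codes = list(pool.map(run_chain, cmds))
```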