Hello all,
I am trying to launch a cmdstanpy sampling using MPI with reduce_sum and I have a few questions about the correct implementation.
I modified the make/local file and built the binaries, but still have a few questions:
Should I compile the model with STAN_THREADS=true and set os.environ["STAN_NUM_THREADS"] = "n" (as if I were multithreading on a single node), or not?
Should I submit only the sampling process as a cluster MPI job, or can it be the entire process, including model compilation?
For some reason, the progress bar goes away when I run the sampling with MPI. Is there a way to bring it back? (I already pass show_progress=True.)
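For reference, the make/local settings involved would look something like the fragment below. This is a sketch assuming a GCC toolchain; mpicxx and TBB_CXX_TYPE may differ on your cluster, so check your site's modules.

```make
# make/local
STAN_MPI=true          # build MPI support (used by map_rect)
CXX=mpicxx             # compile through the MPI compiler wrapper
TBB_CXX_TYPE=gcc       # underlying compiler family for the TBB build
STAN_THREADS=true      # only needed if you also want threading (reduce_sum)
```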
It can be the entire process, including model compilation, or you can compile the model yourself and pass in the exe file location. Here is an example script to run a single chain per node on a cluster:
# Use CmdStanPy to run one chain
# Required args:
#  - cmdstan_path
#  - model_exe
#  - seed
#  - chain_id
#  - output_dir
# Optional:
#  - data_file
import sys

from cmdstanpy import CmdStanModel, set_cmdstan_path

usage = """\
run_chain.py <cmdstan_path> <model_exe> <seed> <chain_id> <output_dir> (<data_path>)\
"""

def main():
    if len(sys.argv) < 6:  # script name plus the 5 required args
        print("missing arguments")
        print(usage)
        sys.exit(1)
    a_cmdstan_path = sys.argv[1]
    a_model_exe = sys.argv[2]
    a_seed = int(sys.argv[3])
    a_chain_id = int(sys.argv[4])
    a_output_dir = sys.argv[5]
    a_data_file = None
    if len(sys.argv) > 6:
        a_data_file = sys.argv[6]
    set_cmdstan_path(a_cmdstan_path)
    mod = CmdStanModel(exe_file=a_model_exe)
    fit = mod.sample(
        chains=1,
        chain_ids=[a_chain_id],
        seed=a_seed,
        output_dir=a_output_dir,
        data=a_data_file,
    )
    print(fit)

if __name__ == "__main__":
    main()
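To go with the script above, here is a minimal sketch of how a launcher might build one command line per chain (the `chain_command` helper is hypothetical, not part of cmdstanpy; each command could then go to `subprocess.run` or to one task in a SLURM job array):

```python
import sys

def chain_command(cmdstan_path, model_exe, seed, chain_id, output_dir, data_file=None):
    """Build the argv for one run_chain.py invocation (one chain per job/rank)."""
    cmd = [sys.executable, "run_chain.py", cmdstan_path, model_exe,
           str(seed), str(chain_id), output_dir]
    if data_file is not None:
        cmd.append(data_file)
    return cmd

# Example: the argv for chain 1
print(chain_command("/opt/cmdstan", "./model", 1234, 1, "out", "data.json"))
```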
I am so sorry, but now I am even more confused. Based on the thread, I understand that MPI + reduce_sum won't work? Is that really the case?
Thanks! I just tried it with 120 cores, and many of the jobs failed because they all tried to compile the model at the same time. So it is probably a better idea to pass a precompiled model :)
Not sure if this makes sense, but would it be possible to use MPI for parallelization instead of reduce_sum? How would that work? If within-chain parallelization with reduce_sum cannot be used with MPI, is the idea of MPI to run dozens of chains in parallel? Or does MPI only make sense with map_rect?
Thanks! That is true in most cases for reduce_sum, but without MPI, reduce_sum is limited to the number of cores on a single node, which might not be enough. I guess it is back to map_rect for me.
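Since map_rect requires the data packed into equal-sized (rectangular) shards, much of the work is bookkeeping on the Python side before sampling. A minimal sketch of that step, with a hypothetical helper not part of cmdstanpy:

```python
def make_shards(values, n_shards):
    """Split a flat list of observations into equal-sized shards for map_rect.
    map_rect requires rectangular data, so here the length must divide evenly;
    real code would pad the last shard instead of raising."""
    if len(values) % n_shards != 0:
        raise ValueError("pad the data so shards are equal-sized")
    size = len(values) // n_shards
    return [values[i * size:(i + 1) * size] for i in range(n_shards)]

print(make_shards(list(range(6)), 3))  # → [[0, 1], [2, 3], [4, 5]]
```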