I am trying to run cmdstanpy sampling using MPI together with reduce_sum, and I have a few questions about the correct setup.
I modified the make/local file and rebuilt the binaries, but the following points are still unclear to me:
Should I compile the model with STAN_THREADS: TRUE and set os.environ["STAN_NUM_THREADS"] = "n" (as if I were multithreading on a single node), or not?
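To make the question concrete, here is a minimal sketch of the single-node multithreaded setup I have in mind. The helper names (threading_settings, run_sampling) and the stan_file/data arguments are hypothetical placeholders, not my actual model. It assumes the cmdstanpy interface where cpp_options is passed to CmdStanModel and threads_per_chain is passed to sample(); whether this is also the right setup under MPI is exactly what I am unsure about.

```python
import os


def threading_settings(threads_per_chain: int, chains: int = 4) -> dict:
    """Collect the compile-time and run-time settings for within-chain threading.

    Hypothetical helper for illustration only.
    """
    return {
        # Passed to CmdStanModel(..., cpp_options=...) when compiling.
        "cpp_options": {"STAN_THREADS": True},
        # Passed to CmdStanModel.sample(...) when sampling.
        "sample_kwargs": {
            "chains": chains,
            "parallel_chains": chains,
            "threads_per_chain": threads_per_chain,
        },
    }


def run_sampling(stan_file: str, data: dict, threads_per_chain: int):
    """Compile with threading enabled and sample (hypothetical wrapper)."""
    # Imported here so the helper above works without cmdstanpy installed.
    from cmdstanpy import CmdStanModel

    settings = threading_settings(threads_per_chain)
    # This mirrors the environment variable I currently set by hand.
    os.environ["STAN_NUM_THREADS"] = str(threads_per_chain)
    model = CmdStanModel(stan_file=stan_file, cpp_options=settings["cpp_options"])
    return model.sample(data=data, show_progress=True, **settings["sample_kwargs"])
```

For example, run_sampling("model.stan", data, threads_per_chain=8) would be my single-node call, with "model.stan" standing in for the real file.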
Should I submit only the sampling step as a cluster MPI job, or can the job cover the entire process, including model compilation?
For some reason, the progress bar disappears when I run the sampling with MPI. Is there a way to bring it back? (I already pass show_progress=True.)
I am not sure this makes sense, but is it possible to use MPI for parallelization instead of reduce_sum? How would that work? If within-chain parallelization with reduce_sum cannot be combined with MPI, is the idea of MPI to run dozens of chains in parallel? Or does MPI only make sense with map_rect?