I am using CmdStan to solve a simple SIR model. To solve the system though I am using an external C++ code that uses the PETSc library. I would like to take advantage of the parallel library. The idea I would like to implement is the following:
run the Stan code with 1 proc
when the system needs to be solved, multiple procs are called and the solution is computed
back to Stan code with 1 proc
I think I am very close to the final solution but I am not sure I correctly understood what happen with Stan when I run, for instance, mpiexec -n 5. I expect to have as many copies as n says.
Is it possible that during the conversion, from 1 to multiple processors and viceversa, some values don’t go where expected?
Does anyone know if there are some important problems I should take care of when running multiple copies of Stan? Is anyone familiar with that?
If you need more information, please let me know.
Thank you in advance.
From your description I cannot comment on what’s happening, and whether there are any errors or inefficiencies with what you are doing, but if you are using ODEs in principle there is not much to parallelize in a in-practice markovian system like the SIR. I am guessing the step size optimization could be parallelized, not sure if it would be very efficient. There has been some discussion here about the possibility of using Discrete Time Solver instead of ODEs to get rid of the step size optimization cost, but there are caveats and to me it’s not clear that the benefits would be considerable.
I’m not familiar with the PETSc library, maybe others here are, but providing additional details on what you expect to accomplish with your scheme and how any tools external to Stan are doing it it would make it more understandable to most people here.
Let me quickly tell you what I know. I think the linked “discrete time solver” is just the explicit Euler method with a fixed step size of one. This or similar approaches (anything not using the actual ODE solvers) will introduce the following benefits:
Exact gradient computation (relative to the potentially very inexact approximation),
as a consequence the ability to arbitrarily coarsely approximate the ODE without derailing the HMC sampler (this is not possible with the default ODE solvers),
as a consequence potentially very quick qualitative prototyping.
It’s drawbacks are that
you have to control the quality of your approximation yourself (outside of Stan) by tuning some fixed stepsize,
it’s generally more work to do well,
it might be much slower than the built-in ODE solvers for the same precision.
It can be very efficient, but you have to monitor the convergence of your approximation!
I’d add what I probably said already in some of those discussions, which is that some of these drawbacks could be addressed by using a particle filter-like approach (UKF would be one of the more general approaches) that would correct the lack of precision in computing the new state in time – i.e. instead doing internal error correction for ODEs there would be external error correction which is the basis of particle filter approaches.
Performancewise one or the other may be better suited, but in either case parallelization is likely not something that will help considerably (but there may be particular cases where it might, so we should also wait and see what you have in mind).
Thank you for both your answers.
In principle I just want to combine these two libraries with a parallel code from the PETSc side. The SIR model is just an easy example I picked. I will try to give some more details.
My expectations regarding this experiment are the following:
Stan picks the parameters of the model
pass the parameters to PETSc - whose solves the ODEs system
PETSc gives the solution during the entire time span
Stan computes new parameters.
I developed a sequential version that works. Now I would like to extend the PETSc part to a parallel version, thus the implementation should make use of multiple processes only during the PETSc section. What I obtain when I run the code is only Iteration: 1 / 2000 [ 0%] (Warmup)
and nothing more. Even though I leave the code running for hours, the warmup phase does not go any further.
As the sequential version of these codes works, I believe the problem is related with the continuous switch between 1 to multiple processors. I tested the conversion functions multiple times and they actually work, so I was wandering if the problem might be related with the way Stan interacts with multiple processors.
If you need more specific information or more details in some parts, please let me know.
In that case, the Stan part is sequential on each proc. As I need to solve the ODEs system only once (because the 5 chains are identical), I use the parameters from the 0 proc coming from Stan to build a new parallel vector compatible with PETSc. Next, I use all 5 procs to solve the system in PETSc. I copy the solution on each proc right after the solution is computed, so every chain has its own solution. And finally the code goes back to Stan and computes new parameter values. How does it sound to you?
The problem of this implementation is the one I reported above (stuck at the warmup).
The work is indeed based on what he already did. It is very similar to what it is already available on his GitHub but, in this case, the PETSc part runs in parallel.