Just for the above example (planetary motion of a single body with unknown mass and initial position around some object), a few things come to mind.
For example, one could at first limit the model to only the first few time steps (during warmup?), so as not to waste so much time on regions of the parameter space that are hopeless.
Likewise, I assume we might sacrifice accuracy of the numerical solver to speed up the computations. On the other hand, this will surely impact the performance of the HMC method. However, even for moderate accuracies, the flow of the system should not actually change too much. So there might be some potential there, at least for the warmup phase?
I'm sure other people have thought about this longer and with more background knowledge. Is there maybe a collection of resources or papers that I should know about / look at?
I hope this is the appropriate place for this question!
We had several rounds of discussion on this, and I think the only actionable recommendation for general users is to try and profile.
one could at first limit the model to only the first few time steps (during warmup?), so as not to waste so much time on regions of the parameter space that are hopeless.
I've tried this, and the peril is that the shortened model may not provide enough likelihood information to constrain the posterior, leading to inaccurate warmup or even making warmup take longer.
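In case a concrete skeleton helps, here is a minimal hedged sketch of what the truncation idea could look like in Stan (the flag `T_use` and the toy right-hand side are my invention): a data switch selects how many of the observation times enter the solve and the likelihood, so one could run a cheap pilot fit on a truncated series and use its draws to initialize the full fit. As noted above, the truncated likelihood may not constrain the parameters enough, so treat it as an experiment rather than a recipe.

```stan
functions {
  // placeholder right-hand side; replace with the actual system
  vector rhs(real t, vector y, real theta) {
    vector[2] dydt;
    dydt[1] = y[2];
    dydt[2] = -theta * y[1];
    return dydt;
  }
}
data {
  int<lower=1> T;                  // total number of observation times
  int<lower=1, upper=T> T_use;     // how many of them to use in this run
  array[T] real ts;                // assumed sorted, all > 0
  array[T] vector[2] y_obs;
  vector[2] y0;
}
parameters {
  real<lower=0> theta;
  real<lower=0> sigma;
}
model {
  // solve only up to the T_use-th observation time
  array[T_use] vector[2] mu = ode_rk45(rhs, y0, 0, ts[1:T_use], theta);
  theta ~ lognormal(0, 1);
  sigma ~ normal(0, 1);
  for (t in 1:T_use)
    y_obs[t] ~ normal(mu[t], sigma);
}
```

Running with a small `T_use` and then refitting with `T_use = T` (initialized from the pilot draws) keeps everything in one program, but whether the pilot posterior is informative enough is exactly the concern raised above.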
However, even for moderate accuracies, the flow of the system should not actually change too much.
There is no guarantee of that. Many ODE systems evolve dramatically differently depending on how accurately we solve them.
In general, controlling the ODE solver and the sampler's performance in tandem is not a fully studied problem. The latest manifestation of these uncharted waters is @wds15's work on the adjoint ODE solver, where we simply don't know how the solver controls will affect performance, and the next best solution is to release an experimental version so people can try it.
There is no guarantee of that. Many ODE systems evolve dramatically differently depending on how accurately we solve them.
We have a method with @bbbales2 and @avehtari where we can use a not-so-accurate but fast solver during HMC and then use importance sampling to get draws from the "true" posterior (the one obtained when using an accurate but costly solver). Here is a draft version of the case study:
I'm working on updating this and doing more experiments that also include runtime information. But I have also thought a lot about how we could avoid the bad parameter regions. ODE models (with adaptive solvers) are computationally really different from, for example, regression, because the computational cost of one iteration depends heavily on the parameter values. So you really want to avoid the bad regions (good initializations?) or get out of them quickly. One thing that is not discussed as much as the tolerances is the max_num_steps parameter. I have observed that setting it to a very low value like 1000 can often help, because bad proposals can be rejected quickly.
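For readers who want to see the mechanism, here is a rough hedged sketch of how this could be wired up in a Stan program (my own simplification, not the code from the case study): the model block uses loose tolerances and a small max_num_steps budget, and generated quantities recomputes the log-likelihood with tight tolerances so the per-draw difference can be used as a log importance weight afterwards.

```stan
functions {
  vector rhs(real t, vector y, real theta) {
    vector[2] dydt;
    dydt[1] = y[2];
    dydt[2] = -theta * y[1];
    return dydt;
  }
}
data {
  int<lower=1> T;
  array[T] real ts;
  array[T] vector[2] y_obs;
  vector[2] y0;
}
parameters {
  real<lower=0> theta;
  real<lower=0> sigma;
}
model {
  // cheap, approximate solve during sampling: loose tolerances, small step budget
  array[T] vector[2] mu = ode_rk45_tol(rhs, y0, 0, ts, 1e-3, 1e-3, 1000, theta);
  theta ~ lognormal(0, 1);
  sigma ~ normal(0, 1);
  for (t in 1:T)
    y_obs[t] ~ normal(mu[t], sigma);
}
generated quantities {
  // log importance weight: accurate-minus-approximate log-likelihood per draw
  real log_ratio;
  {
    array[T] vector[2] mu_lo = ode_rk45_tol(rhs, y0, 0, ts, 1e-3, 1e-3, 1000, theta);
    array[T] vector[2] mu_hi = ode_rk45_tol(rhs, y0, 0, ts, 1e-10, 1e-10, 1000000, theta);
    real lp_lo = 0;
    real lp_hi = 0;
    for (t in 1:T) {
      lp_lo += normal_lpdf(y_obs[t] | mu_lo[t], sigma);
      lp_hi += normal_lpdf(y_obs[t] | mu_hi[t], sigma);
    }
    log_ratio = lp_hi - lp_lo;
  }
}
```

Since the priors do not depend on the solver, only the likelihood enters the ratio. The draws can then be reweighted with `log_ratio` (e.g. via Pareto-smoothed importance sampling); whether the correction is reliable shows up in the Pareto-k diagnostics.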
There are two types of bad regions: the kind we don't need to explore and want to get out of ASAP, and the kind we do want to explore but where our integrator is stressed.
@jtimonen Thanks, this looks great! Had a quick glance and will most certainly look further into it.
@yizhang clearly you cannot have guarantees of qualitatively similar behavior for all ODEs, but if the ODE is very badly conditioned, it is also clear that a certain degree of accuracy is necessary. However, this is most certainly not the case for all/most ODEs…
Would you happen to have examples of these badly behaving ODEs?
How to diagnose if your max_num_steps is too low is a more difficult question. I guess that for some models it could bias the inference too, if the high-probability parameter region is where it is difficult to solve the ODE.
A simple idea for avoiding bad neighborhoods would be to start a lot of chains initially, and then prune the ones that have low posterior density and low volume.
Relying on the amount of work the solver needs sounds like a strategy that is bound to blow up in your face.
I think that the classical ODE solvers are not designed with MCMC in mind, but rather for simulating the dynamics with fixed parameters. On the other hand, Stan is probably not designed with ODE models as the primary application. Therefore there could be room for improvement if one tried to create either a solver that is designed to be used with HMC, or a sampling method that is designed for ODE models.
In problems like these: https://core.ac.uk/download/pdf/82745974.pdf I consider it impossible to fit the parameters in any reasonable time if we don't have very strong prior knowledge of them, even if there were a good amount of data with little noise. Some of them just change their dynamics very much even with the tiniest change in parameter values, so the posterior is crazy.
Glad you mentioned this paper. It has brought many classical problems into a modern test setting, and it highlights some of the hardships we're faced with: there's numerical instability when we solve ODEs, but more importantly there are ODEs that are intrinsically unstable. For the latter, we cannot hope that a small perturbation will keep future solutions close. Moreover, such systems are not easily detectable through the usual numerical measures: examples like the Vinograd system have constant negative eigenvalues but still blow up, and the Vinograd system is not even nonlinear.
So maybe if we want to take the path of perturbation, there are going to be some constraints on the applicable problems. Some unstable ODEs can still be checked numerically, but they are the exception rather than the norm (an example that comes to mind is a nonautonomous linear ODE with periodic coefficients, where one can apply Floquet theory should one be able to find the Floquet exponents).
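For readers who have not seen it, the Floquet result being referred to is the standard textbook statement (not specific to this thread):

```latex
% Floquet's theorem: for a linear system with T-periodic coefficients,
\[
  \dot{x}(t) = A(t)\,x(t), \qquad A(t+T) = A(t),
\]
% every fundamental matrix solution can be factored as
\[
  X(t) = P(t)\, e^{B t}, \qquad P(t+T) = P(t),
\]
% with B a constant matrix. The eigenvalues of B are the Floquet exponents.
```

Asymptotic stability is decided by the real parts of the Floquet exponents rather than by the pointwise eigenvalues of A(t), which is exactly why examples like Vinograd's are so slippery for naive numerical checks.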
I don't think the state of affairs is so bad, but here is what I have learned so far:
Avoid ODEs if you can! For example, in PK/PD systems the drug input is given by functions which are themselves analytically solvable, or you can, for convenience, represent the input function as an ODE. Don't enlarge the ODE system unless you have to! (A sketch of this follows below the list.)
Avoid having more parameters than necessary. Sounds simple, but it is important to keep the number of ODEs being solved down (for each parameter you end up solving N more ODEs, where N is the number of states).
Should your solution approach 0, then definitely tune the absolute tolerance to values large enough that the ODE solver does not derail, while still making sense for your problem setting.
The relative tolerance should be increased in a similar way as well.
Don't make the tolerances larger than 10^-4 is what I found in my problems.
A good resource for tips on tuning the tolerances is the CVODES user manual.
Scale the states such that unity is a sensible value (step-size tuning basically starts off with this number as a reference scale).
I found in PK/PD problems that using an adapt_delta of 0.7 (so rather low) gave me improved performance. The net effect of that is larger step sizes, which did help move the sampler around more quickly.
Know your problem well, in the sense that non-linear systems can easily be quite nasty and cause bi-modality. You need to avoid that by using clever parametrisations and good priors. This is the usual advice of having a favorable "geometry" for the problem, but it is even more important in ODE problems.
In the past I myself considered a low max_num_steps a good thing to stop the sampler from going into crazy parameter regimes, but I have changed my mind by now. It's far better to let the sampler get the value of the log-likelihood at these extreme values, which will let the sampler "know" that this is no-man's land.
Using a stiff solver (bdf) for warmup and then switching to a non-stiff solver (RK45) for the sampling phase has been shown to work well for some problems. In fact, you can switch to the non-stiff solver as early as Phase III of the warmup. (A second sketch below the list combines several of these solver controls.)
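To make the first point concrete, here is a hedged sketch (a textbook one-compartment model with first-order oral absorption, not code from this thread): the drug input has a closed-form solution, so no extra ODE state is needed and, in this simple case, no ODE solver at all.

```stan
functions {
  // Analytic solution of the one-compartment oral-absorption model
  //   gut:     a_g' = -ka * a_g,            a_g(0) = D
  //   central: a_c' =  ka * a_g - ke * a_c, a_c(0) = 0
  // Concentration in the central compartment (assumes ka != ke).
  real conc_1cmt(real t, real D, real ka, real ke, real V) {
    return D * ka / (V * (ka - ke)) * (exp(-ke * t) - exp(-ka * t));
  }
}
data {
  int<lower=1> N;
  array[N] real t_obs;
  vector[N] c_obs;
  real<lower=0> D;       // dose given at t = 0
}
parameters {
  real<lower=0> ka;
  real<lower=0> ke;
  real<lower=0> V;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  for (n in 1:N)
    mu[n] = conc_1cmt(t_obs[n], D, ka, ke, V);
  ka ~ lognormal(0, 1);
  ke ~ lognormal(-1, 1);
  V ~ lognormal(1, 1);
  sigma ~ normal(0, 1);
  c_obs ~ normal(mu, sigma);
}
```

If only part of the system has a closed form, the same idea applies: fold the analytic piece into the right-hand side as a forcing term instead of carrying it as an extra state.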
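And here is a hedged sketch of how several of the solver controls mentioned above can be wired together in one Stan program (the data flag and its name are my invention): explicit tolerances and max_num_steps via the *_tol variants, states assumed to be scaled to order one, and a switch so that a warmup run can use the stiff bdf solver while a later run, initialized from it, uses the non-stiff rk45.

```stan
functions {
  vector rhs(real t, vector y, real theta) {
    // states are assumed pre-scaled so that values of order 1 are typical
    vector[2] dydt;
    dydt[1] = y[2];
    dydt[2] = -theta * y[1];
    return dydt;
  }
}
data {
  int<lower=1> T;
  array[T] real ts;
  array[T] vector[2] y_obs;
  vector[2] y0;
  int<lower=0, upper=1> use_bdf;   // 1: stiff bdf (e.g. for a warmup run), 0: rk45
  real<lower=0> rel_tol;           // e.g. 1e-6; not much looser than 1e-4
  real<lower=0> abs_tol;           // matched to how close to 0 the states can get
  int<lower=1> max_num_steps;      // generous budget, e.g. 1000000
}
parameters {
  real<lower=0> theta;
  real<lower=0> sigma;
}
model {
  array[T] vector[2] mu;
  if (use_bdf == 1) {
    mu = ode_bdf_tol(rhs, y0, 0, ts, rel_tol, abs_tol, max_num_steps, theta);
  } else {
    mu = ode_rk45_tol(rhs, y0, 0, ts, rel_tol, abs_tol, max_num_steps, theta);
  }
  theta ~ lognormal(0, 1);
  sigma ~ normal(0, 1);
  for (t in 1:T)
    y_obs[t] ~ normal(mu[t], sigma);
}
```

The adapt_delta setting and any switching between runs live in the interface (CmdStan/cmdstanr/cmdstanpy) rather than in the model itself.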
That's just a collection of my personal experience with these problems.
If you are aiming to solve large problems with many states and/or parameters, then you should be looking at the adjoint ODE solving method, for which I am about to share an experimental version, but that will still take some time.