Switching ODE integrators between the warmup and the sampling phase

A few weeks ago, @wds15 proposed switching ODE integrators between the warmup and the sampling phase. The idea is the following: use the bdf solver during warmup, when the chain moves into more “extreme” regions of the parameter space; but once we’re in the typical set, use a non-stiff solver.

ODE models are delicate, to say the least, so I struggled a bit to set up a good computer experiment. But here’s a pharmacokinetic example, with code and everything, where the method works quite nicely.

ode-experiment2.pdf (218.1 KB)
Michaelis_MentenPK.stan (1.6 KB)
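For readers who just want the shape of the trick without opening the attachments, here is a minimal hypothetical sketch of one possible way to wire up the switch (this is not the attached Michaelis-Menten model, and names like pk_ode, y_obs, use_bdf, k and sigma are placeholders): expose the solver choice as a data flag, run warmup with the flag set to the bdf solver, then launch a sampling-only run with the flag set to rk45.

  functions {
    real[] pk_ode(real t, real[] y, real[] theta, real[] x_r, int[] x_i) {
      // placeholder one-compartment elimination model, not the attached
      // Michaelis-Menten system
      real dydt[1];
      dydt[1] = -theta[1] * y[1];
      return dydt;
    }
  }
  data {
    int<lower=1> N;
    real ts[N];                     // observation times (increasing, > 0)
    real y_obs[N];                  // observed concentrations
    real y0[1];                     // initial state
    int<lower=0, upper=1> use_bdf;  // 1 for the warmup run, 0 for the sampling run
  }
  parameters {
    real<lower=0> k;                // elimination rate
    real<lower=0> sigma;            // measurement noise
  }
  transformed parameters {
    real y_hat[N, 1];
    if (use_bdf == 1)
      y_hat = integrate_ode_bdf(pk_ode, y0, 0, ts, {k},
                                rep_array(0.0, 0), rep_array(0, 0));
    else
      y_hat = integrate_ode_rk45(pk_ode, y0, 0, ts, {k},
                                 rep_array(0.0, 0), rep_array(0, 0));
  }
  model {
    k ~ lognormal(0, 1);
    sigma ~ normal(0, 1);
    for (n in 1:N)
      y_obs[n] ~ normal(y_hat[n, 1], sigma);
  }

The second run then has to be started with adaptation switched off and with the step size, inverse metric, and last draw from the warmup run passed back in; how (and whether) that hand-off works depends on the interface you use.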

Note that if you make the priors tighter, you can prevent the chains from wandering into pathological regions of the parameter space and the non-stiff solver works just fine. On the other hand, loosen the priors too much, and even the stiff solver fails. This might be worth exploring further.

There may be better examples to learn about switching solvers.

I also played with a Friberg-Karlsson model. Here the system is stiff, and you need to keep using the bdf solver during the sampling phase (at least in the prior / data setup I used).


This looks super cool! As you say, it depends a lot on details, but the example you show is very relevant for pk models.

Thanks for working it out. Very nice.

Yeah, I really like the perspective of ODEs across a range of parameter values, as opposed to reasoning about the system only for a single parameter value (which we implicitly do when we say a system is stiff or non-stiff). It’d be good to try out more examples.

Tagging @betanalpha, @billg and @yizhang who may be interested in this experiment.

Without running a model, what evidence does a user have for choosing BDF + RK45 over plain BDF or plain RK45?

Yes! Stiffness is defined only for a particular model configuration and so a given model can be both stiff and non-stiff depending on the particular model configuration considered. Consequently a model is “fully non-stiff” only if it is non-stiff for the model configurations encountered during warmup and those in the typical set of the posterior distribution. In reality models might be non-stiff in the typical set but stiff in the tails encountered during warmup.

This “warmup stiffness” was a major hurdle back when we only had the RK45 solver implemented. In those circumstances, tweaking the initial HMC configuration helped ensure more stable behavior during warmup that didn’t stress the ODE solver (in particular, setting the initial step size to small values was critical to avoid nearly infinite loops). It’s also one of the reasons why I started to see how terrible the common recommendation of Cauchy prior densities was in practice – for more see Section 3.1.2 of https://betanalpha.github.io/assets/case_studies/weakly_informative_shapes.html.

Understanding the relative stiffness during warmup and equilibrium is really important for the performance and robustness of ODE models, and more experiments to identify useful strategies for algorithm tuning and robust prior modeling are welcome!


A while ago I tried to get around the warmup stiffness by using RK45 but limiting the maximum number of steps the integrator may take between time points. This did lead to the desired behavior of rejecting extreme parameter draws quickly… but it massively frustrated the sampler, since it is a lot better to actually solve the system and return a log-likelihood value; that information is important for the sampler. So I am happy to see that this solver switching works.
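For reference, the cap is the last argument of the solver call. A hypothetical fragment, with made-up tolerances and a placeholder RHS function ode_rhs:

  // cap RK45 at 100 steps between output times; exceeding the cap makes the
  // solver throw, the log density evaluation fails, and the proposal is
  // rejected: cheap, but it hides the log-lik and gradient information from the sampler
  real y_hat[N, 2] = integrate_ode_rk45(ode_rhs, y0, 0, ts, theta,
                                        rep_array(0.0, 0), rep_array(0, 0),
                                        1e-6, 1e-6, 100);  // rel_tol, abs_tol, max_num_steps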

@betanalpha Do you have any suggestions on whether we can push the use of the non-stiff solver further into the warmup phase? I mean, the first phase is there to find the typical set, and during this phase a stiff solver is really needed. Then we have windows of doubling size where we adapt the global tuning parameters. Would you consider it feasible to run, say, the last of these windows with a non-stiff solver already? This is a long part of warmup, so if the sampler’s exploration no longer pushes things out into the tails of the distribution, this could drastically speed things up even further.

@yizhang What would you suggest calculating as a measure of stiffness? The condition number of a linearized system? I could imagine that simulations from the prior (or from the prior plus a subset of the data to be modeled) would give useful information about how applicable this approach is… and in practice one could simply try it out on a first base model. Should it work fine, one can keep using it, since the iterative modeling approach we use in practice should not change things too drastically (and even if things do change, execution times will go up and the sampler diagnostics will tell their story).

What about a scalar ODE like y' = -30y on t = [0, 10]? What can we say about stiffness from a scalar ODE’s condition number?
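To make the point concrete with the classic textbook calculation: a scalar equation has a single eigenvalue, so any eigenvalue-ratio or condition-number style measure of stiffness is trivially 1. Yet an explicit method still has to respect its stability bound; for forward Euler applied to y' = -30y,

    |1 + h * (-30)| < 1   =>   h < 2/30 ≈ 0.067,

so it takes on the order of 150 steps to cover [0, 10], even though the solution exp(-30t) has dropped below 1e-13 by t = 1 and accuracy alone would allow much larger steps after the initial transient.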

Hamiltonian Monte Carlo finds the typical set very, very quickly (usually within the first hundred warmup iterations if not sooner) but only when supplied with accurate gradients. Poor approximations are cheap but they prevent the sampler from enjoying this rapid convergence and hence do more harm than good.

To avoid stiff model configurations one has to be careful about the initialization of the Markov chains and the Markov transition itself. Initializing closer to the typical set, for example, will avoid problems when the typical set is nestled within a neighborhood of non-stiff model configurations. When a good initialization can’t be found, one needs to robustify the computation.

Again, accurate ODE solvers are as critical early in initialization as later in equilibrium, if not more so. Another step that can help is running more precise Hamiltonian trajectories with smaller deviations from the exact energy level set – when the error is large enough, those deviations can bring the numerical trajectories into stiff regions even if the sampler started in a non-stiff region and there are paths from the initialization to the typical set that are entirely non-stiff. This is why reducing the initial symplectic integrator step size can be helpful.

If the adaptation windows are set correctly, and if the typical set is entirely contained within a non-stiff region, then yes, the stiff solver will be important only in the initial warmup phase. The goal of the warmup procedure is to be in the typical set once the doubling windows start, and once we’re in the typical set we can adopt computation appropriate for the typical set.

That said, the warmup procedure is a heuristic – it has proven to work well over a relatively large class of models, but there are zero guarantees that the Markov chains will always have reached the typical set once the second phase begins. Consequently, assuming that a non-stiff solver will be sufficient there can be dangerous; indeed, switching to a non-stiff solver prematurely can mess up warmup altogether.

Really the only way to know whether a non-stiff solver can be used earlier is to run with a stiff solver first and then see when the system appears to loosen up. But then you’ve already spent the computation by the time you know whether you can use the cheaper method! This is the paradox of heuristic-empirical methods.


This came up in the Stan meeting, and it reminded me of the conversation @wds15, @yizhang and I had over here: Adams Moulton ODE solver in Stan useful?

We initially had two ODE solvers (RK45 and BDF), and the question of which to choose was posed to the modeler as ‘is your ODE stiff or non-stiff?’

In that thread we got a third solver (Adams), which is a non-stiff solver, but when @wds15 and I tried to use the stiff/non-stiff rule and apply Adams to problems that RK45 worked fine on, things went awry.

Anyway, we need to pick our solver somehow, and the thing I’d like to add is that however we end up deciding between solvers (or configuring the solvers in general), the solvers in Stan are rather specific to Stan. Like, someone can develop their ODE outside of Stan (and probably should, at some level), but the things that are true for solving this model in an R library might not translate.

I think the timing stuff @rok_cesnovar has been proposing for the language will be really useful here.

And this definitely seems like some sort of workflow thing to me. Even after the Adams/RK45/BDF discussions we had in the other thread, I still don’t know what checklist I would go through to pick one of them (other than using them all, seeing which is fastest, and computing reference solutions with super tight tolerances to compare against, which is all somewhat annoying).


Relevant to this thread: the “stiff vs non-stiff” dichotomy can be misleading. Since @bbbales2 mentioned the AM vs RK45 thread, let me show what I mean using the model discussed there, a.k.a. the people’s choice award winner, the Lotka-Volterra model. (@bbbales2 initially found that the AM method is very slow, and I pointed out a possible cause and remedy here, so we’ll use the revised tolerances.)

Recap: the ODE system

  real[] dz_dt(real t, real[] z, real[] theta, real[] x_r, int[] x_i) {
    real u = z[1];          // prey population
    real v = z[2];          // predator population
    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];
    real du_dt;
    real dv_dt;
    du_dt = (alpha - beta * v) * u;
    dv_dt = (-gamma + delta * u) * v;
    return { du_dt, dv_dt };
  }

and solver call

  real z[N, 2] = integrate_ode_adams(dz_dt, z_init, 0, ts, theta,
                                     rep_array(0.0, 0), rep_array(0, 0),
                                     1e-5, 1e-5, 1000);
  // (or integrate_ode_rk45 with the same arguments)

Both @bbbales2 and I are seeing RK45 run 3-4 times faster than AM. On my machine it’s 15 sec (RK45) vs 45 sec (AM). And we all agree that this equation is non-stiff.

Now, let’s add a dummy loop inside the ODE system:

  real[] dz_dt(real t, real[] z, real[] theta, real[] x_r, int[] x_i) {
    real u = z[1];
    real v = z[2];
    real alpha = theta[1];
    real beta = theta[2];
    real gamma = theta[3];
    real delta = theta[4];
    real du_dt;
    real dv_dt;
    // dummy loop: repeats the same computation 100 times, making each RHS
    // evaluation roughly 100x more expensive without changing the result
    for (i in 1:100) {
      du_dt = (alpha - beta * v) * u;
      dv_dt = (-gamma + delta * u) * v;
    }
    return { du_dt, dv_dt };
  }

This doesn’t change the output or the nature of the system; it’s still nicely non-stiff. But when run on my machine, RK45 took

 Elapsed Time: 145.035 seconds (Warm-up)
               116.302 seconds (Sampling)
               261.336 seconds (Total)

and AM took

 Elapsed Time: 88.5801 seconds (Warm-up)
               100.434 seconds (Sampling)
               189.014 seconds (Total)

The water is muddier than “stiff vs non-stiff”.


Hahahaha, it is the default for me

But that’s expected, isn’t it? AM is great at approximating the function being integrated, while RK45 doesn’t do that and instead calls the ODE RHS a gazillion times.

BTW, we have found a way to make AM robust even with the lower tolerances: we now include the sensitivities in the error checks more often in CVODES. This pushes the step size down more quickly, and as a result AM runs stably.

So it looks to me like AM should be better whenever the problem is non-stiff and evaluating the ODE RHS is super expensive… like PK/PD systems where we have to handle a complicated dosing input compartment. Interesting, I never thought about it like this.
