Adjoint ODE Prototype - RFC - Please Test

Updated now; the above link that @wds15 posted should now work correctly. Once again, so you don't need to scroll: https://github.com/rok-cesnovar/cmdstan/releases/download/adjoint_ODE_v2/cmdstan-ode-adjoint-v2.tar.gz

Sorry for the trouble; I should have double-checked…

That appears to work.

… getting back to your evaluation… there is too much going on in the plot for me to follow. Can you summarize a bit? I am not so much interested in the best warmup strategies, but in how much faster adjoint is in terms of Neff/time.

(improving warmup is always great, just not the target with adjoint ode’s for now)

Sure sorry. What you see here is basically the following:

  • Looser backward tolerances for the adjoint method (columns 1-3) may lead to better walltimes (row 1) and Neff/s (row 9), or may lead to worse walltimes with stuck chains and worse Neff/s (ibid.)

  • For this problem (non-stiff, low number of dimensions) rk45 is the clear winner, being roughly twice as efficient (Neff/s) as the bdf method, which in turn is roughly twice as efficient as the best adjoint configuration. It would be nice to have an adjoint rk45 method to compare against, but CVODES does not provide one…

  • All methods/configurations appear to converge (eventually) towards the same posterior (which hopefully is the correct one), predicting the same trajectories (last four rows, green).

  • Most of the other plots are just diagnostics for me; sorry, I didn't bother to remove them.

Note that the priors are quite wide, leaving the problem just beyond the edge of what the regular warmup can (reliably) handle, leading to a lot of chains timing out.

Thanks. The Adams solver is the non-stiff solver from CVODES. I have seen that you need relatively tight tolerances to keep Adams from derailing. The benefit of Adams is that it can go up to high orders of approximation, but at the risk of totally derailing if your tolerances are too loose. That said, if things turn out well for your problem, Adams can be a lot faster. I have seen that using Adams for the backward sweep is good in some cases.

Hmm, I only ever tried adjoint-Adams at the very beginning, and I think it was roughly equivalent to bdf. But these were very non-exhaustive tests, as my computational resources are limited.

I’m still not sure what the appropriate performance metric should be.
Sure, ideally we want the highest Neff/s and convergence, but for what problems? For simple problems (tight priors) any method/configuration appears to work reasonably well. Then we could look at only the scaling with the number of states/parameters.

But, problems with tight priors are kind of uninteresting, aren’t they? If we already know the true value up to .1/1/10 %, then what’s the point? However, if the priors are not tight, then we may potentially get large runtime variations and stuck chains, and then how do we compare the performance?

As an example, for this (slightly modified) SIRP model (P=pathogen) with very tight priors (.1%), rk45 still performs best, but with bdf and the best (coarsest) adjoint configuration being almost equal.

This is with just one compartment. Black is regular warmup (what interests you), green is the incremental warmup (better, but you may ignore it). As we widen the priors (not run yet) we would expect larger and larger variation in the runtimes for the regular warmup. Is this then still informative?

Hmm… What I had in mind is that people have a working model which they know well, but which takes long to compute because of an ODE with many parameters that must be solved. Swapping out the current rk45/bdf/adams integrator for the adjoint method hopefully gives them the same posterior in less time (so with a higher Neff/time).

Is the new v2 version any faster for you?

Designing artificial toy experiments is always hard, I agree. Very tight priors are probably not too interesting to explore, indeed.

BTW, it is really interesting to see that your “new” warmup consistently ends up with larger step sizes. One way I found a while ago to get larger step sizes is to target lower acceptance probabilities of just 0.7 or so. As long as divergent transitions stay away, this makes the sampler take larger steps and run faster in the examples I tried.

I’m actually not sure. It’s not “blows me out of the water” type faster, but I also haven’t run a direct comparison, mainly because I was too lazy and the ode_adjoint_tol_ctl signature changed slightly (for the better), and I did not want to keep two files of my models. I might run a comparison later.

Yes, that’s something that I also noticed, and I think it should be good. However, the effect appears to be less pronounced when the prior does not dominate the posterior, though it is still there.

I’ve noticed that the step size gets set lower if the spectrum of the metric is not approximated well, which appears to make sense at first glance: a worse approximation of the metric leads to more work being necessary.

Don’t worry about this comparison. It’s not worth spending time on extra benchmarks. I was hoping that you had your runtimes ready for comparison. I think I saw a nice improvement.

I did save them, but there have been some model iterations…

Edit:
Hm, alright, so I’m not sure whether this test actually does what I want it to do. Using cmdstanpy, I use different cmdstan paths to initialize and thus compile the models. The .hpp files are different, and I believe they get compiled at initialization, which is where all the changes should be, right @wds15? @mitzimorris @ahartikainen Can I just do this?
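In code, what I’m doing is roughly this (the install paths are placeholder assumptions, and the copy-to-a-versioned-name trick is just my workaround for cmdstanpy caching the executable next to the .stan file):

```python
"""Sketch: compile one Stan model against two different CmdStan installs
with cmdstanpy. Paths and the versioned-copy trick are my assumptions,
not an official recipe."""
import os
import shutil

CMDSTAN_OLD = os.path.expanduser("~/cmdstan-ode-adjoint-v1")  # hypothetical path
CMDSTAN_NEW = os.path.expanduser("~/cmdstan-ode-adjoint-v2")  # hypothetical path

def compile_against(cmdstan_path, stan_file, tag):
    """Compile `stan_file` using the CmdStan install at `cmdstan_path`.

    cmdstanpy places the executable next to the .stan file, so copying the
    model to a per-version name forces a fresh .hpp and binary.
    """
    # Import here so the sketch stays importable without cmdstanpy installed.
    from cmdstanpy import set_cmdstan_path, CmdStanModel
    set_cmdstan_path(cmdstan_path)                    # select the install
    versioned = stan_file.replace(".stan", f"_{tag}.stan")
    shutil.copy(stan_file, versioned)                 # distinct file, distinct binary
    return CmdStanModel(stan_file=versioned)          # compiles on construction

# e.g. model_old = compile_against(CMDSTAN_OLD, "sir.stan", "v1")
#      model_new = compile_against(CMDSTAN_NEW, "sir.stan", "v2")
```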

Anyhow, assuming this works, I do not see any differences in runtime for a SIR model with one compartment. So maybe it did not work as intended.

Left columns should use the new version, right columns the old one, using my favorite warmup procedure. FYI, the regular warmup fails to complete for any chain within 200 seconds, while the incremental warmup lets me warm up + sample 1k draws within ~40 seconds. Do disregard this; performance benefits should show for either warmup method.

Also, I’ll upload a figure where the regular method works in a bit.

Edit 2:

Here we have the same problem with much tighter priors (.1% vs 10%). Nothing much changes, except for the regular warmup not timing out.

When I make changes to the math source, I wipe things out with a make clean-all for cmdstan when I want to be sure. It should work automatically, but I don’t trust the automation…

(and really don’t mind doing comparisons with the old, but make sure you use the new, of course)

You wanted a comparison with the old, or did I misunderstand?

I posted a version 2 of the adjoint ODE in the hope that it’s faster than v1… so yes, a speed comparison between the versions, to verify from someone else that the changes were for the good.

But as I said, never mind on that. Don’t make it an effort or go into depth with that.

Thanks for implementing this. When thinking about how to select different defaults for the tolerances and other arguments, and what to expose to users, I think the following things should be taken into account (I would also like to know the answers to these):

  1. Which arguments affect the accuracy of the solution y_hat? If the solution is wrong, all probabilities depending on it will be wrong and the user will be sampling from a biased posterior without getting any warnings about it. And you cannot diagnose this with ESS, Rhat, or any of the MCMC diagnostics in Stan.

  2. Which arguments affect the accuracy of the sensitivities of the solution? The accuracy of these affects how HMC/NUTS works, but you are still sampling from the correct posterior if y_hat is correct (or you can check that you are by diagnosing it with Stan).

  3. How do the other arguments affect the behaviour of the sampler? Especially, what happens if max_num_steps is reached? And not just in terms of what stan_math reports but what does the sampler do if it occurs? Importantly, can you accidentally sample from the wrong posterior by setting this too low?

The speed comparisons are only relevant if you are sampling from the same posterior (i.e., y_hat doesn’t change and you aren’t allowed to skip relevant parameter regions).
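To illustrate point 3 with a toy sketch (my own mock-up, not Stan’s actual rejection logic): if the “solver” systematically fails in one parameter region and those proposals are rejected, the chain silently samples a truncated posterior:

```python
import math
import random

def toy_metropolis(logp, solver_ok, n_draws, x0=0.0, step=0.5, seed=1):
    """Toy 1-D Metropolis sampler: any proposal where solver_ok(x) is False
    is rejected outright, mimicking an ODE solve that hits max_num_steps.
    If failures cluster in one region, that region is never visited, so the
    chain samples a truncated (wrong) posterior without any warning."""
    rng = random.Random(seed)
    x, draws = x0, []
    for _ in range(n_draws):
        prop = x + rng.gauss(0.0, step)
        if solver_ok(prop) and math.log(rng.random()) < logp(prop) - logp(x):
            x = prop
        draws.append(x)
    return draws

# Standard-normal target, but the "solver" always fails for x >= 1:
draws = toy_metropolis(lambda x: -0.5 * x * x, lambda x: x < 1.0, 5000)
# The chain can never reach x >= 1, even though the true posterior has
# roughly 16% of its mass there.
```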

So what do you have to report in terms of results?

The design doc should answer your questions mostly, see Adjoint ode design by wds15 · Pull Request #37 · stan-dev/design-docs · GitHub

In brevity:

  • the forward pass determines y_hat
  • the backward pass calculates the partials of (essentially) the log-lik w.r.t. the states y
  • the backward quadrature problem calculates the partials of the log-lik w.r.t. the parameters

What is certainly special about the adjoint ODE method is that each of these components now has its own “numerical life,” in the sense that we use different techniques to obtain them, with potentially differing tolerances.
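A back-of-the-envelope sketch of the sizes involved (my own illustration, assuming the usual coupled-system formulation of forward sensitivities):

```python
def forward_system_size(n_states, n_params):
    """Forward sensitivity method: the N states plus one sensitivity state
    per (state, parameter) pair, integrated as one coupled system with a
    single set of tolerances."""
    return n_states * (1 + n_params)

def adjoint_solve_sizes(n_states, n_params):
    """Adjoint method: an N-state forward solve, an N-state backward solve,
    and P backward quadratures, each with its own tolerances."""
    return {"forward": n_states, "backward": n_states, "quadrature": n_params}

# e.g. 5 states, 100 parameters: forward integrates 505 coupled equations,
# while adjoint does solves of size 5, 5, and 100.
```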

I like that document. The right thing to report in speed comparisons is probably ESS/s, but to create informative speed comparisons between the forward and adjoint methods, I would create only test cases where both methods

  • use the same solver in forward pass
  • with same tolerances for forward pass
  • have same max_num_steps for forward pass

because then we know that they compute the same y_hat. One can then play with the other parameters to find out good values for them. This might be what you were already doing, I just wasn’t sure what affects y_hat.
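(To pin down what I mean by ESS/s, here is a trivial helper; the main point is to make explicit whether warmup time is counted:)

```python
def ess_per_second(ess, warmup_s, sampling_s, include_warmup=True):
    """Effective sample size per second of wall time. Counting warmup
    reflects the total cost of obtaining the draws; excluding it reflects
    steady-state sampling efficiency."""
    total = warmup_s + sampling_s if include_warmup else sampling_s
    return ess / total

# ess_per_second(1000, 60.0, 40.0)        -> 10.0
# ess_per_second(1000, 60.0, 40.0, False) -> 25.0
```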

Yea, I would see this as an advantage, as you can play with the numerics of the backward solve without messing up y_hat, and with forward method you need to have the same control parameters for the whole extended system.

Hm, generally we cannot expect to sample from the “same” posterior if we are using numerical integrators, can we?

I guess we would like to have some scalable and difficult reference problem with known (analytical) posterior. Do we have such a thing? The harmonic or damped oscillator might be just a little bit too easy.
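For what it’s worth, the (undamped) harmonic oscillator at least has an exact solution to check the integrators against; a minimal sketch:

```python
import math

def harmonic_oscillator(t, omega, y0, v0):
    """Exact solution y(t) of y'' = -omega^2 * y with y(0) = y0, y'(0) = v0.
    Any numerical integrator's output can be compared against this."""
    return y0 * math.cos(omega * t) + (v0 / omega) * math.sin(omega * t)

# With omega = 2, y0 = 1, v0 = 0: y(0) = 1 and y(pi/2) = cos(pi) = -1.
```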

@wds15 explained the parameters, so I won’t.

  • Generally, throughout the test cases, all methods agreed with each other (in the eye-norm). If there were a large bias due to too-loose tolerances, it should show up for settings with very tight priors. We did not observe this. Of course, the appropriate tolerances are problem specific, and maybe we should advise users to always do a preliminary exploration of the parameter space as you did in your case study. For this, it would be very convenient if we could do all of this using a single Stan file.

  • I believe max_num_steps was set so high (1e5) that it was never reached

  • We only have a BDF/Adams adjoint method, but “generally” the rk45 forward method performs much better than the BDF/Adams forward method, which is why this is usually the method I compare against. However, even with equal “forward” tolerances, the solution/step size adaptation will not be the same for the forward and adjoint BDF/Adams method, as the sensitivity errors factor into the step size adaptation (if I’m not mistaken).

Edit:

  • I don’t think reaching max_num_steps is ever a good idea.
  • Problems where the prior dominates the posterior should not be (majorly) affected by inaccuracies in the gradient computation, while problems with wide priors appear to need special handling to ensure convergence in reasonable time.

Adams performs a lot better when dealing only with the N ODE states and not the entire forward problem at once. So I don’t think I would do it like that.

… but do you have some results to report? That would be great.

While I share the desire to “get the same posterior”, and the things you say are correct, I think that we must simplify things in order to make some progress. Looking at how the tolerances throw HMC off the rails is extremely difficult.

I would set things up so that HMC will (very likely) stay on its tracks and then vary some of the tuning parameters. That is already enough information and already enough complexity to deal with. I would avoid pushing it too far.

The art here is to do the “right” simplifications, but still get some useful information from the experiments.

Also: the comparison to forward is by now not too useful, I think. We already have clear examples where adjoint blows away forward. What is interesting, though, is whether the simplified interface ode_adjoint_tol already shows gains over forward whenever adjoint should be faster. Winning over forward should not depend on the exact tuning parameters of the more complex function (that would be my hope).

If for some system now adjoint is better, then we would like to know which tuning knobs matter (and which don’t) - and which tuning knobs we forgot.

My point was that if two methods use the exact same numerical procedure to compute y_hat, then they are sampling the same posterior.

But I realize now that you are correct here: y_hat won’t be exactly the same even when using the same method and the same tolerances. So it is hard to do fair comparisons.

If that is the case, wouldn’t it make sense to set max_num_steps to something like 1e30 or infinity?