Adjoint ODE Prototype - RFC - Please Test

Hi all ODE Stan modelers!

We are currently in the process of adding a new adjoint ODE solver to Stan. The benefit of this approach is its much better scalability, which makes it well suited for large ODE problems, in particular problems with many parameters and a modest number of ODE states (there can still be plenty of states, no worries).
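For intuition on why the adjoint approach scales better with the number of parameters, here is a back-of-envelope count of integrated states (a sketch only; the helper functions are hypothetical, and real costs also depend on checkpointing and solver overhead): forward sensitivity analysis integrates the N ODE states plus N sensitivity states per parameter, while the adjoint method integrates the N states forward and then N adjoint states plus M quadratures backward.

```python
def forward_sensitivity_states(n_states: int, n_params: int) -> int:
    """States integrated by forward sensitivity analysis:
    the ODE states plus one sensitivity state per (state, parameter) pair."""
    return n_states * (1 + n_params)


def adjoint_states(n_states: int, n_params: int) -> int:
    """States integrated by the adjoint method across both sweeps:
    the forward solve (N) plus the backward solve (N adjoints + M quadratures)."""
    return n_states + (n_states + n_params)


# The gap widens as the parameter count grows while the state count stays modest.
for n, m in [(2, 4), (10, 100), (10, 1000)]:
    print(n, m, forward_sensitivity_states(n, m), adjoint_states(n, m))
```

With N=10 states and M=1000 parameters, the forward approach integrates 10010 states versus 1020 for the adjoint method, which is the scalability referred to above.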

The crux is that there is no free lunch: the method is rather complex from a numerical perspective. As a result, one has to deal with a lot of control parameters for the integration process, and we are struggling a bit with these. That is, we are not entirely sure which control parameters to expose to users and which ones we do not need to expose. Therefore, we would appreciate it if Stan users dealing with large ODE problems would try out this new feature and give us feedback on it.

Please head to the wiki page “Adjoint ODE Prototype”, which should get you started. It’s as easy as installing an experimental version of cmdstan! To emphasize: this is an experimental cmdstan version and not an official Stan release.

The experimental evaluation phase is planned to last until Easter, April 4th. With the feedback collected, we hope to wrap up coding of the feature during April so that we can include the new solver in the next Stan version in mid May.

Please let me know if there is anything unclear here or on the wiki instructions page.


Thanks to @bbbales2 , @rok_cesnovar , @charlesm93 , @yizhang , @betanalpha , @tadej , @stevebronder and @Funko_Unko for helping out on this so far!!!


I’ve been trying the adjoint method out on some toy examples, and at least for them I have made the following (tentative) observations:

  • The overhead for the checkpointing etc. is noticeable, but the real walltime bottleneck appears to be the backward solve to high accuracy. Looser tolerances for the backward solve lead to significantly lower walltimes while somewhat reducing ESS. Overly loose backward tolerances, however, can have a major negative impact on both walltimes and ESS.
  • Even for very moderate system sizes, the adjoint method becomes competitive. For a simple scaled predator-prey model (N states, 2xN unknown parameters + unknown initial conditions), rk45 and bdf outperform the adjoint method only for N=2.

Of course, all will depend crucially on the ODE, the priors, the number of observations and a million other parameters, so take everything with a grain of salt. I’ll share the models at some later time.


Sounds very interesting!

So for N=3 and M=6 adjoint is already faster? That’s cool.

How loose can the backward solve tolerance be in relation to the forward solve tolerances and the quadrature solve tolerances in your example? Also, did you loosen relative and absolute tolerances (I have in my tests so far always kept the relative tolerance the same for all solvers, just as in the CVODES examples)?

From what I found, the forward solve should have a higher tolerance than the “nested” ones. Does that reflect your experience?

I also found that hermite polynomials make things more stable (divergent transitions were avoided by them). Any intuition on that?


Well it depends on more than that one parameter, but generally it can do very well. I’ve kept the absolute tolerances constant but very low for both directions.

Whenever rk45 or bdf worked with a relative tolerance of six orders of magnitude (1e-6), the adjoint method usually performed best with the same order for the forward solve and half the order (1e-3) for both backward tolerances. But I would assume that this is highly problem dependent.
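The rule of thumb above, written out as plain arithmetic (a hypothetical helper for illustration, not part of Stan’s interface):

```python
import math


def suggested_backward_rel_tol(forward_rel_tol: float) -> float:
    """Halve the number of orders of magnitude of the forward relative
    tolerance, per the heuristic quoted above (hypothetical helper)."""
    orders = -math.log10(forward_rel_tol)
    return 10.0 ** (-orders / 2.0)


# A forward rel_tol of 1e-6 suggests a backward rel_tol of about 1e-3.
print(suggested_backward_rel_tol(1e-6))
```

As noted above, treat this as a starting point only; the best split is likely highly problem dependent.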

I never tried the hermite interpolation (not polynomials; should fix this in the docs), but whenever I had divergences for the adjoint method, I usually also had divergences for the other methods. I’ll try it some time.


I just created (actually with the help of @rok_cesnovar ) a new experimental release of the adjoint ode tarball here:

@Funko_Unko can you test that one? The changes here should make adjoint faster by avoiding index checks and making better use of vectorised Eigen operations. At least I got the impression that this did indeed make it faster. Can you confirm? Thanks!

This release also extends the example ODE included to now have more options to play with.


I’ll try it out later (y).

In the meantime I’ve got a comparison for anyone interested. An orbital mechanics problem with pretty wide priors, similar to what is discussed here Fitting ODE models: best/efficient practices? - #63 by Funko_Unko

Comparing the adjoint method with different backward tolerances against the rk45 and bdf methods. A description of the model, problem, and method is at the bottom of the image.

Take everything with a grain of salt, sometimes there are some divergent transitions. I’m not 100% sure why.


Can’t install the current tarball (the previous version appeared to work fine).


Sys.setenv(TBB_CXX_TYPE = "gcc", TAR = "/bin/tar")
cmdstanr::install_cmdstan(
  # release_url = "",
  release_url = "",
  cores = 4,
  overwrite = TRUE
)

gives me

--- Translating Stan model to C++ code ---
bin/stanc  --o=examples/bernoulli/bernoulli.hpp examples/bernoulli/bernoulli.stan

--- Compiling, linking C++ code ---


examples/bernoulli/bernoulli.hpp:9:20: error: 'stan::model::cons_list' has not been declared
 using stan::model::cons_list;
examples/bernoulli/bernoulli.hpp:16:20: error: 'stan::model::nil_index_list' has not been declared
 using stan::model::nil_index_list;
examples/bernoulli/bernoulli.hpp: In constructor 'bernoulli_model_namespace::bernoulli_model::bernoulli_model(stan::io::var_context&, unsigned int, std::ostream*)':
examples/bernoulli/bernoulli.hpp:83:17: error: 'nil_index_list' was not declared in this scope
       assign(y, nil_index_list(), context__.vals_i("y"),
make: *** [make/program:55: examples/bernoulli/bernoulli] Error 1

Warning message:
There was a problem during installation. See the error message(s) above. 

Sorry for that. @rok_cesnovar any idea?

Oh, I have to update the stanc3 binaries; there was a recent change (now on Stan develop) to the way we do indexing. Should be fixed in 5 minutes.


Updated now; the above link that @wds15 posted now works correctly. Once again, so you don’t need to scroll:

Sorry for the trouble, should have double-checked…


That appears to work.


… getting back to your evaluation… there is too much going on in the plot for me to get what is happening. Can you summarize a bit? I am not so much interested in the best warmup strategies, but in how much faster adjoint is in terms of Neff/time.

(improving warmup is always great, it’s just not the target of the adjoint ODE work for now)

Sure sorry. What you see here is basically the following:

  • Looser backwards tolerances for the adjoint method (col 1-3) may lead to better walltimes (row 1) and Neff/s (row 9) or may lead to worse walltimes with stuck chains and worse Neff/s (ibid.)

  • For this problem (non-stiff, low no. dimensions) rk45 is the clear winner, being roughly twice as efficient (Neff/s) as the bdf method, which in turn is roughly twice as efficient as the best adjoint configuration. It would be nice to have an adjoint rk45 method to compare to, but as CVODES does not provide it…

  • All methods/configurations appear to converge (eventually) towards the same posterior (which hopefully is the correct one), predicting the same trajectories (last four rows, green).

  • Most of the other plots are just diagnostics for me; sorry, I didn’t bother to remove them.

Note that the priors are quite wide, leaving the problem just beyond the edge of what the regular warmup can (reliably) handle, leading to a lot of chains timing out.
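For anyone wanting to reproduce this kind of comparison, the Neff/time summary can be sketched as follows (the functions are hypothetical helpers, not Stan tooling, and the numbers are made up purely to mirror the rough rk45 : bdf : adjoint ordering described in the bullets above):

```python
def neff_per_second(ess: float, walltime_s: float) -> float:
    """Effective sample size per second of walltime."""
    return ess / walltime_s


def relative_efficiency(results: dict) -> dict:
    """Efficiency of each configuration relative to the best one.

    results maps a configuration name to an (ess, walltime_s) tuple.
    """
    eff = {name: neff_per_second(e, t) for name, (e, t) in results.items()}
    best = max(eff.values())
    return {name: v / best for name, v in eff.items()}


# Made-up numbers: same ESS, walltimes chosen so rk45 is ~2x bdf and ~4x adjoint.
results = {
    "rk45": (400.0, 100.0),
    "bdf": (400.0, 200.0),
    "adjoint": (400.0, 400.0),
}
print(relative_efficiency(results))
```

Normalizing to the best configuration makes runs with different chain counts and time budgets comparable at a glance.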

Thanks. The Adams solver is the non-stiff solver from CVODES. I have seen that you need relatively tight tolerances to keep Adams from derailing. The benefit of Adams is that it can go up to high orders of approximation, but that comes at the risk of totally derailing if your tolerances are too loose. That said, if things turn out well for your problem, Adams can be a lot faster. I have seen that using Adams for the backward sweep is good in some cases.

Hmm, I only ever tried adjoint-Adams at the very beginning, and I think it was roughly equivalent to bdf. But these were very non-exhaustive tests, as my computational resources are limited.

I’m still not sure what the appropriate performance metric should be.
Sure, ideally we want the highest Neff/s and convergence, but for which problems? For simple problems (tight priors) any method/configuration appears to work reasonably well. Then we could look only at the scaling with the number of states/parameters.

But problems with tight priors are kind of uninteresting, aren’t they? If we already know the true value up to .1/1/10%, then what’s the point? However, if the priors are not tight, then we may get large runtime variations and stuck chains, and then how do we compare performance?
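One way to make that comparison concrete when chains get stuck (a sketch of one possible metric, not an established Stan diagnostic): pool ESS across chains, but charge timed-out chains the full time budget with zero ESS, so configurations that stall are penalized rather than silently dropped.

```python
def pessimistic_neff_per_second(chains, time_budget_s):
    """Pooled Neff/s that penalizes stuck chains (hypothetical metric).

    chains is a list of (ess, walltime_s, finished) tuples. Chains that did
    not finish contribute zero ESS and are charged the full time budget.
    """
    total_ess = sum(ess for ess, _, finished in chains if finished)
    total_time = sum(t if finished else time_budget_s for _, t, finished in chains)
    return total_ess / total_time


# Two chains finish quickly; a third hits the 200 s budget and is charged fully.
chains = [(100.0, 50.0, True), (100.0, 50.0, True), (0.0, 200.0, False)]
print(pessimistic_neff_per_second(chains, time_budget_s=200.0))
```

A configuration that is fast when it works but frequently stalls then scores worse than a slower but reliable one, which matches the intuition behind the question above.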

As an example, for this (slightly modified) SIRP model (P = pathogen) with very tight priors (.1%), rk45 still performs best, but bdf and the best (coarsest) adjoint configuration are almost equal.

This is with just one compartment. Black is the regular warmup (what interests you), green is the incremental warmup (better, but you may ignore it). As we widen the priors (not run yet) we would expect larger and larger variation in the runtimes for the regular warmup. Is this then still informative?

Hmm… What I had in mind is that people have a working model which they know well, but which takes long to compute because of an ODE solve with many parameters. Swapping out the current rk45/bdf/adams integrator for the adjoint method hopefully gives them the same posterior in less time (i.e. with a higher Neff/time).

Is the new v2 version any faster for you?

Designing artificial toy experiments is always hard, I agree. Very tight priors are probably not too interesting to explore, indeed.

BTW, it is really interesting to see that your “new” warmup consistently ends up with larger step sizes. One way I found a while ago to get larger step sizes is to target a lower acceptance probability of just 0.7 or so. As long as divergent transitions stay away, this makes the sampler take larger steps and run faster in the examples I tried.

I’m actually not sure. It’s not “blows me out of the water” faster, but I also haven’t run a direct comparison, mainly because I was too lazy: the ode_adjoint_tol_ctl signature changed slightly (for the better), and I did not want to keep two files for each of my models. I might run a comparison later.

Yes, that’s something I also noticed and think should be good. However, it appears that this effect is less pronounced when the prior does not dominate the posterior, though it is still there.

I’ve noticed that the step size gets set lower if the spectrum of the metric is not approximated well, which appears to make sense at first glance: a worse approximation of the metric means more work is necessary.

Don’t worry about this comparison. It’s not worth spending time on extra benchmarks. I was hoping you had your runtimes ready for comparison. I think I saw a nice improvement.

I did save them, but there have been some model iterations…

Hm, alright. I’m not sure whether this test actually does what I want it to. Using cmdstanpy, I use different cmdstan paths to initialize and thus compile the models. The .hpp files are different, and I believe they get compiled at initialization, which is where all the changes should be, right @wds15 ? @mitzimorris @ahartikainen Can I just do this?

Anyhow, assuming this works, I do not see any differences in runtime for a SIR model with one compartment. So maybe it did not work as intended.

Left columns should use the new version, right columns the old one. Using my favorite warmup procedure. FYI, the regular warmup fails to complete for any chain within 200 seconds, while the incremental warmup lets me warm up and sample 1k draws within ~40 seconds. But disregard this; performance benefits should show up with either warmup method.

Also, I’ll upload a figure where the regular method works in a bit.

Edit 2:

Here we have the same problem with much tighter priors (.1% vs 10%). Not much changes, except that the regular warmup no longer times out.