Adjoint ode: different behaviour in cmdstan-ode-adjoint-v2 vs cmdstan-2.27.0

My colleagues and I have been having some trouble with a model that uses the recently-released adjoint ode solver and would really appreciate some help.

The background is that we are making an application for fitting Bayesian statistical models of measurements of metabolic networks using cmdstan via cmdstanpy. The source code lives here and has documentation here. In order to connect our model’s parameters with its measurements we need to solve a steady state problem: given a set of kinetic parameters and boundary conditions, what concentration profile of internal metabolites will be stable? We solve this problem by representing the network’s dynamics as a system of ODEs and simulating for a reasonably long time from a biologically plausible starting state.

The ODE-solving step is by far the slowest part of our model, and involves more parameters than state variables, so we were hopeful that the adjoint solver would speed up our models. Indeed this is what we observed when we tested the new solver under cmdstan release cmdstan-ode-adjoint-v2. However, since the official release we have encountered a new type of error and bad sampler behaviour, which we did not see before.

Specifically, whereas under version cmdstan-ode-adjoint-v2 we are able to achieve more or less stable sampling, under versions cmdstan-2.27.0-rc2 and cmdstan-2.27.0 something seems to go wrong with the adaptation during the initial phase, with the result that the step size rapidly gets very small and almost all iterations have divergences. In addition we see the following error under versions cmdstan-2.27.0-rc1 and cmdstan-2.27.0 but not under version cmdstan-ode-adjoint-v2:

Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception: ode_adjoint_tol_ctl: ode parameters and data[3] is inf, but must be finite! (in '/Users/tedgro/dtu/projects/minimal_maudfail/model.stan', line 313, column 4 to line 358, column 40)
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

I think this means that, somewhere, an ode state variable (i.e. the firstnon-function argument to ode_adjoint_tol_ctl) is infinity, which seems like the kind of thing that would cause a problem! My first thought was that I needed to tighten some tolerances, but I didn’t manage to find a configuration that avoids the error, and this doesn’t explain the differences we see between cmdstan versions.

We have prepared an example setup here that reproduces this behaviour. Running make stan-environment should download and build all the relevant cmstan versions (cmdstan-ode-adjoint-v2, cmdstan-2.27.0-rc1 and cmdstan-2.27.0) in the cmdstan folder, then python will run one of our models under each version, with the results saved in the output folder. The Stan input lives in the folder data - for example you can see the ode tolerances that we used here. If you don’t want to adjust anything some results from my laptop (intel macbook pro running macos Mojave, cmdstanpy 0.9.76, Apple LLVM version 10.0.1 (clang-1001.0.46.4)) are already uploaded in the output folder.

I’m posting mainly to ask the people who worked on this feature, and anyone else who knows about the adjoint method, whether you have seen this kind of behaviour or know why it might be happening. Did anything significant change between cmdstan-ode-adjoint-v2 and cmdstan-2.27.0-rc1?

Any comments, thoughts or questions would be very gratefully received!

1 Like

I am glad to see that the adjoint solver is doing useful stuff for you (or at least it did).

Nothing crosses my mind immediately in what changes. Thanks for that git repo… I will try to look into it, but will need some time for this due to other things going on.

EDIT: Being an R person you would help me in turning the python into R as will otherwise have to spend time on the python bit… or dummy proof python instructions for a Mac user would also help.

EDIT2: one thing you can try to debug already yourself is to test the data and parameters for being finite before the call of the adjoint ode solver. Maybe that tells you already something as then you can pull out more context as to how this happens and why it does not happen under the old version.

EDIT3: Without having read things… your description sounds as if you have a really bad initialisation. Is that something you can improve by providing better initial starting points? In addition decreasing the initial step-size somewhat can help with keeping the sampler in the vicinity of the initial for at least a little bit.

Thanks for the suggestions!

I have expanded the instructions and added some comments to the python script, and we will work on adding an equivalent R script using cmdstanr, which I think shouldn’t be too hard. Let me know if there is anything that could be explained better.

Good idea - I tried printing the ode input just before the ode_adjoint_tol_ctl call (see latest version of model.stan and results) and couldn’t find anything infinite. I guess this points towards something going wrong during the ode call?

Thanks for pointing this out - I thought the inits were all being set based on a previous good sample but on inspection some weren’t. It’s now corrected and starts at a spot with relatively high log probability, but unfortunately this didn’t fix the issue, even with the initial step size set to 1e-6.

I’ve implemented an R script that is available in the repository that behaves the same way as the python script with no need to create an environment. It will install the various CmdStan versions as you run through it. Hope it helps :)

During my attempts I’ve also noticed that the odd behaviour doesn’t really appear with the shorter runs and with 2.27.0-* versions it results in smaller and smaller step sizes (approaching 1E-67) which occasionally go to -inf

EDIT: Sorry, the step size goes to -nan and the lp goes to -inf. I’ve attached an example below because it does not consistently occur.
methionine_cycle_failure_run.csv (367.4 KB)