Adjoint ODE behaviour

Hello everyone,

Our work is currently involved with a Bayesian SIR model with a time-varying infection rate on daily COVID-19 mortality data. The SIR ODE function keeps information about the three compartments (S,I,R) and a dummy compartment is used to store the time-varying infection rate for further use.

Letting N be the number of states and P be the number of model parameters, we have N >> P. We have used the RK45 solver and the analysis is successfully executed (i.e. convergence diagnostics and posterior outputs are reasonable) in about 46 hours under the setting (395 time points, chains = 2, warmup = 2000, main iters = 5000), using rstan 2.21.2 on a Windows 10 machine (Intel Core i7, 16GB RAM).

We wish to extend this work to an age-structured SIR model with a time-varying infection rate. Acknowledging the computational limitations of the analysis with a RK45 solver, a preliminary run using the RK45 solver was performed with (chains = 2, warmup = 500 iters, main = 700 iters). After 2 weeks, the progress was at about 60%.

Previous discussions (link1, link2, link3) have pointed towards the Adjoint ODE solver, expected to scale better under the N >> P situation. Taking a step back, we moved to cmdstan 2.27.0 and attempted to implement the Bayesian SIR model (no stratification) with the Adjoint ODE solver [ode_adjoint_tol_ctl(), cmdstan version 2.27.0, cmdstanr 0.4.0.90, R version 4.1.0). If this works, then we’ll move to the age-structured model.

While the adjoint solver results into a fast implementation (~45 minutes!), a very similar informational message to what @Teddy_Groves1 has previously reported here comes up and all iterations had divergences. We’ve tried different different initialization points, but the issue persists.

Ideas from the people who have worked on this feature (and anyone else familiar with this feature) on how to configure the adjoint solver would be much appreciated.

I attach part of the .R file related to MCMC sampling and part of the corresponding .stan program; additionally, the error messages I receive after a short test run, for illustration purposes, if this can help. Also tagging @wds15, @charlesm93, @Funko_Unko in this.

Thanks!

Messages during compilation:

Messages during MCMC sampling:

Test_Adjoint_v2_forum.stan (4.8 KB)
AdjointODE_test.R (3.2 KB)

3 Likes

Hi!
A number of things could be happening.

For starters, they may be steps other than switching to an adjoint method that you can take to improve your computation. The stiff BDF solver may be more suited than RK45; certain reparameterizations of the ODE can help too. Some of these concepts are discussed in this paper: https://onlinelibrary.wiley.com/doi/10.1002/sim.9164.

Next, I would make sure the adjoint method, as implemented, returns results consistent with the non-adjoint method for reasonable parameter values. I would compare the results (i) without computing the gradients (i.e. using generated quantities) and (ii) with gradient computation. If the second case fails, it may be that the solver is having a hard time solving the adjoint system, which results in CVODES errors. This may then be an invitation to change the tuning parameters of the ODE.

1 Like

Thanks for sharing this. Can you please run once the “diagnose” method from CmdStan to test that the gradients are correct for your model? Also, @Teddy_Groves1 reported that one of the first test versions of the adjoint solver worked well for him…could you test that possibly (you would have to dig a bit into the digital world to find these, sorry). I am starting to get worried that there is an issue, so thanks for compiling a simple to run example as it looks…will have to follow up on this. Sorry for the trouble.

1 Like

I’m familiar with the Bayesian workflow paper, thanks for pointing it out! To be more precise, a solver with the trapezoidal rule has proven to be faster than RK45 in our case. Nevertheless, we are still talking about 2 days of CPU time under the setting I mentioned in my post. Either with RK45 or with trapezoidal rule, the CPU time under the age-structured model will be prohibitive for the interested user.

For starters, they may be steps other than switching to an adjoint method that you can take to improve your computation. The stiff BDF solver may be more suited than RK45; certain reparameterizations of the ODE can help too. Some of these concepts are discussed in this paper: https://onlinelibrary.wiley.com/doi/10.1002/sim.9164

A test run using the BDF solver with 4 iterations would show the message “1000 transitions using 10 leapfrog steps would take 2070 seconds.”, while with the trapezoidal rule this would be 20-40 seconds, making the BDF solver implementation even slower for our experiment.

I would compare the results (i) without computing the gradients (i.e. using generated quantities) and (ii) with gradient computation. If the second case fails, it may be that the solver is having a hard time solving the adjoint system, which results in CVODES errors. This may then be an invitation to change the tuning parameters of the ODE.

According to the posterior output under the adjoint solver, the effective sample size for all the reported model parameters is =1, the values of the estimated \hat{R} approach the stratosphere, and the posterior means disagree with our benchmark results. So, the second case fails, if I understand correctly.

Can you please run once the “diagnose” method from CmdStan to test that the gradients are correct for your model?

Sure. It appears that the csv files from the adjoint ODE setting that I mentioned in my original post were deleted from the temporary folder. I re-ran the same MCMC experiment with one chain this time. Here’s what I get with adapt_delta = 0.99 and max_treedepth = 19:

Also, @Teddy_Groves1 reported that one of the first test versions of the adjoint solver worked well for him…could you test that possibly (you would have to dig a bit into the digital world to find these, sorry).

Seems like the “cmdstan-ode-adjoint-v2” was working for them. After some digging, it looks like the R code to use this test version is available resildes in this minimal example that was prepared for that forum post. @Teddy_Groves1 can you please confirm?

will have to follow up on this

It would be so cooll if we sort this out, yes!

Hi there! I think that script should still work for installing the cmdstan-ode-adjoint-v2 version of cmdstan. If not this worked on my computer just now:

cd ~/.cmdstan
wget -c https://github.com/rok-cesnovar/cmdstan/releases/download/adjoint_ODE_v2/cmdstan-ode-adjoint-v2.tar.gz -O - | tar -xz
cd cmdstan-ode-adjoint-v2
make build 
# ...wait for it to build
export CMDSTAN=~/.cmdstan/cmdstan-ode-adjoint-v2
1 Like

Hi!

Oh… I don’t mean the hmc diagnose method from cmdstanr, but the “diagnose” method from cmdstan which tests the gradients… @rok_cesnovar does cmdstanr support that?

Sebastian

Yes it does.

res <- mod$diagnose(data = ..., init = ... )
res$gradients()

And you get a d dataframe with the gradients.

1 Like

Thanks @rok_cesnovar.

@wds15 Good morning Sebastian. It appears the diagnose method fails in this case and I cannot retrieve the dataframe with the gradients. Screenshot with the error message below. In the meantime, I am working on the cmdstan-ode-adjoint-v2 implementation. I’ll get back to you asap.

Lampros

The installation of cmdstan-ode-adjoint-v2.tar.gz was successful on my local machine, using the guidelines from the .R script. Thanks!

1 Like

does the diagnose step work with the v2 adjoint tar file?

Thanks for all the debugging!

I have successfully installed cmdstan-ode-adjoint-v2 using the following. So hopefully, I’m using the test version of the adjoint solver in my run.

cmdstanDirBase <- paste0(Sys.getenv("HOME"),'/.cmdstan/')

ιnstall_cmdstan(dir        = cmdstanDirBase,
                release_url="https://github.com/rok-cesnovar/cmdstan/releases/download/adjoint_ODE_v2/cmdstan-ode-adjoint-v2.tar.gz")
                check_toolchain = FALSE)

cmdstanVersion <- "cmdstan-ode-adjoint-v2"
cmdstanDir     <- file.path(cmdstanDirBase, cmdstanVersion)

set_cmdstan_path(cmdstanDir)

I’ve tried different combos of function parameter values for the solver, but I’m always getting back the same error messages as above. The diagnose step did not work either, I’m afraid…

Can you (and @Teddy_Groves1 ) maybe try the following:

  1. use the released 2.27.0
  2. grab the file https://raw.githubusercontent.com/stan-dev/math/bugfix/adjoint-ode-odd/stan/math/rev/functor/ode_adjoint.hpp
  3. replace the respective file in 2.27.0 with the just grabbed file
  4. rerun your model

This is a wild guess… so far I cannot reproduce the issues you guys are having. Smaller examples would be great (which ideally run fast) to tickle the issue.

good news… I think I now can see the error…and the above suggestion wont get you anywhere. I am quite in the dark as to what is going wrong…but I can debug for now…

1 Like

^if you can share anything happy to help

1 Like

Some help would be great… let me modify the minimal example such that it works a bit easier. Will ping you.

2 Likes

I think everything that has been discussed in the CoDatMo thread should be relevant again.

No matter which solver you use, as soon as you have non-constant (over time) parameters, it is very very (very) likely that you have to split up the solution of the ODE by hand into separate time intervals in which the ODE parameters are smooth. And don’t do linear or binary or whatever search to identify the correct beta[i].

Also, don’t forget about finding an efficient representation of the effects you want to capture.

HI @stevebronder !

I slightly modified script.R such that it runs without rstudio and simplified things.

The rc version runs ok and takes really long to compute, the 2.27.0 release does reject quickly the draws, because the data is inf or whatever. To me it looks like we are having memory issues. I will undo things looking fishy to me in comparison to the working version and hope to arrive at a working version. If you see something obvious, let me know!

Best,
Sebastian

Thanks for looking into this! I’ll try and make a smaller example.

Hey @stevebronder !

I have now made a bit of progress and pushed things to a git branch along with a draft PR:

Let’s discuss over there. @Teddy_Groves1 a simpler example would be great, but for now we are good, I think.

Sebastian

1 Like