Dear @bbbales2 and Stan team,
Why did you not pursue this further? It seems like you coded most of it and it was nearly ready for merging. I read through all the comments but didn't actually understand the reasons for not pursuing it.

In our project, we wrote a wrapper around integrate_ode_rk45 in stan::math to integrate using the adjoint method, and it seems to work correctly. I am a novice in stan::math, but I can try to conduct the experiments required for merging this.

I lost interest in this approach because integrating the adjoint ODE accurately turned out to be quite difficult. It just didn't end up being as fast as I wanted.

It bothered me that the continuous adjoint ODE machinery gives you approximations to the true sensitivities, when in reality HMC wants gradients of whatever is in the log density evaluation, which is itself an approximation.

The discrete adjoint ODE stuff (autodiffing through a solver, basically) gives you exact gradients of the approximation.
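To make that distinction concrete, here is a toy sketch in plain Python (not Stan code; the forward-Euler solver and the test ODE dy/dt = -θy are my own illustrative choices). Differentiating the discrete scheme by hand gives the gradient of exactly the number the solver returns, which a finite-difference check on the same solver confirms; the true continuous sensitivity differs from both by the discretization error.

```python
def euler_solve(theta, y0=2.0, T=1.0, n_steps=1000):
    """Forward Euler for dy/dt = -theta * y; returns the *discrete* y(T)."""
    h = T / n_steps
    y = y0
    for _ in range(n_steps):
        y = y + h * (-theta * y)
    return y

theta, n, h = 0.5, 1000, 1.0 / 1000

# Derivative of the discrete scheme itself, derived by hand:
# y_N = y0 * (1 - theta*h)^N  =>  dy_N/dtheta = -N*h*y0*(1 - theta*h)^(N-1)
exact_discrete_grad = -n * h * 2.0 * (1.0 - theta * h) ** (n - 1)

# A finite difference of the same solver agrees with it to fd accuracy,
# even though both differ from the continuous sensitivity by O(h).
eps = 1e-7
fd = (euler_solve(theta + eps) - euler_solve(theta - eps)) / (2 * eps)
print(fd, exact_discrete_grad)
```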

Chris Rackauckas wrote a paper on the continuous vs. discrete, fwd vs. reverse thing. He’s as deep into ODEs/autodiff and the types of problems we care about as anyone (he’s done tons for ODEs in Julia with DifferentialEquations.jl). It’s here: https://arxiv.org/abs/1812.01892

Chris also made some compelling arguments for handling things like PKPD dosing better at the solver level.

If/when I go back to this, I'd start by reading Chris's paper, experiment with the solvers on different problems, and just stick to the discrete sensitivity stuff (be it adjoint or forward).

If you want to work on this, feel free. I'm happy to share opinions. My buddy @arya probably would too if he has time. Arya reached out to Chris about this stuff, and Chris was super happy to share advice. I don't think just adding the continuous adjoint sensitivity stuff is the way to go, though. Too many other questions.

Can you describe your application a little bit, and what benefits of the adjoint approach you are observing?

I have wanted to evaluate this approach myself for a while (I still intend to), but I got sidetracked by within-chain parallelization, which solves the speed problem for any hierarchical ODE model.

@bbbales2 the profiler dump you show above makes perfect sense to me by now. We incur a huge amount of overhead from the nested vector machinery used to get the Jacobian of the ODE system here. I coded up code generation for this and I am getting a 2-3x speedup. One should probably reattempt all of this with code generation. I still think that the adjoint method must be much better for a range of interesting problems (many ODE parameters, for example).
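For readers following along: the coupled (forward sensitivity) system augments the states with one sensitivity column per parameter, and its right-hand side needs the ODE Jacobian at every step, which is where that overhead comes from. A toy scalar sketch in plain Python (illustrative only, not the stan::math code; the ODE dy/dt = -θy is my own example):

```python
import math

def f(y, theta):         return -theta * y  # dy/dt = f(y, theta)
def df_dy(y, theta):     return -theta      # ODE Jacobian w.r.t. the state
def df_dtheta(y, theta): return -y          # partial w.r.t. the parameter

def forward_sensitivity(theta, y0=2.0, T=1.0, n_steps=20000):
    """Euler integration of the coupled state + sensitivity system."""
    h = T / n_steps
    y, S = y0, 0.0                  # S = dy/dtheta, with S(0) = 0
    for _ in range(n_steps):
        # dS/dt = (df/dy) * S + df/dtheta  -- needs the Jacobian each step
        S = S + h * (df_dy(y, theta) * S + df_dtheta(y, theta))
        y = y + h * f(y, theta)
    return y, S

y_T, S_T = forward_sensitivity(0.5)
# Analytic reference: y(T) = y0*exp(-theta*T), dy/dtheta = -T*y0*exp(-theta*T)
exact_y = 2.0 * math.exp(-0.5)
exact_S = -1.0 * 2.0 * math.exp(-0.5)
print(y_T, S_T)
```

With N states and P parameters this becomes an N + N·P system, which is why generating fast Jacobian code pays off.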

…for now I am busy with the next-gen parallelism using the TBB…

Our aim is to integrate the dynamics of the system (first-order ODEs) forward in time and get the gradients of the integrated state with respect to the parameters. More specifically, these ODEs describe rigid body dynamics.

We are still in the process of merging our implementation of adjoint-method integration. We are motivated by the Neural Ordinary Differential Equations paper, where the authors reported several advantages (in memory as well as time) of using the adjoint method. We have replicated NeuralODE's implementation in Stan math.
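A toy sketch of the continuous adjoint recipe in plain Python (scalar state, loss L = y(T), dynamics dy/dt = -θy are my own illustrative choices; this simplifies Appendix B of the NeuralODE paper, and for clarity it stores the forward trajectory rather than re-solving the state backward as the paper does to save memory):

```python
import math

def f(y, theta):         return -theta * y  # dynamics
def df_dy(y, theta):     return -theta
def df_dtheta(y, theta): return -y

theta, y0, T, n = 0.5, 2.0, 1.0, 20000
h = T / n

# Forward pass: store the trajectory (the adjoint ODE needs y(t)).
ys = [y0]
for _ in range(n):
    ys.append(ys[-1] + h * f(ys[-1], theta))

# Backward pass:  da/dt = -a * df/dy,   dL/dtheta = int_0^T a * df/dtheta dt,
# with terminal condition a(T) = dL/dy(T) = 1 for the loss L = y(T).
a, grad = 1.0, 0.0
for k in range(n, 0, -1):
    y = ys[k - 1]
    grad += h * a * df_dtheta(y, theta)
    a += h * a * df_dy(y, theta)      # Euler step backward in time

# Analytic check: dL/dtheta = -T * y0 * exp(-theta*T)
exact = -T * y0 * math.exp(-theta * T)
print(grad, exact)
```

The memory advantage comes from the fact that, unlike here, the paper's version never stores the forward trajectory.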

Depending on the scenario, would Newmark or (even better) HHT give better accuracy and proper dissipation? In that case you're going to lose AD from Stan's solvers. But since you're already implementing your own integrator, maybe HHT-alpha plus the complex-step approximation gives a faster turnaround (not necessarily a faster solution). As for the adjoint: what is your Lagrangian?
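For reference, the complex-step approximation perturbs the parameter along the imaginary axis and runs the integrator in complex arithmetic; because no subtraction of nearly equal numbers occurs, it recovers the derivative of the discrete solution to near machine precision. A minimal Python sketch (illustrative only; the forward-Euler solver and test ODE are my own choices, not tied to HHT):

```python
def euler_solve(theta, y0=2.0, T=1.0, n_steps=1000):
    """Forward Euler for dy/dt = -theta * y; works for complex theta too."""
    h = T / n_steps
    y = y0
    for _ in range(n_steps):
        y = y + h * (-theta * y)
    return y

theta, hstep = 0.5, 1e-20

# Complex-step: d/dtheta y_N  =  Im(y_N(theta + i*hstep)) / hstep,
# accurate to machine precision even for a tiny hstep, since there is
# no catastrophic cancellation as in a finite difference.
cs_grad = euler_solve(theta + 1j * hstep).imag / hstep

# Hand-derived derivative of the discrete scheme, for comparison:
# y_N = y0*(1 - theta*h)^N  =>  dy_N/dtheta = -N*h*y0*(1 - theta*h)^(N-1)
n, h = 1000, 1.0 / 1000
exact = -n * h * 2.0 * (1.0 - theta * h) ** (n - 1)
print(cs_grad, exact)
```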

The adjoint method treats the integration method being used as a black box. We can use Newmark, HHT, or any other integrator. Our concern is getting fast derivatives of the integrated state, which is the main advantage of the adjoint method.

I'm sorry, but I didn't understand this point. Which case are you talking about?

Not very sure, but have a look at page 3 of this Stanford tutorial, which derives the adjoint method for ODEs via a Lagrangian. As mentioned, we used the implementation and derivation from the NeuralODE paper (see Appendices B.1 and B.2).
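Roughly, the Lagrangian route goes like this (a sketch for a generic objective $L=\int_0^T g(y,\theta)\,dt$ subject to $\dot y = f(y,\theta,t)$; the notation here is mine, not copied from the tutorial):

```latex
% Adjoin the ODE constraint with a multiplier \lambda(t):
\mathcal{L} = \int_0^T g(y,\theta)\,dt
            + \int_0^T \lambda^\top \left( f(y,\theta,t) - \dot{y} \right) dt
% Integrating \lambda^\top \dot{y} by parts and choosing \lambda so that all
% dy/d\theta terms vanish gives the adjoint ODE, solved backward in time:
\dot{\lambda} = -\left(\frac{\partial f}{\partial y}\right)^{\!\top}\!\lambda
              - \left(\frac{\partial g}{\partial y}\right)^{\!\top},
\qquad \lambda(T) = 0
% The gradient then reduces to a single backward pass:
\frac{dL}{d\theta} = \int_0^T \left( \frac{\partial g}{\partial \theta}
                   + \lambda^\top \frac{\partial f}{\partial \theta} \right) dt
                   + \lambda(0)^\top \frac{\partial y(0)}{\partial \theta}
```

In the NeuralODE setting, where the loss depends only on $y(T)$, the $g$ terms drop out of the integrand and the terminal condition becomes $\lambda(T) = \partial L/\partial y(T)$.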