@betanalpha, @vianeylb, and I just posted a preprint on a discretized version of the adjoint method developed for Ordinary Differential Equations.
The paper reviews methods for continuous ODEs, and develops an analog technique for difference equations, of the type
u_{n + 1} - u_{n} = \Delta(u_n, \psi, n)
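For concreteness, here is a minimal sketch of the discrete adjoint recursion for an update of this form. The choice of \Delta, the loss J(u_N) = u_N, and all the names below are toy assumptions for illustration, not the paper's notation:

```python
# Discrete adjoint for u_{n+1} = u_n + Delta(u_n, psi, n), scalar case.
# Toy choice: Delta(u, psi, n) = psi * u, so u_N = u_0 * (1 + psi)^N.

def delta(u, psi, n):
    return psi * u

def ddelta_du(u, psi, n):
    return psi

def ddelta_dpsi(u, psi, n):
    return u

def forward(u0, psi, N):
    us = [u0]
    for n in range(N):
        us.append(us[-1] + delta(us[-1], psi, n))
    return us

def adjoint_grad(us, psi):
    """dJ/dpsi for J(u_N) = u_N, via the backward (adjoint) recursion."""
    N = len(us) - 1
    lam = 1.0          # lambda_N = dJ/du_N
    grad = 0.0
    for n in reversed(range(N)):
        grad += lam * ddelta_dpsi(us[n], psi, n)
        lam = lam * (1.0 + ddelta_du(us[n], psi, n))  # lambda_n
    return grad

u0, psi, N = 1.0, 0.1, 5
us = forward(u0, psi, N)
g = adjoint_grad(us, psi)
# analytic check: dJ/dpsi = u0 * N * (1 + psi)^(N - 1)
```

The point is that one forward sweep plus one backward sweep gives the gradient in O(N), regardless of how many parameters sit inside \Delta.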
One illustrative example is computing sensitivities for Hidden Markov Models (HMMs) with discrete latent states. It’s a rather new perspective on the matter, and “magically” it motivates the same differentiation algorithm that you would get as a byproduct of Expectation-Maximization, which is at least of theoretical interest.
In line with this work, we’re developing some new functions to do HMMs in Stan, see this issue on GitHub.
Would this be a solution to the problem that for ODEs the continuous adjoint method computes gradients of the true continuous solution, whereas one would actually want the gradients of the approximate discretized solution? @bbbales2?
Yeah this is what people call the discrete adjoint problem. The additional question that I think @anon75146577 is alluding to is whether timestep control could be differentiable.
It might be differentiable still but I’d be surprised if it could be written as u_t = f(u_{t-1}). But it might be possible on an extended state space. (Tbqh I had a bad experience with Butcher’s book and just gave up on adaptive ode methods)
The solver does not have to be written as u_t = f(u_{t-1}), if by f we mean the right-hand-side function of the ODE. The studied finite difference equations are of the type u_{n+1} = u_{n} + \Delta_n(u_n, \psi, n), so I guess this could be used for any method where \Delta_n(u_n, \psi, n) does not depend on u_{n-1}, \ldots, u_{0}. The RK45 method satisfies this, but the CVODES solver is a linear multistep method, so maybe not that one.
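To make the one-step structure concrete, here is a sketch of a single classical RK4 step written in exactly this form, where \Delta_n depends only on the current state (the ODE and all names are toy assumptions):

```python
import math

# One explicit RK4 step written as u_{n+1} = u_n + Delta_n(u_n, psi, t_n, h).
# All four stages are built from u alone, so Delta_n never touches earlier states.

def rk4_delta(f, u, psi, t, h):
    k1 = f(u, psi, t)
    k2 = f(u + 0.5 * h * k1, psi, t + 0.5 * h)
    k3 = f(u + 0.5 * h * k2, psi, t + 0.5 * h)
    k4 = f(u + h * k3, psi, t + h)
    return h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

# Toy ODE du/dt = psi * u, exact solution u0 * exp(psi * t).
f = lambda u, psi, t: psi * u
u, psi, t, h = 1.0, 1.0, 0.0, 0.01
u_next = u + rk4_delta(f, u, psi, t, h)
```

A multistep method like BDF (what CVODES uses) would instead need u_{n-1}, u_{n-2}, … inside the update, which breaks this form unless you stack those lags into an extended state.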
If it was in that form then the solver would necessarily be an explicit solver.
As for the time step control being differentiable, most solvers will have something like a while loop where they repeatedly compute an approximation of the local error and then decide the step size based on that. @bbbales2 and I talked about this a couple years ago so I’m a little hazy on it, but because those steps are based on discrete decisions, the derivative of the numerical solution with respect to parameters may be discontinuous, which could ultimately mean the gradient of the likelihood is discontinuous. It may or may not have much of a practical impact on inference. I’m not sure. I guess it depends on how much the precision of the gradient evaluation affects the integrity of samples in HMC.
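Something like this toy accept/reject loop is what I mean (illustrative only, not any real solver’s API): the branch on the error estimate is a discrete decision, so the accepted step size, and hence the solution, is a piecewise function of the parameters.

```python
# Toy step-size control: Euler step vs. two half steps for the error estimate.
# The if/else on err is the discrete decision that can put kinks in the
# map from parameters to the numerical solution.

def step_with_control(f, u, t, h, tol):
    while True:
        full = u + h * f(u, t)                             # one step of size h
        half = u + 0.5 * h * f(u, t)
        two_half = half + 0.5 * h * f(half, t + 0.5 * h)   # two half steps
        err = abs(two_half - full)                         # local error estimate
        if err <= tol:
            return two_half, t + h, h                      # accept
        h *= 0.5                                           # reject: shrink, retry

u1, t1, h1 = step_with_control(lambda u, t: -u, 1.0, 0.0, 1.0, 1e-3)
```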
I haven’t gotten a chance to do a deep read but cool stuff!
@charlesm93 how much of a speedup could we expect to see on HMMs? I might be trying to fit some HMMs with many time points in the near future. I think currently evaluating the log-likelihood scales as O(N K^2) where N is the number of time points and K is the number of states. Does that scaling change?
@jtimonen What is the benefit, from an inference perspective, of getting the gradient of the approximate numerical solution, as opposed to the approximate gradient of the true solution?
@arya The \mathcal O(NK^2) scaling doesn’t change. However, you get rid of the autodiff overhead and a high constant factor (+ high memory usage).
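For reference, the \mathcal O(NK^2) comes from the forward algorithm itself: N time steps, each with a K \times K matrix-vector product. A minimal sketch of the scaled forward recursion (the variable names and toy inputs are mine, not the Stan implementation):

```python
import numpy as np

# Scaled HMM forward algorithm: log-likelihood in O(N K^2).
# Gamma[i, j] = P(z_n = j | z_{n-1} = i); rho = initial distribution;
# log_omega[k, n] = emission log-density of observation n under state k.

def hmm_loglik(log_omega, Gamma, rho):
    K, N = log_omega.shape
    alpha = rho * np.exp(log_omega[:, 0])
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()              # rescale to avoid underflow
    for n in range(1, N):                    # N steps ...
        alpha = (Gamma.T @ alpha) * np.exp(log_omega[:, n])  # ... each O(K^2)
        c = alpha.sum()
        loglik += np.log(c)
        alpha = alpha / c
    return loglik
```

The adjoint trick doesn’t change these loops; it removes the need to tape them with autodiff, which is where the constant factor and memory savings come from.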
Maybe @bbbales2 can pose the problem better (’cause he’s the one who introduced it to me), but I guess it is that if your numerical solver has some error compared to the true solution (which is always the case), then for HMC one should use the gradient of that erroneous solution, so as not to screw up the potential and lose energy in HMC or something.