Thank you all so far for your answers and time :) I have more questions and will most certainly come back to them at a later time.
@wds15 can you tell me more about the adjoint ode (or point me somewhere)? Don’t hold back :)
And here is a very early benchmark of this thing:
Hey again @jtimonen,
I saw you said runtime information might come at a later time, but what was your first impression? How did the methods compare in terms of runtime (approximately)?
More generally and on a somewhat related note, how does the temporal/spatial discretization affect the gradient of the posterior that gets computed by Stan? I guess Stan knows nothing about the spatial discretization, so it should treat the semi-discrete heat equation just as it would an ODE?
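For concreteness, I imagine a semi-discretized 1-D heat equation would just look like an ordinary ODE right-hand side to Stan, roughly something like this (a minimal sketch with made-up names, zero boundary values, and central differences in space):

```stan
functions {
  // Method-of-lines RHS for u_t = kappa * u_xx on an equispaced grid
  // with spacing dx and zero Dirichlet boundary values. To Stan this
  // is just another ODE right-hand side.
  vector heat_rhs(real t, vector u, real kappa, real dx) {
    int N = rows(u);
    vector[N] du;
    du[1] = kappa * (u[2] - 2 * u[1]) / dx^2;
    for (i in 2:(N - 1)) {
      du[i] = kappa * (u[i + 1] - 2 * u[i] + u[i - 1]) / dx^2;
    }
    du[N] = kappa * (u[N - 1] - 2 * u[N]) / dx^2;
    return du;
  }
}
```

so e.g. `ode_bdf(heat_rhs, u0, t0, ts, kappa, dx)` would presumably treat it like any other (stiff) ODE system?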
How exactly does Stan handle the time discretization? Does it attempt to approximate the exact gradient, or does it “exactly” compute the approximate gradient (via autodiff)? Is the former even possible? I did not find a nice write-up; does one exist? (Edit: actually, CVODES is used, so I guess this is relevant? https://computing.llnl.gov/sites/default/files/public/cvs_guide-5.7.0.pdf )
Thank you!
In my understanding, the autodiff for the built-in Stan ODE solvers works so that there is an additional system of sensitivities (derivatives of the solution wrt params), which is solved simultaneously with the actual system. And if you write your own solver, then it just autodiffs through all the iterative operations (direct method). So these are different methods in terms of what gradient they compute. I think that as we increase the tolerances, the built-in solvers get not only the solution but also the gradients more wrong (they are not exact gradients of the approximate solution). Whereas if you use the direct method, the gradients are exact for the approximate solution.
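To make that concrete, here is a minimal sketch (hypothetical names) of what the sensitivity system looks like for a one-parameter decay ODE y' = -theta * y. Something like this augmented system is conceptually what the built-in solvers integrate to get gradients, and it is only solved to the same tolerances as the state itself:

```stan
functions {
  // State z[1] is y, state z[2] is the sensitivity s = dy/dtheta,
  // which obeys s' = (df/dy) * s + df/dtheta = -theta * s - y.
  vector decay_with_sens(real t, vector z, real theta) {
    vector[2] dz;
    dz[1] = -theta * z[1];         // the actual system
    dz[2] = -theta * z[2] - z[1];  // forward sensitivity equation
    return dz;
  }
}
```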
With RK45, I have sometimes gotten around 10 times faster runtimes with high tolerances compared to the defaults, while importance sampling is still reliable (pareto_k < 0.5). With RK45 I have quite often seen that higher tolerances decrease runtime, but at some point, when you increase them too much, it becomes slower, presumably because it is sampling a very biased posterior and the gradients are very wrong. With BDF it has been much more difficult to predict runtime. Reasons can be that it adapts not only the step size but also the order of the method, and it has to do an implicit solve on each iteration to get the next step (this is also solved numerically and has its own tolerances, which I think cannot be controlled from Stan).
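For reference, the tolerances are set through the `_tol` variants of the solver calls; a sketch, assuming a functions-block right-hand side `decay(real t, vector y, real theta)` and made-up data names:

```stan
// Looser tolerances than the defaults; y0, t0, ts, theta are hypothetical.
array[N] vector[1] mu
  = ode_rk45_tol(decay, y0, t0, ts,
                 1e-4,   // relative tolerance
                 1e-4,   // absolute tolerance
                 10000,  // max_num_steps
                 theta);
// ode_bdf_tol takes the same three control arguments.
```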
Anyway, these depend a lot on the problem, the value of max_num_steps, and the parameter initialization. There can be huge variation between the runtimes of chains, so it is difficult to compare.
RK45 is from Boost so it is not from CVODES like BDF.
@bbbales2 might have more comments about the PDE example (and autodiff, as well as others here)
Thanks!
Based only on a quick skim of [1701.02434] A Conceptual Introduction to Hamiltonian Monte Carlo, I guess that as we increase our solver tolerances, the (possible?) disconnect between the approximation of the posterior and the approximation of its gradient may lead to more proposals being rejected (edit: and to the dynamics being kind of weird, e.g. not energy-conserving)?
This shouldn’t be a problem for a “direct method”, but I guess that method could/would be costlier? Is that so? For the heat equation you do exactly that, don’t you?
People have had success rolling their own explicit discretizations: Costs/benefits of solving large ODE systems as systems of difference equations/in discrete time
It’s things like this that make me interested in direct method stuff. I do like the exact gradients of the approximation thing, though it’s hard to know exactly how much trouble those gradients cause.
Edit: Last sentence is meant to say that I haven’t tried and I don’t know how to diagnose this, not that it isn’t important or can’t be diagnosed somehow.
Thanks, interesting link, I’ll have a look!
Edit: Without having looked more:
Looks like low-fidelity solutions are indeed feasible?
(Yes, of course this will not work if your PDE/ODE is badly conditioned).
Edit 2:
Looks also like there should be more/simpler integrators? Preferably some that preserve some structure, such that coarse approximations are still alright? Like e.g. L-stable methods (e.g. implicit Euler) for stiff problems (e.g. the heat equation) or symplectic methods (e.g. Leapfrog) for (separable) Hamiltonian systems (such as the planetary motion example).
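For example, for a linear system like the semi-discretized heat equation, implicit Euler only needs a linear solve per step; a minimal sketch (made-up names, not tested):

```stan
functions {
  // Backward (implicit) Euler for a linear system y' = A * y, e.g. A the
  // discrete Laplacian from a semi-discretized heat equation. Each step
  // solves (I - h * A) * y_next = y_cur. The method is L-stable, so
  // coarse step sizes stay well-behaved on stiff problems.
  array[] vector backward_euler(matrix A, vector y0, real h, int n_steps) {
    int N = rows(A);
    matrix[N, N] M = identity_matrix(N) - h * A;
    array[n_steps] vector[N] y;
    vector[N] y_cur = y0;
    for (n in 1:n_steps) {
      y_cur = mdivide_left(M, y_cur);  // linear solve instead of Newton
      y[n] = y_cur;
    }
    return y;
  }
}
```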
They’re definitely useful, but I’m still wary of adding a bunch to the Math library. @jtimonen has added a few to @spinkney’s helpful function repo, which is good: https://github.com/spinkney/helpful_stan_functions/tree/main/functions/ode
In terms of integrator development happening now, the adjoint sensitivity stuff that @wds15 posted is where we’re going. If you want to help out there, it’s quite a lot of C++ to dig through and whatnot but it is pretty interesting and it is always good to have more eyes on stuff.
Yea. If you implement a method that takes constant time steps h, you need to somehow interpolate to get the solution at the desired output points, which can be unequally spaced and not multiples of h. And this adds more computation and error (though the error of interpolation is not cumulative like it is for the ODE solution itself). In that repo I have linear and cubic interpolation.
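For completeness, the linear variant is roughly this (just a sketch of the idea for a scalar solution, not the exact code in the repo):

```stan
functions {
  // Linearly interpolate a solution stored on the equispaced grid
  // t0, t0 + h, t0 + 2h, ... onto arbitrary output times ts.
  vector interp_linear(vector y_grid, real t0, real h, array[] real ts) {
    int N = size(ts);
    int J = rows(y_grid);
    vector[N] y_out;
    for (n in 1:N) {
      int i = 1;  // index of the grid point just left of ts[n]
      real w;
      while (i < J - 1 && t0 + i * h <= ts[n]) {
        i += 1;
      }
      w = (ts[n] - (t0 + (i - 1) * h)) / h;  // weight on the right point
      y_out[n] = (1 - w) * y_grid[i] + w * y_grid[i + 1];
    }
    return y_out;
  }
}
```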
Alternatively, you could decide to always take K_i steps between time points i and i+1, so there is no need to interpolate. But if the data is really dense, then even K_i = 1 will be slow, as the total number of steps is so large. If the data is sparse compared to changes in the ODE solution, though, this might be the better option.
Another option is to take constant steps until the distance to the next output time point is d < h, and then take the last step with step size d. But this suffers from the same problem: you need to take at least as many steps as there are output time points.
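A minimal sketch of the first option (exactly K equal steps between consecutive output points, classical RK4, hypothetical test system) might look like this:

```stan
functions {
  // Hypothetical test system y' = -theta * y.
  vector decay(real t, vector y, real theta) {
    return -theta * y;
  }
  // RK4 with exactly K equal steps between consecutive output times,
  // so no interpolation is needed. Stan autodiffs through every step,
  // so the gradients are exact gradients of this approximate solution.
  array[] vector rk4_k_steps(vector y0, real t0, array[] real ts,
                             int K, real theta) {
    int N = size(ts);
    int D = rows(y0);
    array[N] vector[D] y_out;
    vector[D] y = y0;
    real t = t0;
    for (n in 1:N) {
      real h = (ts[n] - t) / K;
      for (j in 1:K) {
        vector[D] k1 = decay(t, y, theta);
        vector[D] k2 = decay(t + 0.5 * h, y + 0.5 * h * k1, theta);
        vector[D] k3 = decay(t + 0.5 * h, y + 0.5 * h * k2, theta);
        vector[D] k4 = decay(t + h, y + h * k3, theta);
        y += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4);
        t += h;
      }
      y_out[n] = y;
    }
    return y_out;
  }
}
```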
Something like this sounds plausible. I once added a print statement to my model block and it was printing hundreds of times during one iteration, without moving anywhere. So solving the system was probably still fast, but HMC was just so wrong that it couldn’t get any proposals accepted.
If that is so, and this is the main reason for the bad performance, then the adjoint mode will not necessarily save us, will it?
Well, it should work with low enough tolerances, and be faster than the current autodiff (which one is faster depends on the number of parameters and states).
If the problem is gradient precision, then no: the continuous adjoint method approximates the true gradients rather than giving us the gradients of the approximation, and it can be off.
But I don’t think we can firmly say that a single thing is the main reason for bad performance. With coarse tolerances, the bad gradients are probably one thing contributing to bad performance, but an imprecise forward solve might be as well.
It is tempting to write an adaptive-timestep solver and try to make the internal autodiff efficient. The solution would technically be non-differentiable because of the timestep control, but maybe there is a way to glue multiple solutions together so that it is actually differentiable.
Hm, you are right, I was a bit quick to conclude that.
In fact, I haven’t really seen high tolerances being an issue. Does a test case exist that showcases this?
You might be able to decide on a refinement pattern a priori, based on your measurements, i.e. assigning (proportionally) more work to regions in time where you have higher variation/curvature.
I think the best bang-for-your-buck practice in my experience (as @wds15 mentioned above) is scaling your parameters so they make sense on the unit normal scale, with generic std_normal() priors.
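Something like this, for a hypothetical rate constant expected to be around 0.1 (numbers made up):

```stan
parameters {
  real log_k_raw;  // lives on the unit normal scale
}
transformed parameters {
  real k = exp(-2.3 + 0.5 * log_k_raw);  // map back to the natural scale
}
model {
  log_k_raw ~ std_normal();
}
```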
Since this thread has many ODE-focused people on it… are PDEs a consideration for solving with Stan? To me this sounds infeasible, but maybe I am wrong. I am asking because I am right now driving the adjoint method, and one numerical control parameter I was thinking of exposing is, according to CVODES, useful in PDE applications (the so-called STALD option for BDF). For the gory details, see the CVODES manual. So far I would opt to leave that feature unavailable, as I would not expect any applications for it, but I can easily be wrong.
No you’re not. Math’s infrastructure is not designed for PDE.
I’m glad you’re dropping STALD for the moment. The original STALD paper (I think it’s in the references of the CVODES user guide) shows that it’s designed for specific problems (advection/convection) where spurious oscillations can happen when certain BDF methods are used, so it is unlikely to benefit problems like the heat equation @jtimonen posted earlier, and most flow solvers nowadays no longer use it anyway. In addition, by assuming small problem sizes and using a serial vector, Stan’s CVODES and adjoint implementation becomes irrelevant for most real-world PDE problems. Again, it comes down to infrastructure.
IMO the best bet is to interface external solvers, as we’ve seen here.