ODE solver behavior change from Math v3.2.0 to v3.3.0

Barely, since the solution at the end is on the same scale as atol=1e-8. But I’m not getting your point: a relaxed tolerance is of course faster, but that’s not why we did the above comparison toggling CVodeSetSensErrCon on and off.

Well, I guess since things are more conservative now, 1e-10/1e-10 is a higher-precision solve than it was previously. In your case, the timestep is apparently limited by the precision of the sensitivities (which we need to be accurate for HMC). This means you should be able to get away with looser tolerances now, because effectively you were running at lower precision before.

What exactly those lower tolerances correspond to I do not know. I assume it’s not 1e-4/1e-8, but I do not know. That is why we’re comparing performance between low- and high-tolerance calculations, at least.

So I wanted to come back to this and see whether we need to change things or not. What I did is take the harmonic oscillator example, for which we have analytical solutions. I then compared the absolute errors of the ODE solution (state and parameter sensitivities) against the analytical solution. I repeated this for Adams and BDF, for the current master (2.25.0) and for a patched version of 2.25.0 (which restores the old behaviour of not including the sensitivities in the error check):

BDF: [absolute-error plot not reproduced]

Adams: [absolute-error plot not reproduced]

In this example I set the target absolute error to 1E-7 (the line at -7). One can see that the patched version gets considerably worse actual precision, about an order of magnitude worse. Calculating the mean of the log absolute error confirms this:

for Adams:
  name     master patched
  <chr>     <dbl>   <dbl>
1 y1        -7.09   -6.03
2 y1_chi    -7.53   -6.46
3 y1_omega  -5.94   -4.93
4 y2        -7.02   -6.03
5 y2_chi    -7.47   -6.46
6 y2_omega  -5.90   -4.90


for BDF:
  name     master patched
  <chr>     <dbl>   <dbl>
1 y1        -6.53   -5.67
2 y1_chi    -6.97   -6.11
3 y1_omega  -5.42   -4.55
4 y2        -6.48   -5.58
5 y2_chi    -6.93   -6.03
6 y2_omega  -5.31   -4.45

So basically the new version gives us an additional order of magnitude of actual precision when we set the same absolute precision target.
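For reference, the kind of error measurement used here can be sketched as follows, in Python/SciPy rather than Stan/CVODES, and with a single made-up parameter value rather than sampled ones (tolerances and time grid are illustrative only):

```python
# Sketch of the error measurement: solve the harmonic oscillator numerically,
# compare against the analytic solution, and report the mean log10 absolute
# error. Uses SciPy's BDF as a stand-in for Stan's CVODES BDF solver.
import numpy as np
from scipy.integrate import solve_ivp

omega = 1.0  # hypothetical parameter value, fixed for illustration

def rhs(t, y):
    # Harmonic oscillator y'' = -omega^2 y written as a first-order system
    return [y[1], -omega**2 * y[0]]

t_eval = np.linspace(0.1, 10.0, 100)
sol = solve_ivp(rhs, (0.0, 10.0), [1.0, 0.0], method="BDF",
                rtol=1e-6, atol=1e-7, t_eval=t_eval)

# Analytic solution for y(0) = 1, y'(0) = 0: y1(t) = cos(omega t)
abs_err = np.abs(sol.y[0] - np.cos(omega * t_eval))
mean_log_err = np.mean(np.log10(abs_err + 1e-300))
print(f"mean log10 |error| in y1: {mean_log_err:.2f}")
```

The mean log10 absolute error is the same summary statistic as in the tables above, just computed for one state of one solver configuration.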

If I now run the above chemical reaction network example with the original precision targets of

rel_tol <- 1E-8
abs_tol <- 1E-10

then I get these execution times for the patched version:

patched with higher precision:
 Elapsed Time: 72.897 seconds (Warm-up)
               53.885 seconds (Sampling)
               126.782 seconds (Total)

using precision targets one order of magnitude looser:

rel_tol <- 1E-7
abs_tol <- 1E-9

I get for the current master version this timing:

master with lower precision:
 Elapsed Time: 75.396 seconds (Warm-up)
               58.231 seconds (Sampling)
               133.627 seconds (Total)

which is a very comparable performance.

To me this suggests that our new default tolerance scheme brings us closer to what we promise the user. With the old scheme the errors were about an order of magnitude worse than requested. If one wants to get the old speed back, one simply needs to loosen the precision targets.

This is yet another example showing how sensitive ODE performance is to tolerances. I would suggest we try enabling users to set the absolute tolerance per state variable, which CVODES supports. This is obviously a super important (subject-specific) tuning parameter.
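As a rough illustration of what per-state absolute tolerances buy you (CVODES exposes this via `CVodeSVtolerances`), here is a sketch using SciPy, whose `solve_ivp` also accepts a vector `atol`; the states and scales are made up:

```python
# Per-state absolute tolerances: one entry of `atol` per state. Useful when
# states live on very different scales, as in many reaction networks.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    # Two decaying states on very different scales: y0 ~ O(1), y1 ~ O(1e-8)
    return [-y[0], -y[1]]

atol = np.array([1e-8, 1e-16])  # much tighter tolerance for the tiny state
sol = solve_ivp(rhs, (0.0, 1.0), [1.0, 1e-8], method="BDF",
                rtol=1e-6, atol=atol)
print(sol.y[:, -1])  # both states decayed by a factor of exp(-1)
```

With a single scalar `atol`, the tiny state would effectively be excluded from the error control.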

Other thoughts?

@rok_cesnovar these tests were run with the export stuff you posted… very convenient…


I’m assuming the difference between “master” and “patched” is the CVodeSetSensErrCon switch @bbbales2 mentioned above. In that case, there is a misconception and a fallacy in this reply. Let’s first talk about the fallacy.

The harmonic oscillator is not supposed to be solved by either bdf or adams; plain rk45 would do. So I don’t care which BDF implementation does a better job on this problem. More importantly, using a benchmark from this model as proof that “master” gets us “an additional order of precision” for all problems is simply wrong, especially when applying this claim to the stiff problems BDF is supposed to tackle. I’ll show a counterexample in a sec.

Now the misconception:

Given a method, nothing brings us an “additional order of precision” except possibly tuning the tolerances. Trying to achieve an “additional order of precision” by switching on sensitivity error control is, in general practice, hopeless.

So I think using a nicely behaved oscillation problem to show that BDF & Adams are more accurate in “master”, and then loosening the tolerance in another, stiff problem in order to match wall-time performance between “master” & “patched”, is bad benchmarking practice.

Counterexample of “additional order of precision”. The example can be so trivial it doesn’t even need to be coupled:

\frac{d}{dt}(y_1, y_2, y_3)^T = (-15y_1, -20y_2, -30y_3)^T,\\ t\in (0, 1), \\ (y_1(0), y_2(0), y_3(0)) = (1, 1, 1)

Analytical solution at t=1 is:

y_1(1) = 3.0590232 × 10^{-7}\\ y_2(1) = 2.0611536 × 10^{-9}\\ y_3(1) = 9.3576229 × 10^{-14}

Since the problem is stiff, we solve it using BDF. The old version with rtol = 1.E-9, atol = 1.E-13 gives us

y1: 3.059026e-07
y2: 2.061165e-09
y3: 9.358256e-14

If the new version really gave an “additional order of precision”, we should be able to get similar accuracy with rtol=1.E-8, atol=1.E-12, and yet we get

y1: 3.05904e-07
y2: 2.061223e-09
y3: 9.361747e-14

which loses 1-2 significant figures compared to the previous tolerances. This trivial example shows the “additional order of precision” claim is simply wrong.
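For anyone wanting to reproduce this, here is a sketch in Python/SciPy (standing in for Stan’s CVODES BDF solver, so the exact digits will differ) that checks the decay system against its analytic solution y_i(1) = exp(-k_i):

```python
# Trivial decoupled stiff decay problem from above, solved with BDF and
# compared to the analytic solution at t = 1.
import numpy as np
from scipy.integrate import solve_ivp

rates = np.array([15.0, 20.0, 30.0])

def rhs(t, y):
    return -rates * y  # fully decoupled linear decay

sol = solve_ivp(rhs, (0.0, 1.0), [1.0, 1.0, 1.0], method="BDF",
                rtol=1e-9, atol=1e-13)
y_exact = np.exp(-rates)  # 3.0590232e-07, 2.0611536e-09, 9.3576229e-14
print(sol.y[:, -1])
print(y_exact)
```

The exact values printed for `y_exact` match the analytic solution quoted above.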


I picked the example because the analytic solution is coded up and worked out. If that is over-simplified in your eyes, then so be it. Even if you would not choose these algorithms for this problem, it can still serve as an example here, since I do not care about performance in this evaluation. I would certainly not claim any “proof” from what I did that would generalize to arbitrary problems, not at all. However, I still think it has been a very valuable evaluation of the new error-tolerance scheme, and I would expect the behavior seen in this example to apply to many other problems as well.

My conclusion does not seem too far off, given that with tolerances one order of magnitude looser you get comparable performance and results. From the example I provided I would also expect a solution with similar actual accuracy, though we definitely do not know that for sure, of course.

The example you give lacks, to me, a crucial point in its evaluation: within the framework of the Stan sampler we throw tons of parameter values at the solver, at vastly different scales, so any specific parameter configuration is almost meaningless. The solver must be able to cope with parameter-space exploration during warmup, and that can get very wild. I forgot to specify it in the example, but I used log-normal distributions for the parameters.

I still do think that the change we made towards tighter error control is absolutely fine. The user can always loosen the tolerance targets to recover the old performance (and also the old actual precision). At least that is what my case study has shown, and I would expect it applies to many other examples as well. Given that absolute tolerances are so critical for good performance, I think we should allow users to pass, instead of a scalar absolute tolerance, a vector of length equal to the number of states, controlling the absolute tolerance for each state. Such a feature looks doable to me and could give advanced users a lot more control over the integration.


Hey @wds15, can I revive this thread to ask you whether you remember what exactly happened to Adams-Moulton? Stan sampled very badly? I wouldn’t think that the ODE solutions diverged?

The Sundials solvers were changed to be more strict with error-tolerance checks. With the release you mention we started to include the sensitivities in the error checks, whereas before it was a 2-step process (first get the main states to converge, then converge the sensitivities; now we ask CVODES to get everything to converge in a single go).

But yes, AM gave me nothing useful unless the tolerances were set very tight in the old scheme. With the new scheme, AM works for reasonable tolerances.

I also did some simulations which showed me that you can basically view the new setting as a more conservative approach and set your tolerances somewhat looser to get the same speed as before (and similar actual precision).

How come you’re getting back to this?

This means what happened exactly? Stan just kept lowering the stepsize trying to achieve some acceptance rate, but failed?

Yeah, I believe that.

I was worried that weird things could happen for models involving adaptive ODE-solvers. @jtimonen found this thread. Before that we had been discussing whether it’s sensible to consider the sensitivities in the stepsize adaptation. I believe it is not, partially because (I believe) this makes it impossible to recover the exact ODE trajectory considered during HMC sampling using CmdStan.

AM did not return a useful solution with the old pattern unless setting tolerance around 10^-10 when run onto problems I come across.

(I edited/expanded the above response.)

This means that Stan did not explore the posterior? Or does this mean that the individual ODE trajectories were wrong?

Yes, crazy large RHat.


Thanks!

So does Stan just use the same tolerance you specify for the states on the augmented sensitivity states? Or do the ODE solvers in Stan not consider the error in the augmented sensitivity states when deciding step size? I skimmed the CVODES manual and saw that there’s an option for both.

Last I looked, Stan’s rk45 just uses the standard error control for the augmented system. I don’t know whether it uses the exact same tolerances for the sensitivities as for the states, but either way the returned states differ depending on whether the system gets solved with sensitivities (e.g. during HMC) or without (e.g. any time CmdStan outputs anything).

For the cvodes solvers @jtimonen had a closer look, and I think starting at the version change discussed in this thread, Stan’s CVODES solvers use the sensitivity error estimates for the solver adaptation (I think that’s stepsize and order).

AFAIK, even if you tell CVODES not to consider the sensitivities for “error control” (via the error estimates), the sensitivities are still considered to check “convergence” (of the nonlinear solve I think) and can thus impact solver adaptation indirectly.
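As a rough illustration of why including sensitivities in the error control tends to cost steps: SciPy’s solvers always control error on all supplied states, so manually augmenting the state vector with the forward sensitivities mimics CVODES’ full sensitivity error control (the system, names, and tolerances below are mine, not Stan’s):

```python
# Compare step counts for the plain harmonic oscillator vs the same system
# augmented with its forward sensitivities w.r.t. omega, at equal tolerances.
import numpy as np
from scipy.integrate import solve_ivp

omega = 1.0

def plain(t, y):
    return [y[1], -omega**2 * y[0]]

def augmented(t, y):
    # s = dy/domega satisfies: s1' = s2, s2' = -omega^2 s1 - 2 omega y1
    y1, y2, s1, s2 = y
    return [y2, -omega**2 * y1, s2, -omega**2 * s1 - 2 * omega * y1]

kw = dict(method="BDF", rtol=1e-8, atol=1e-8)
sol_plain = solve_ivp(plain, (0.0, 20.0), [1.0, 0.0], **kw)
sol_aug = solve_ivp(augmented, (0.0, 20.0), [1.0, 0.0, 0.0, 0.0], **kw)
print(len(sol_plain.t), len(sol_aug.t))
```

The sensitivity s1(t) = -t sin(omega t) grows secularly, so the augmented run typically needs at least as many steps as the plain one at the same tolerances.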

But someone else (eg someone who did the Stan implementation such as @wds15 or @yizhang?) can probably tell you more.

That’s correctly summarised.

Ah, I see. Thanks for the explanation. In theory, it should be possible to not consider the tolerances for the sensitivities in the nonlinear solve either, since the nonlinear solve is probably just a Newton solver that makes sure the errors for all states are below a certain threshold. I have no idea how much that would buy you, though.


Yes, AFAIK it’s a modified Newton iteration where the Jacobian gets updated only as needed, but I’m not sure.

I think ideally we should be able to reproduce the states exactly, independent of whether the augmented system gets solved or not. I think for “the” adjoint method this is generally so, because the sensitivities only come into play in the reverse pass.

For (adaptive) implicit forward methods I think an additional potential problem is that the nonlinear solve will introduce additional gradient errors, which can then apparently make hmc inefficient.

Yeah, that would be nice. Sometimes I implement my ODE with the deSolve package in R, since it gives good diagnostics of how many steps the solver is taking and how many Jacobian calculations it performs.

I think so too. It’s the same problem as with the adaptive stepsize, where branching and while loops cause discontinuous likelihoods: if you move a little in parameter space, the Newton iteration might need an extra iteration. The upside is that it is only discontinuous up to a certain precision. Technically, I think you could argue even explicit methods are discontinuous if you’re talking about continuity beyond numerical precision. One of the error codes you can sometimes get with BDF is that you’ve asked for more precision in the solution than the machine allows.

Yeah, it has happened to me as well that for simple problems I could not just restart my solver (rk45 I think) at t=1 because a step size of 1e-16 was too large, which is just beyond unreasonable.

But aside from obvious failures such as these where the solver complains, this is also indicative of the solvers sometimes spending a lot of work on what I presume has to be excessive precision.


Currently Stan uses a “staggered” approach in which states & sensitivities are solved one after the other. We cannot do that without considering nonlinear solver tolerance control.

I don’t think this is possible.

Why is it limited to forward sensitivity methods? In the adjoint solver we still need to solve states & adjoints using the nonlinear solver; wouldn’t those be “additional gradient errors”? In addition, one actually IS able to control the nonlinear-solver residual error by adjusting the safety factor, but again it could make the solution inefficient in another way.

CVODES can do that, and odeint allows the user to write an output function to do it; the problem is that it would clutter the Stan output. IIRC there’s a GitHub discussion about this.

This is because rk45 uses a default starting step size without apparent justification, and given rk45’s specific step-control algorithm it can be blown out of the water. odeint does allow specifying a minimum step size, which we didn’t implement. If you can share the simple ODE so we can add it to the unit tests, it will help future diagnostics. This is part of why I overhauled the ODE unit-test system to make it easier to add new ODEs.
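As an aside, adaptive solvers generally let you override the automatic initial-step heuristic; here is a sketch using SciPy’s RK45 (in place of odeint, and with a made-up decay ODE) restarting an integration at t = 1 with an explicit first_step:

```python
# Restart an integration at t = 1, supplying first_step explicitly instead of
# relying on the solver's automatic initial-step heuristic.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y):
    return [-y[0]]  # simple decay, exact solution y(t) = exp(-t)

sol = solve_ivp(rhs, (1.0, 2.0), [np.exp(-1.0)], method="RK45",
                rtol=1e-8, atol=1e-10, first_step=1e-3)
print(sol.y[0, -1])  # close to exp(-2)
```

SciPy also accepts a `max_step` option; a minimum step size is, as noted above, the piece odeint offers but Stan’s wrapper does not expose.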

In general we are stuck in the tension between flexibility and user-friendliness. On one hand we want to give non-expert users easy access to the solvers; on the other hand we want to allow power users to do knob-turning for better performance, as CVODES actually enables us to. It’s less a solver issue than a language issue. Something like

cvodes_with_options {
  set_cvodes_option_foo;
  set_cvodes_option_bar;
  baz = ode_bdf(....);
}

could help ameliorate it.
