ODE solver behavior change from Math v3.2.0 to v3.3.0

I consider the following an important behavior change worth investigating. @bbbales2 @wds15 @syclik

#include <stan/math.hpp>
#include <test/unit/util.hpp>
#include <test/unit/math/prim/functor/lorenz.hpp>
#include <gtest/gtest.h>
#include <iomanip>
#include <iostream>
#include <vector>

TEST(lorenz, bdf) {
  const lorenz_ode_fun f;
  std::vector<double> ts{0.1, 10.0};
  std::vector<double> theta{10.0, 28.0, 8.0/3.0};
  std::vector<double> y0{10.0, 1.0, 1.0};
  std::vector<stan::math::var> yv{10.0, 1.0, 1.0};
  std::vector<double> x_r;
  std::vector<int> x_i;

  auto y1 = stan::math::integrate_ode_bdf(f, yv, 0.0, ts, theta, x_r, x_i);
  auto y2 = stan::math::integrate_ode_bdf(f, y0, 0.0, ts, theta, x_r, x_i);
  std::cout << "y1[2] =  " << std::setprecision(12) << " " << y1[1][2] << "\n";
  std::cout << "y2[2] =  " << std::setprecision(12) << " " << y2[1][2] << "\n";
}

v3.2.0 result

y1[2] =   19.5720116845
y2[2] =   19.5720116845

v3.3.0 result

y1[2] =   19.5720117279
y2[2] =   19.5720116845

My system:

bash-3.2$ uname -a
Darwin Yis-MacBook-Pro.local 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64
bash-3.2$ clang++ --version
Apple clang version 11.0.0 (clang-1100.0.33.17)

This is most likely related to the switch to the variadic interface. These functions were also deprecated in 3.3.0.

There was also an Eigen upgrade in between. But the integrate_ode functions do not use Eigen, right?


There was a change to the default tolerances as well… and even with the same tolerances set explicitly, we changed how CVODES determines convergence (in the older version the sensitivities were not part of the convergence assessment of the Newton step; now they are, so it's more conservative).

Given these changes I think we are good?


I think it’s probably this.

This used to be false but now it’s true: https://github.com/stan-dev/math/blob/develop/stan/math/rev/functor/cvodes_integrator.hpp#L299

The CVODES docs:

By default, errconS is set to SUNFALSE. If errconS=SUNTRUE then both state variables
and sensitivity variables are included in the error tests. If errconS=SUNFALSE then
the sensitivity variables are excluded from the error tests. Note that, in any event, all
variables are considered in the convergence tests.

You could flip that to SUNFALSE and see if the results match again. The difference between the prim and var results here is annoying, though. Flipping this switch made Adams behave better, right @wds15?

Regarding the tolerance change, I checked and we still have 1e-10/1e-10 in both the new and the old BDF interfaces. I forget whether we actually made changes there. I remember discussing it, but maybe it didn't go through.


Thanks. I can confirm this is the cause, which brings us to another observation: there's a performance regression of the BDF solver compared with Torsten's implementation.

Example:

#include <stan/math.hpp>
#include <gtest/gtest.h>
#include <chrono>
#include <iostream>
#include <vector>

template <typename T0, typename T1, typename T2>
inline std::vector<stan::return_type_t<T1, T2>>
chem(const T0& t_in, const std::vector<T1>& y,
     const std::vector<T2>& theta, const std::vector<double>& x_r,
     const std::vector<int>& x_i) {
  std::vector<stan::return_type_t<T1, T2>> dxdt(3);
  const T2& p1 = theta[0];
  const T2& p2 = theta[1];
  const T2& p3 = theta[2];
  dxdt[0] = -p1*y[0] + p2*y[1]*y[2];
  dxdt[1] =  p1*y[0] - p2*y[1]*y[2] - p3*y[1]*y[1];
  dxdt[2] =  p3*y[1]*y[1];
  return dxdt;
}

struct chem_fun {
  template <typename T0, typename T1, typename T2>
  inline std::vector<stan::return_type_t<T1, T2>>
  operator()(const T0& t_in, const std::vector<T1>& y_in,
             const std::vector<T2>& theta, const std::vector<double>& x,
             const std::vector<int>& x_int, std::ostream* msgs) const {
    return chem(t_in, y_in, theta, x, x_int);
  }
};

TEST(chem, bdf) {
  const chem_fun f;
  std::vector<double> ts{0.1, 40000.0};
  std::vector<double> theta{0.04, 1.0e4, 3.0e7};
  std::vector<double> y0{1.0, 0.0, 0.0};
  std::vector<stan::math::var> yv(stan::math::to_var(y0));
  std::vector<double> x_r;
  std::vector<int> x_i;

  std::chrono::time_point<std::chrono::system_clock> start, end;
  std::chrono::duration<double> elapsed;
  start = std::chrono::system_clock::now();
  for (int i = 0; i < 100; ++i) {
    auto y1 = stan::math::integrate_ode_bdf(f, yv, 0.0, ts, theta, x_r, x_i);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "fwd sens sol elapsed time: " << elapsed.count() << "s\n";
}

with CVodeSetSensErrCon clause

fwd sens sol elapsed time: 0.520198s

w/o CVodeSetSensErrCon clause

fwd sens sol elapsed time: 0.380893s

So we are looking at a >30% performance drop.

No, I’m afraid not.

In general, I wonder about the basis of this change, as the CVODES doc specifically points out it is no silver bullet. Was the change introduced in the variadic PR? I’d expect an API feature PR to be separated from something that bifurcates numerical performance.

The performance drop was noted in the release notes (see the last bullet in https://github.com/stan-dev/math/releases/tag/v3.3.0), which is why we suggest users switch to the new ODE interface.

Yup, that change was done in the variadic PR. During that PR we noticed that Adams-Moulton derailed entirely with the old scheme, which is why it was changed along with it. The slowdown of BDF was seen and judged to be ok, given that Adams became a lot more robust (Adams went from "does not work" to "is ok").

One could have done that particular change in a separate PR, but the code changed massively anyway, and robustness usually trumps speed.

Oh sure, if this is something that is messing up performance in a big way (enough to count as a bug), we can roll it back. As I remember off the top of my head, we made the decision because the Adams solver seemed to work better in an example, and including the sensitivities in the error test seemed to make sense. We need accurate sensitivities for HMC to work, so it seems nice that when the user specifies tolerances, they’re controlling the sensitivities directly too.

Given it’s a no-silver-bullet thing though, I like having control over the sensitivity accuracy. Can you make up your performance loss by coarsening your tolerances now?

Like Rok said, there is a pretty sizable performance regression using the old ODE interface now that it’s just a wrapper around the new ODE interface (some costly copying), but if you’re testing exactly the same code and the same Math branch with CVodeSetSensErrCon on and off, then that’ll be the difference.

It has nothing to do with the interface: the old interface is a wrapper around the new implementation, and I’m simply toggling one line on/off, not comparing old v3.2.0 against new v3.3.0.

I wouldn’t call it ok if our only stiff solver loses horsepower, much as I wouldn’t cut my arms off to get through a narrow entrance. Have any other routes been explored?

FWIW, here’s a full model based on the above ODE and the new solver interface. The model is adapted from my StanCon UK tutorial https://github.com/metrumresearchgroup/torsten_stancon_2019_tutorial/tree/master/RScripts/model/chemical_reactions

functions{
  vector reaction(real t, vector x, real[] p){
    vector[3] dxdt;
    real p1 = p[1];
    real p2 = p[2];
    real p3 = p[3];
    dxdt[1] = -p1*x[1] + p2*x[2]*x[3];
    dxdt[2] =  p1*x[1] - p2*x[2]*x[3] - p3*(x[2])^2;
    dxdt[3] =  p3*(x[2])^2;
    return dxdt;
  }
}

data {
  int<lower=1> nsub;
  int<lower=1> len[nsub];  
  int<lower=1> ntot;  
  real ts[ntot];
  real obs[ntot];
}

transformed data {
  int i1[nsub];
  int i2[nsub];
  real t0 = 0.0;
  real xr[0];
  int xi[0];
  i1[1] = 1;
  i2[1] = len[1];
  for (i in 2:nsub) {
    i1[i] = i2[i-1] + 1;
    i2[i] = i1[i] + len[i] - 1;
  }
}

parameters {
  /*  p1=0.04, p2=1e4, and p3=3e7 */
  real<lower = 0> y0_mu;
  real<lower = 0> y0_1[nsub];
  real<lower = 0> sigma;
}

transformed parameters {
  vector[3] y0;
  vector[3] x[ntot];
  vector[ntot] x3;
  real theta[3] = {0.04, 1.0e4, 3.0e7};
  for (i in 1:nsub) {
    y0[1] = y0_1[i];
    y0[2] = 0.0;
    y0[3] = 0.0;
    x[i1[i]:i2[i]] = ode_bdf_tol(reaction, y0, t0, ts[i1[i]:i2[i]], 1.e-8, 1.e-10, 10000, theta);
  }
  for (i in 1:ntot) {
    x3[i] = x[i][3];
  }
}

model {
  y0_mu ~ lognormal(log(2.0), 0.5);
  for (i in 1:nsub) {
    y0_1[i] ~ lognormal(y0_mu, 0.5);    
  }
  sigma ~ cauchy(0, 1); 
  obs ~ lognormal(log(x3), sigma);
}

data.R

nsub <- 8
len <- c(5, 5, 5, 5, 5, 5, 5, 5)
ntot <- 40
ts <- c(4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03)
obs <- c(1.48e-02, 9.50e-02, 2.70e-01, 5.80e-01, 8.29e-01,
         1.38e-02, 9.41e-02, 2.60e-01, 5.20e-01, 8.10e-01,
         1.35e-02, 9.44e-02, 2.90e-01, 5.40e-01, 8.10e-01,
         1.50e-02, 9.45e-02, 2.88e-01, 5.58e-01, 8.49e-01,
         1.60e-02, 9.46e-02, 2.81e-01, 5.53e-01, 8.16e-01,
         1.29e-02, 9.48e-02, 2.90e-01, 5.51e-01, 8.13e-01,
         1.40e-02, 9.50e-02, 2.54e-01, 5.49e-01, 8.13e-01,
         1.45e-02, 9.10e-02, 2.19e-01, 5.58e-01, 8.68e-01)

init.R

y0_mu <- 2.0
y0_1 <- c(2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0)
sigma <- 0.5

with CVodeSetSensErrCon clause in stan_math

 Elapsed Time: 322.134 seconds (Warm-up)
               309.048 seconds (Sampling)
               631.182 seconds (Total)

w/o CVodeSetSensErrCon clause in stan_math

 Elapsed Time: 225.487 seconds (Warm-up)
               220.647 seconds (Sampling)
               446.134 seconds (Total)

The performance drops ~40%.

No, see above. I can also dial down the absolute tolerance, but I don’t think it’s supposed to help.

So the solve is no faster with 1e-8/1e-10 than with 1e-10/1e-10? Can you try a few other tolerances and see if the behavior changes? If the decision is arbitrary and there is no obvious way this should go, I think we need to talk it out before we throw the switch back in the other direction.

If this is such an important switch that models are either working or not working based on where it is, we should probably consider exposing it at the API level, which seems really yucky to me, but maybe it has to be that way.

Also, thanks for posting the example code. I would normally not be so lazy and would run it a few times myself and do these benchmarks, but I’m a bit behind on other stuff for the release.

Sure, I can try them out later. Note that this model is by no means special, just a typical stiff ODE system on which BDF should prevail. I ran into the issue when using it to benchmark Torsten’s ODE solvers against Stan’s, and it leaves me with the impression that the impact of adding that error test has not been fully evaluated.


No, the error test has been evaluated. I was aware of the drop in performance, but I don’t recall it being that bad. As @bbbales2 suggests, it would be good to see this at different tolerance targets as well.

I would not mind flipping the behaviour for BDF back to what it was, but we can’t do that for Adams-Moulton as it stands. If there is another way to get Adams-Moulton stable (without requiring super tight tolerance targets), then we can also switch it there. It would be preferable to have the same behaviour for both solvers, but I don’t quite see how that can be achieved.

Exposing this switch to users… I hope we don’t need to do that after all. It’s a nasty detail people should not have to deal with.

Is the drop documented somewhere I can take a look? I’m a bit concerned about the decision-making process here. If we celebrate 10-15% performance improvements on past features, a drop like this should never be considered ok.

You can dig up the PRs for this. The thing is that Adams-Moulton has always been broken and this choice made things more robust. Numerical stability trumps speed. I would expect the performance drop to be less severe at less tight tolerances - but maybe you can investigate that?

I would not give up speed easily, but in this case it was for greater robustness, which is ok. The story is that we worked on the variadic ODE interface and in the course of that benchmarked this thing thoroughly. This uncovered that Adams-Moulton just does not work reliably with the implemented defaults. As a result, we changed the error tolerances to include the sensitivities. This made things more robust for Adams-Moulton, which then actually worked. To keep things consistent, the BDF solver was switched as well.

I am not sure what the use is of digging that up - we should rather spend time defining good benchmarks for ODE problems. Our current benchmarks are somewhat limited.

I think an additional argument was that this way of including the sensitivities in the tolerance calculations was similar to how RK45 does it.

FWIW, with relaxed tolerances (rtol=1e-4, atol=1e-8) the drop is still ~40%. I’ll leave it here for revisiting, as I’m behind on some tasks.
with CVodeSetSensErrCon clause in stan_math

$ ./chem sample data file=chem.data.R init=chem.init.R random seed=987
....
 Elapsed Time: 119.652 seconds (Warm-up)
               113.899 seconds (Sampling)
               233.551 seconds (Total)

w/o CVodeSetSensErrCon clause in stan_math

 Elapsed Time: 69.675 seconds (Warm-up)
               69.854 seconds (Sampling)
               139.529 seconds (Total)

Right right, but the 1e-4/1e-8 solve is now faster than the 1e-10/1e-10 solve was previously.

Is the 1e-4/1e-8 solution usable? Was the attraction of 1e-10/1e-10 to begin with just that you wouldn’t need to worry about solution accuracy so much?