ODE solver behavior change from Math v3.2.0 to v3.3.0

I consider the following an important behavior change worth investigating. @bbbales2 @wds15 @syclik

#include <stan/math.hpp>
#include <test/unit/util.hpp>
#include <test/unit/math/prim/functor/lorenz.hpp>
#include <gtest/gtest.h>
#include <iomanip>
#include <iostream>
#include <vector>

TEST(lorenz, bdf) {
  const lorenz_ode_fun f;
  std::vector<double> ts{0.1, 10.0};
  std::vector<double> theta{10.0, 28.0, 8.0/3.0};
  std::vector<double> y0{10.0, 1.0, 1.0};
  std::vector<stan::math::var> yv{10.0, 1.0, 1.0};
  std::vector<double> x_r;
  std::vector<int> x_i;

  auto y1 = stan::math::integrate_ode_bdf(f, yv, 0.0, ts, theta, x_r, x_i);
  auto y2 = stan::math::integrate_ode_bdf(f, y0, 0.0, ts, theta, x_r, x_i);
  std::cout << "y1[2] =  " << std::setprecision(12) << " " << y1[1][2] << "\n";
  std::cout << "y2[2] =  " << std::setprecision(12) << " " << y2[1][2] << "\n";
}

v3.2.0 result

y1[2] =   19.5720116845
y2[2] =   19.5720116845

v3.3.0 result

y1[2] =   19.5720117279
y2[2] =   19.5720116845

My system:

bash-3.2$ uname -a
Darwin Yis-MacBook-Pro.local 18.7.0 Darwin Kernel Version 18.7.0: Tue Aug 20 16:57:14 PDT 2019; root:xnu-4903.271.2~2/RELEASE_X86_64 x86_64
bash-3.2$ clang++ --version
Apple clang version 11.0.0 (clang-1100.0.33.17)

This is most likely related to the switch to the variadic interface. These functions were also deprecated in 3.3.0.

There was also an Eigen upgrade in between. But the integrate_ode functions do not use Eigen, right?


There was a change to the default tolerances as well… and even with the same tolerances set explicitly, we changed how CVODES determines convergence (in the older version the sensitivities were not part of the convergence assessment of the Newton step; now they are, so it's more conservative).

Given these changes I think we are good?


I think it’s probably this.

This used to be false but now it’s true: https://github.com/stan-dev/math/blob/develop/stan/math/rev/functor/cvodes_integrator.hpp#L299

The CVODES docs:

By default, errconS is set to SUNFALSE. If errconS=SUNTRUE then both state variables
and sensitivity variables are included in the error tests. If errconS=SUNFALSE then
the sensitivity variables are excluded from the error tests. Note that, in any event, all
variables are considered in the convergence tests.

You could flip that to SUNFALSE and see if the results match again. The difference between the prim and var results here is annoying, though. Flipping this switch made Adams behave better, right @wds15?

Regarding the tolerance change, I checked and we still have 1e-10/1e-10 in both the new and the old BDF interfaces. I forget whether we actually made changes there. I remember discussing it, but maybe it didn't go through.


Thanks. I can confirm this is the cause, which brings us to another observation: there's a performance regression of the BDF solver compared with Torsten's implementation.

Example:

#include <stan/math.hpp>
#include <gtest/gtest.h>
#include <chrono>
#include <iostream>
#include <vector>

template <typename T0, typename T1, typename T2>
inline std::vector<stan::return_type_t<T1, T2>>
chem(const T0& t_in, const std::vector<T1>& y,
     const std::vector<T2>& theta, const std::vector<double>& x_r,
     const std::vector<int>& x_i) {
  std::vector<stan::return_type_t<T1, T2>> dxdt(3);
  const T2& p1 = theta[0];
  const T2& p2 = theta[1];
  const T2& p3 = theta[2];
  dxdt[0] = -p1*y[0] + p2*y[1]*y[2];
  dxdt[1] =  p1*y[0] - p2*y[1]*y[2] - p3*y[1]*y[1];
  dxdt[2] =  p3*y[1]*y[1];
  return dxdt;
}

struct chem_fun {
  template <typename T0, typename T1, typename T2>
  inline std::vector<stan::return_type_t<T1, T2>>
  operator()(const T0& t_in, const std::vector<T1>& y_in,
             const std::vector<T2>& theta, const std::vector<double>& x,
             const std::vector<int>& x_int, std::ostream* msgs) const {
    return chem(t_in, y_in, theta, x, x_int);
  }
};

TEST(chem, bdf) {
  const chem_fun f;
  std::vector<double> ts{0.1, 40000.0};
  std::vector<double> theta{0.04, 1.0e4, 3.0e7};
  std::vector<double> y0{1.0, 0.0, 0.0};
  std::vector<stan::math::var> yv(stan::math::to_var(y0));
  std::vector<double> x_r;
  std::vector<int> x_i;

  std::chrono::time_point<std::chrono::system_clock> start, end;
  std::chrono::duration<double> elapsed;
  start = std::chrono::system_clock::now();
  for (int i = 0; i < 100; ++i) {
    auto y1 = stan::math::integrate_ode_bdf(f, yv, 0.0, ts, theta, x_r, x_i);
  }
  end = std::chrono::system_clock::now();
  elapsed = end - start;
  std::cout << "fwd sens sol elapsed time: " << elapsed.count() << "s\n";
}

with CVodeSetSensErrCon clause

fwd sens sol elapsed time: 0.520198s

w/o CVodeSetSensErrCon clause

fwd sens sol elapsed time: 0.380893s

So we are looking at a >30% performance drop.

No, I’m afraid not.

In general, I wonder about the basis of this change, as the CVODES doc specifically points out it is no silver bullet. Was the change introduced in the variadic PR? I’d expect an API feature PR to be separated from something that bifurcates numerical performance.

The performance drop was noted in the release notes (see the last bullet in https://github.com/stan-dev/math/releases/tag/v3.3.0), which is why we suggest users switch to the new ODE interface.

Yup, that change was done in the variadic PR. During that PR we noticed that Adams-Moulton derailed entirely with the old scheme, which is why it was changed along with it. The slowdown of BDF was seen and judged to be ok, given that Adams became a lot more robust (Adams went from "does not work" to "is ok").

One could have done that particular change in a separate PR, but the code changed massively anyway, and robustness usually trumps speed.

Oh sure, if this is something that is messing up performance in a big way (enough to count as a bug), we can roll it back. As I remember off the top of my head, we made the decision because the Adams solver seemed to work better in an example, and including the sensitivities in the error test seemed to make sense. We need accurate sensitivities for HMC to work, so it seems nice that when the user specifies tolerances, they’re controlling the sensitivities directly too.

Given it’s a no-silver-bullet thing though, I like having control over the sensitivity accuracy. Can you make up your performance loss by coarsening your tolerances now?

Like Rok said, there is a pretty sizable performance regression using the old ODE interface now that it’s just a wrapper around the new ODE interface (some costly copying), but if you’re testing exactly the same code and the same Math branch with CVodeSetSensErrCon on and off, then that’ll be the difference.

It has nothing to do with the interface: the old interface is a wrapper around the new implementation, and I’m simply toggling one line on/off, not comparing old v3.2.0 against new v3.3.0.

I wouldn’t call it ok if our only stiff solver loses horsepower, much as I wouldn’t cut my arms off to get through a narrow entrance. Have any other routes been explored?

FWIW, here’s a full model based on the above ODE and the new solver interface. The model is adapted from my StanCon UK tutorial https://github.com/metrumresearchgroup/torsten_stancon_2019_tutorial/tree/master/RScripts/model/chemical_reactions

functions{
  vector reaction(real t, vector x, real[] p){
    vector[3] dxdt;
    real p1 = p[1];
    real p2 = p[2];
    real p3 = p[3];
    dxdt[1] = -p1*x[1] + p2*x[2]*x[3];
    dxdt[2] =  p1*x[1] - p2*x[2]*x[3] - p3*(x[2])^2;
    dxdt[3] =  p3*(x[2])^2;
    return dxdt;
  }
}

data {
  int<lower=1> nsub;
  int<lower=1> len[nsub];  
  int<lower=1> ntot;  
  real ts[ntot];
  real obs[ntot];
}

transformed data {
  int i1[nsub];
  int i2[nsub];
  real t0 = 0.0;
  real xr[0];
  int xi[0];
  i1[1] = 1;
  i2[1] = len[1];
  for (i in 2:nsub) {
    i1[i] = i2[i-1] + 1;
    i2[i] = i1[i] + len[i] - 1;
  }
}

parameters {
  /*  p1=0.04, p2=1e4, and p3=3e7 */
  real<lower = 0> y0_mu;
  real<lower = 0> y0_1[nsub];
  real<lower = 0> sigma;
}

transformed parameters {
  vector[3] y0;
  vector[3] x[ntot];
  vector[ntot] x3;
  real theta[3] = {0.04, 1.0e4, 3.0e7};
  for (i in 1:nsub) {
    y0[1] = y0_1[i];
    y0[2] = 0.0;
    y0[3] = 0.0;
    x[i1[i]:i2[i]] = ode_bdf_tol(reaction, y0, t0, ts[i1[i]:i2[i]], 1.e-8, 1.e-10, 10000, theta);
  }
  for (i in 1:ntot) {
    x3[i] = x[i][3];
  }
}

model {
  y0_mu ~ lognormal(log(2.0), 0.5);
  for (i in 1:nsub) {
    y0_1[i] ~ lognormal(y0_mu, 0.5);    
  }
  sigma ~ cauchy(0, 1); 
  obs ~ lognormal(log(x3), sigma);
}

data.R

nsub <- 8
len <- c(5, 5, 5, 5, 5, 5, 5, 5)
ntot <- 40
ts <- c(4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03,
        4.000e-01, 4.000e+00, 4.000e+01, 4.000e+02, 4.000e+03)
obs <- c(1.48e-02, 9.50e-02, 2.70e-01, 5.80e-01, 8.29e-01,
         1.38e-02, 9.41e-02, 2.60e-01, 5.20e-01, 8.10e-01,
         1.35e-02, 9.44e-02, 2.90e-01, 5.40e-01, 8.10e-01,
         1.50e-02, 9.45e-02, 2.88e-01, 5.58e-01, 8.49e-01,
         1.60e-02, 9.46e-02, 2.81e-01, 5.53e-01, 8.16e-01,
         1.29e-02, 9.48e-02, 2.90e-01, 5.51e-01, 8.13e-01,
         1.40e-02, 9.50e-02, 2.54e-01, 5.49e-01, 8.13e-01,
         1.45e-02, 9.10e-02, 2.19e-01, 5.58e-01, 8.68e-01)

init.R

y0_mu <- 2.0
y0_1 <- c(2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0)
sigma <- 0.5

with CVodeSetSensErrCon clause in stan_math

 Elapsed Time: 322.134 seconds (Warm-up)
               309.048 seconds (Sampling)
               631.182 seconds (Total)

w/o CVodeSetSensErrCon clause in stan_math

 Elapsed Time: 225.487 seconds (Warm-up)
               220.647 seconds (Sampling)
               446.134 seconds (Total)

The performance drops ~40%.

No, see above. I can also dial down the absolute tolerance, but I don’t think it’s supposed to help.

So the solve is no faster with 1e-8/1e-10 than with 1e-10/1e-10? Can you try a few other tolerances and see if the behavior changes? If the decision is arbitrary and there is no obvious way this should go, I think we need to talk it out before we throw the switch back in the other direction.

If this is such an important switch that models are either working or not working based on where it is, we should probably consider exposing it at the API level, which seems really yucky to me, but maybe it has to be that way.

Also, thanks for posting the example code. I would normally not be so lazy and would run it a few times myself and do these benchmarks, but I’m a bit behind on other stuff for the release.

Sure, I can try them out later. Note that this model is by no means special, just a typical stiff ODE system on which BDF should prevail. I ran into the issue when using it to benchmark Torsten’s ODE solvers against Stan’s, and it leaves me with the impression that the impact of adding that error test has not been fully evaluated.


No, the error test has been evaluated. I was aware of the drop in performance, but I don’t recall it being that bad. As @bbbales2 suggests, it would be good to see this at different tolerance targets as well.

I would not mind flipping the behaviour for BDF back to what it was, but we can’t do that for Adams-Moulton as it stands. If there is another way to get Adams-Moulton stable (without requiring super tight tolerance targets), then we can also switch it there. It would be preferable to have the same behaviour for both solvers, but I don’t quite see how that can be achieved.

Exposing this switch to users… I hope we don’t need to do that after all. It’s a nasty detail people should not have to deal with.

Is the drop documented somewhere I can take a look? I’m a bit concerned about the decision-making process here. If we celebrate 10-15% performance improvements on past features, a drop like this should never be considered ok.

You can dig up the PRs for this. The thing is that Adams-Moulton has always been broken and this choice made things more robust. Numerical stability trumps speed. I would expect the performance drop to be less severe at less tight tolerances - but maybe you can investigate that?

I would not give up speed easily, but in this case it was for greater robustness, which is ok. The story is that we worked on the variadic ODE interface and in the course of that benchmarked this thing thoroughly. This uncovered that Adams-Moulton just does not work reliably with the implemented defaults. As a result, we changed the error tolerances to include the sensitivities. This made things more robust for Adams-Moulton, which then actually worked. To keep things consistent, the BDF solver was switched as well.

I am not sure what the use is of digging that up - we should rather spend time defining good benchmarks for ODE problems. Our current benchmarks are somewhat limited.

I think an additional argument was that this way of including the sensitivities in the tolerance calculations was similar to how RK45 does it.

FWIW, with relaxed tolerances (rtol=1e-4, atol=1e-8) the drop is still ~40%. I’ll leave it here for revisiting, as I’m behind on some tasks.
with CVodeSetSensErrCon clause in stan_math

$ ./chem sample data file=chem.data.R init=chem.init.R random seed=987
....
 Elapsed Time: 119.652 seconds (Warm-up)
               113.899 seconds (Sampling)
               233.551 seconds (Total)

w/o CVodeSetSensErrCon clause in stan_math

 Elapsed Time: 69.675 seconds (Warm-up)
               69.854 seconds (Sampling)
               139.529 seconds (Total)

Right right, but the 1e-4/1e-8 solve is now faster than the 1e-10/1e-10 solve was previously.

Is the 1e-4/1e-8 solution usable? Was the attraction of 1e-10/1e-10 to begin with just that you wouldn’t need to worry about solution accuracy so much?