ODEs: Cost of checking

Hi!

Triggered by a recent report, we are considering adding a check_finite call applied to the output of every ODE RHS call. I was skeptical of this, as it will probably add a lot of overhead since the ODE RHS gets called many times. As an example benchmark to test the cost of this extra checking, I took the hornberg ODE example, which is very stiff and has 8 states. I ran this case a number of times with the bdf and RK45 integrators under different settings:

  1. 3 replications per case
  2. running multiple instances of the same test case in parallel (1, 4, and 8)
  3. with and without checking (the difference between the two ODE RHS variants is a single call to stan::math::check_finite; see the sketch after this list)
  4. the problem was run without any sensitivities so that AD does not obfuscate the result
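To make the comparison concrete, here is a minimal sketch of what the checked RHS variant looks like in stan-math functor style (this is not the code on my branch, and the RHS body is a trivial stand-in rather than the hornberg system); the unchecked variant is identical except for the check_finite line:

#include <stan/math.hpp>
#include <cstddef>
#include <ostream>
#include <vector>

struct decay_rhs_checked {
  template <typename T0, typename T1, typename T2>
  std::vector<typename stan::return_type<T1, T2>::type>
  operator()(const T0& t, const std::vector<T1>& y,
             const std::vector<T2>& theta, const std::vector<double>& x_r,
             const std::vector<int>& x_i, std::ostream* msgs) const {
    std::vector<typename stan::return_type<T1, T2>::type> dydt(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
      dydt[i] = -theta[0] * y[i];  // stand-in system, not the hornberg ODE
    // the only difference to the unchecked variant is this extra per-call check
    stan::math::check_finite("decay_rhs_checked", "dydt", dydt);
    return dydt;
  }
};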

I did point 2 to get an idea of how concurrent chains interact on my 4-core laptop, which has hyperthreading.

Attached is a graphical result of this run. In numbers, I get an average 7% performance degradation with 4 concurrent runs and ~6% with 8 concurrent runs. For a single run I don’t get a measurable difference; I should increase the sample size there, but I would be surprised if the difference turned out to be noticeable.

All in all, I would not like to slow down the ODE code for all users by up to 7% just for checking over and over (checking the initial state, though, is a good compromise, I find). Moreover, I think this example shows that checking does cost performance, and we should consider having a non-checking Stan mode which can be enabled by experts.

Best,
Sebastian

perf_checking_hornberg.pdf (7.9 KB)
perf_checking_hornberg.csv (5.9 KB)

Can we move this topic to developers?

Where can I find the example?

Sure, you can move it.

I have stuff on a branch locally in stan-math. I can push it to GitHub if you like. Right now I am rerunning it so that I get the same number of data points per concurrency case.

Sebastian

Just pointing me to the example would be useful.

There’s also the human cost of not checking that you’re not accounting for. I’m hoping the actual runtime cost, the way I would implement it, is much less, but we’ll see.

The confusing error can also happen when derivatives go to NaN, not just when the size of the derivative is wrong. I agree we should check the size of the derivative once and not check it again, but I’m worried about terms going to NaN and the human cost of debugging that.

Please send over the ODE and any relevant scripts. Adding it on a branch is perfectly fine.

Hi!

Doing some more tests with single-CPU runs shows that there is also a 3-4% loss in speed, and using 24 concurrent processes leads to a 10% slowdown (checking vs. non-checking).

The code with the test case is here:

https://github.com/stan-dev/math/blob/ode-perf-test-hornberg/test/unit/math/rev/mat/functor/integrate_ode_perf_test.cpp

That file needs to be compiled once with runTests.py. After that you can call runtests.sh (located in the root of that branch) to kick off all the different cases. analysis.R creates the plots… this is not pretty code, but it does the job.

Sebastian

Where’s the actual .stan file?

There is no Stan file at all. I am benchmarking integrate_ode_bdf/rk45 directly, using random parameter draws generated from lognormals.
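Roughly, such a direct benchmark has the shape sketched below (this is only a sketch, not the test file on the branch; the RHS, time points, and the single lognormal draw are stand-ins):

#include <stan/math.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <cstddef>
#include <ostream>
#include <vector>

struct decay_rhs {
  template <typename T0, typename T1, typename T2>
  std::vector<typename stan::return_type<T1, T2>::type>
  operator()(const T0& t, const std::vector<T1>& y,
             const std::vector<T2>& theta, const std::vector<double>& x_r,
             const std::vector<int>& x_i, std::ostream* msgs) const {
    std::vector<typename stan::return_type<T1, T2>::type> dydt(y.size());
    for (std::size_t i = 0; i < y.size(); ++i)
      dydt[i] = -theta[0] * y[i];  // stand-in system, not the hornberg ODE
    return dydt;
  }
};

int main() {
  boost::random::mt19937 rng(1234);
  std::vector<double> y0(8, 1.0);  // 8 states, like the hornberg example
  std::vector<double> ts;
  for (int i = 1; i <= 100; ++i)
    ts.push_back(0.1 * i);
  // random parameter draw from a lognormal, as described above
  std::vector<double> theta(1, stan::math::lognormal_rng(0.0, 1.0, rng));
  std::vector<double> x_r;
  std::vector<int> x_i;
  // double-only inputs, so no sensitivities and no AD in the timing
  std::vector<std::vector<double> > ys = stan::math::integrate_ode_bdf(
      decay_rhs(), y0, 0.0, ts, theta, x_r, x_i, 0);
  return ys.size() == ts.size() ? 0 : 1;
}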

I’m trying to hunt down the realistic use case of a user writing a Stan ode function and not being able to debug.

I think a good compromise is to check the output of the ODE RHS at the initial state once, for size and NaN. This should catch most glitches at almost no cost.
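Roughly what I have in mind (a hypothetical sketch; no such helper exists in stan-math, the names are made up, and the rhs callable is assumed to take just (t, y) for brevity): evaluate the RHS once at the initial state, compare the returned size against the number of states, and run check_finite on the result before handing the system to the integrator.

#include <stan/math.hpp>
#include <sstream>
#include <stdexcept>
#include <vector>

template <typename F>
void check_initial_rhs(const F& rhs, double t0, const std::vector<double>& y0) {
  std::vector<double> dydt0 = rhs(t0, y0);  // one extra RHS evaluation, done once
  if (dydt0.size() != y0.size()) {
    std::stringstream msg;
    msg << "ODE RHS returned " << dydt0.size()
        << " states, but the initial state has " << y0.size();
    throw std::invalid_argument(msg.str());
  }
  // throws if any element of the initial derivative is NaN or infinite
  stan::math::check_finite("check_initial_rhs", "dydt0", dydt0);
}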

I completely disagree. If there are any numerical issues, this will cause the integrator to first hang and then spit out a nonsensical error message.

If there are ways to make Sundials and Boost spit out better error messages, that would probably suffice.

I’m creating simple stand-alone examples that demonstrate the behavior.

@wds15: good news. This is only a problem with integrate_ode_bdf. The Sundials solver behaves really poorly with bad inputs. I’m going to go back and update the Stan issue #548 with code.

I’m really sorry to have let this get into our code base as-is. We should have been more careful.

There are multiple ways to deal with the problem. One is to do what I suggest and check the Stan functor. The other would be to catch the errors propagating from Sundials and rethrow appropriate errors. Just FYI, one of the errors must not actually be a C++ exception because it terminates the program immediately. I believe this may cause R to crash.

If we get segfaults, it’ll crash Stan and whatever process it’s running in. This is what I want to avoid: I don’t want a way to run Stan that can crash its container. I’d be OK with a mode where the user signs off, before it runs, on the fact that it might crash their environment, but it’s not what I’d do.

Catching errors where they arise is the best thing to do for debugging. Otherwise, trapping errors in the ODE solver and throwing without good warning messages would also be acceptable, if not ideal.

Below is the example that stops Stan, depending on the run. It looks like it doesn’t actually seg fault. I just double-checked in RStan (good job, @bgoodri!).

I don’t know what sort of error is propagating. I thought we caught all exceptions, but maybe we’re only catching std::exception? Not sure.

Below what? I didn’t see a link to what crashes. We should have a catch (...) somewhere to catch everything. If this only happens with the BDF solver in CVODES, might the problem instead be a bad interaction of our callback function throwing exceptions from within C code?
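If that is what’s going on, the usual fix (just a sketch of the pattern, not what stan-math does today) is to never let a C++ exception unwind through CVODES’ C frames: catch it inside the callback, return a nonzero flag so the solver bails out, and rethrow the stored exception after CVode() returns.

#include <cvodes/cvodes.h>
#include <nvector/nvector_serial.h>
#include <exception>

struct ode_user_data {
  std::exception_ptr pending;  // exception captured inside the RHS callback
};

extern "C" int rhs_callback(realtype t, N_Vector y, N_Vector ydot,
                            void* user_data) {
  ode_user_data* data = static_cast<ode_user_data*>(user_data);
  try {
    // evaluate the Stan RHS functor here and write the result into ydot
    return 0;   // success
  } catch (...) {
    data->pending = std::current_exception();
    return -1;  // negative return: unrecoverable error, CVODES stops
  }
}

// after CVode(...) comes back with an error flag:
//   if (data.pending) std::rethrow_exception(data.pending);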

Whoops. This one stops depending on the run. I’m not sure what error is propagating from CVODES. I’ve run across this error while modeling, and it’s really hard to debug; there isn’t enough information around to help fix it. With the integrate_ode_rk45() function the error messages are much better.

functions {
  real[] sho(real t, real[] y, real[] theta, real[] x_r, int[] x_i) {
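    // bug on purpose: the RHS returns a zero-length array instead of 2 states,
    // which is what triggers the CVODES failures discussed below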
    real dydt[0];
    return dydt;
  }
}
transformed data {
  int<lower=1> T = 10;
  real t0 = 0;
  real ts[T] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10.};
  real y[T,2] = { {0.397095, -0.853654},
                  {-0.300526, -0.696679},
                  {-0.948297, 0.0545529},
                  {-0.57433, 0.634292},
                  {-0.00316177, 0.772725},
                  {0.675424, 0.15645},
                  {0.703672, -0.424857},
                  {-0.0918871, -0.487648},
                  {-0.469737, -0.298427},
                  {-0.429309, 0.16981} };
  real x[0];
  int x_int[0];
}
parameters {
  real y0[2];
  vector<lower=0>[2] sigma;
  real theta[1];
}
model {
  real y_hat[T,2];
  sigma ~ cauchy(0,2.5);
  theta ~ normal(0,1);
  y0 ~ normal(0,1);
  y_hat = integrate_ode_bdf(sho, y0, t0, ts, theta, x, x_int);
  for (t in 1:T)
    y[t] ~ normal(y_hat[t], sigma);
}

Hmm. That’s not a problem with the function itself throwing an exception, just that it’s not computing the right thing. So I wonder if there’s some higher-level place where we can do a more general catch operation around the thing we hand off to CVODES?

Sorry for being unclear. Exactly: there’s no problem with the function throwing an exception.

Some of the problems:

  • when the returned array is not the right size, it can either:
    • stop immediately and show this: Exception thrown at line 36: CVode failed with error flag -4
    • fail to initialize, with this error message printed 100 times, which is not helpful because it has nothing to do with the ODE being wrong:
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception thrown at line 33: cauchy_lpdf: Random variable[1] is nan, but must not be nan!
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
    • or run and report no error at all.
  • when an exception is thrown from within the ODE function, the message that is propagated masks the real error:
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception thrown at line 36: CVode failed with error flag -4
  • when the return value has NaN as elements, this is the error message:
Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Exception thrown at line 38: CVodeGetSens failed with error flag -25
If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

I think we can clean all of that up and make it much friendlier for users. It’s really tough to debug through ODEs as-is.
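For instance, one direction (a hypothetical sketch, not existing stan-math code; only the flag -4 quoted above is spelled out, everything else defers to the manual) would be to translate the raw CVODES return flag into a message that points at the likely cause:

#include <sstream>
#include <stdexcept>

// hypothetical helper: turn a negative CVODES return flag into an exception
// whose message tells the user where to look
inline void check_cvodes_flag(int flag, const char* call) {
  if (flag >= 0)
    return;  // non-negative flags indicate success
  std::stringstream msg;
  msg << call << " failed with error flag " << flag << ". ";
  if (flag == -4)  // CV_CONV_FAILURE in the CVODES manual
    msg << "The solver repeatedly failed to converge; in the examples above "
        << "this happens when the ODE RHS returns the wrong number of states "
        << "or throws.";
  else
    msg << "See the CVODES user guide for the meaning of this flag.";
  throw std::domain_error(msg.str());
}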

The CVODES error flags are all nicely described in their manual.

Yep, Michael is right that these codes are nicely documented in the CVODES manual.

This stuff is the compromise we had to make when turning to a C library. The easiest path for users is to get their ODE going with integrate_ode or integrate_ode_rk45; once that runs, they may switch to bdf.

And BTW, when attacking an ODE problem, one should always try the much cheaper non-stiff RK45 first anyway, before going after the problem with BDF.

There are two things that bother me about just passing on the CVODES error messages. First, there are errors in the construction of the ODE system that we don’t catch (like we do for rk45), and these manifest as a number of different error conditions in the Sundials solver. Behavior differs based on what’s in memory and it doesn’t fail all the time, which means that sometimes it keeps running and spits out nonsense. That’s bad.

The second case that bothers me, though a bit less, is that rejection messages from within the ODE function are not reported. Maybe we can’t get those out and that’s fine, but we should try.

Sorry, but running and reporting wrong results is bad.