No. I would see it as a divergent-transition kind of thing: it should not happen, but when it does, one should follow up.
One strategy, if one is very worried about convergence coming out differently, could be to warm up the sampler and then restart with the same step size, mass matrix, and initial values, but with a different ODE solver.
I disagree. Whenever you exhaust the max_num_steps, your inference may be severely compromised because you cannot explore part of your parameter space, irrespective of the curvature.
That region of parameter space may be unimportant or it may not be, but the correct metric for that cannot be how stressed the integrator is.
Are you sure, though, that it can compromise your inference? Because if max_num_steps is reached during sampling, sampling stops and you don’t get any draws from that chain (at least this was the case with the most recent cmdstan). This should be a warning sign not to draw conclusions from the results of the other chains, or possibly Rhat catches this problem.
If it occurs during warmup, though, the sampler doesn’t seem to stop. I would really like to understand the rules for recovering from the domain and runtime errors that stan_math throws, in order to answer whether a too-low max_num_steps can lead to biased inferences that the diagnostics don’t catch.
By compromised inference I meant a case where the results are wrong and the diagnostics don’t catch it. If they do catch it, you can always sample again with better settings, so you don’t end up making bad inferences.
I’m actually not sure whether this is what happens, or how this is handled in general. Still, the solver not finishing while we are in the typical set (and then what happens?) cannot be great.
It’s totally fine for MCMC samplers to reject draws. It’s still not good to do it, of course.
No, that’s not true to my knowledge. The attempted draw gets rejected, the sampler “recycles” the last draw, and then it tries again. That’s in principle safe to do for MCMC samplers. It’s just not a good thing to happen.
… and in all honesty: when I model an ODE problem which takes a long time to compute and there are occasional hiccups like these, then I would try to find the root cause, but I would certainly use the model fit. I would not use it for final decision making for the purpose the model was built for, ok… but saying that such a fit cannot be used for anything would be unreasonable in my eyes.
It’s perfectly fine if the ODE solver throws a max-steps error during warmup: the sampler is exploring and we allow hiccups. On the other hand, if the solver does this during sampling, the final draws we get are biased because we fail to see outcomes from a certain part of the parameter space.
At a lower level, my understanding is that the max-steps error (eventually) surfaces as an exception in the sampler and is interpreted as a divergence there, which leads to the current trajectory expansion being abandoned and a sample being drawn from the previous expansion. In this context the max-steps error behaves exactly like a reject statement.
Under certain conditions rejecting a numerically imprecise state can maintain the target distribution, and the asymptotic consistency of Markov chain Monte Carlo estimators, but it may easily frustrate the preasymptotic behavior. At the same time, under other conditions the rejections can compromise even that asymptotic consistency. The question isn’t so much whether there’s a bias but rather how significant the bias is.
The problem is that the numerically unstable region can hide other important regions of the target distribution. Imagine for example two regions of high probability connected by a narrow bridge that causes numerical instabilities – when all attempts to pass through the bridge are rejected due to numerical instability issues then the sampler will only ever see one of those regions and be unable to fully quantify the target distribution. If all of the Markov chains end up on the same side then no diagnostics will ever indicate the missing probability on the other side.
A lot of the intuition that has been discussed here is based on the assumption that the numerically unstable region is near the boundary of relevant behavior, in which case the inability to sample the unstable region only mildly biases the exploration. This is often true, but it’s not universally true.
A few heuristics that are more reliable than others:
If numerical problems arise in the warmup phase but never in the main sampling phase, then the unstable regions are likely far outside of the typical set and can safely be ignored.
If multiple chains are run from diffuse initial conditions, are run long enough, see numerical problems in the same neighborhood, and that neighborhood is in the tail of the other samples, then the unstable region is probably on the edge of the typical set, which somewhat limits its potential consequences.
But these are only heuristics and they offer no guarantees.
Maybe there should be a better way for ODE models to reject parameters.
Take that orbital mechanics example. I’m not sure, but I think max_num_steps usually gets exhausted whenever two planets are on a collision course, or otherwise orbiting each other very closely.
Aside from a different parametrization which might exclude such cases, one solution which would not rely on hitting max_num_steps would be to check current trajectories/states for such a thing. We know that the planets can not actually collide, as we would then not have our future observations, i.e. the probability of this happening is literally zero.
However, currently we cannot check these things (easily). In principle we could check this in the specification of the ODE vector field (dydt), but the values that get plugged in there are not necessarily the final values, as the solvers may themselves “reject” some trajectories due to precision concerns.
What would help would be to be able to pass another UDF which only gets applied to “actual” trajectories (accepted/provided by the solver) and which weeds out problematic parameters/trajectories.
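For concreteness, here is a rough sketch of what one can already do by checking only the states the solver returns at the requested output times; everything in it (the toy two-body right-hand side two_body, the state layout, the tolerances, the 1e-3 collision threshold, and the data names) is hypothetical and purely illustrative. Crucially, it can only inspect the solution at the output times ts, not between them, which is exactly the gap described above:

```stan
functions {
  // hypothetical toy system: two bodies in a plane,
  // y = (x1, y1, x2, y2, vx1, vy1, vx2, vy2)
  vector two_body(real t, vector y, real GM) {
    vector[8] dydt;
    vector[2] r = y[3:4] - y[1:2];        // separation vector
    real d3 = pow(sqrt(dot_self(r)), 3);  // |r|^3
    dydt[1:4] = y[5:8];                   // positions change with velocities
    dydt[5:6] = GM * r / d3;              // body 1 pulled toward body 2
    dydt[7:8] = -GM * r / d3;             // body 2 pulled toward body 1
    return dydt;
  }
}
data {
  int<lower=1> T;
  array[T] real ts;            // strictly increasing output times, all > 0
  vector[8] y0;                // initial state at t = 0
  array[T] vector[2] obs;      // noisy observed positions of body 2
}
parameters {
  real<lower=0> GM;
  real<lower=0> sigma;
}
model {
  // rel_tol, abs_tol, max_num_steps are illustrative values
  array[T] vector[8] y = ode_rk45_tol(two_body, y0, 0, ts,
                                      1e-8, 1e-8, 10000, GM);
  // domain-knowledge check, but only at the requested output times:
  for (t in 1:T) {
    if (distance(y[t][1:2], y[t][3:4]) < 1e-3)
      reject("bodies (nearly) collide at output time ", ts[t]);
  }
  GM ~ lognormal(0, 1);
  sigma ~ normal(0, 1);
  for (t in 1:T)
    obs[t] ~ normal(y[t][3:4], sigma);
}
```

Note that this reuses the reject statement, so it inherits the discontinuity issue discussed further down the thread.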
One has to be careful with concepts like “reject” – in particular we have to differentiate between what is the model and what is the computation.
Right now many dynamical systems specified via ODEs contain extreme behavior that manifests at certain parameter values. This stresses the ODE solvers when they’re not configured appropriately, leading to large numerical errors in the final states and their sensitivities, which then induce problems with the computation. In other words, the computational problems are a manifestation of extreme behavior in the model.
The computational solution is to use, well, better computation. By putting more work into each ODE solve we can better resolve the extreme dynamics and accurately quantify their contributions to the posterior.
The modeling solution is to remove the extreme behavior from the model. The most direct approach is to use more informative prior models that suppress the extreme parameter configurations; this has the critical benefit of yielding a smooth target distribution that facilitates accurate computation of the resulting model.
Jury-rigging the “reject” function based on observed numerical solver behavior, on the other hand, only implicitly changes the model. Because the rejection is based on solver output, it can account only for extreme behavior that manifests as particular numerical behavior; for example, if some collision events integrated fine but some didn’t, then the reject approach wouldn’t be removing all collision events, just some particular collision events. This selection can have unintended effects on the final inferences. On top of that the rejection introduces a discontinuity – the target density is non-zero on one side of the reject condition and zero on the other – which complicates the computation.
Ultimately removing extreme behavior can be extremely useful, but one has to do it in a principled way lest the final inferences be compromised. Almost always this involves careful prior modeling motivated by numerical pathologies instead of trying to define implicit models based on those numerical pathologies.
Yes, hence my proposal to provide users the opportunity to reject samples based on the behavior of the model, not implicitly based on the behavior of the solver.
If applicable/possible/reasonable, doing this via “better” priors may be the best way, but sometimes it may be necessary/desirable not to outright exclude parameter values when we do not know a priori whether they will lead to problems down the road.
Rejecting samples for any reason implicitly excludes certain parameter values. Because the evaluation of the model for fixed parameter values is deterministic, if any behavior induces a rejection it will always induce a rejection for the same parameter configuration.
Not sure what we are disagreeing on. Do we agree on the following, which is my only point?
Ranked by “badness of reason for rejecting a proposal” we get:
Divergence < Domain knowledge says state of ODE is ridiculous (e.g. collision event) <<<< Impatience (solver took too long)
Example: Say you are an astronomer in the 19th century, and something’s weird about the orbit of the most recently discovered planet. You suspect there is another planet somewhere, but all you know is that it is in a stable orbit around the sun. What do you do with proposals that would imply a collision with one of the other planets?
The issue here is not a question of how good the reason for rejection is but rather what criteria for rejection are actually valid.
If the numerical ODE solver is taking too long or otherwise returning inaccurate results then you are not getting a faithful evaluation of your posterior density function and hence no algorithm will give a faithful quantification of your posterior. Similarly if the gradients of the posterior density are changing too quickly then the symplectic integrator used by Stan might become unstable, in which case the Hamiltonian Markov chains will not return faithful quantifications. In either case you can’t determine whether any empirically observed behaviors are actually supported by the posterior or are just numerical artifacts.
In order to determine whether your posterior contains undesired behavior because of an overly diffuse prior, and hence what kind of modifications you might need to make to your prior either explicitly (changing the prior model) or implicitly (rejecting based on functions of the parameters), you need an accurate quantification of your posterior distribution. Even then narrowing the prior doesn’t mean that Markov chains won’t encounter pathological behavior while trying to find the posterior typical set during warmup.
If you’re trying to discover Neptune by fitting only stable orbits then the best approach is to build a model with only stable orbits, i.e. elliptical orbits. Then any numerical problems that arise are purely due to extreme stable orbits and nothing else. Of course this might have unintended side effects (limiting how close the bodies can approach, excluding resonant behavior, etc.).
The second best approach would be to build a prior that concentrates around stable orbits but does not entirely exclude unstable orbits. In this case numerical problems might be due to extreme unstable orbital configurations in the typical set, or even those found while searching for the typical set, or they might be due to extreme stable orbital configurations. The more complex the model the more ways it can break.
Finally the most ineffective approach is to let everything in and pump a ton of carbon into your numerical solver to try to resolve every stable and unstable orbit that the sampler might encounter, either as a configuration that’s consistent with the data or a configuration that is passed while finding those consistent configurations.
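As a purely hypothetical sketch of the difference between the first two options (the orbital-element parameters a, e, omega, M0 and the beta/lognormal prior choices are all invented for illustration): constraining the eccentricity below 1 hard-codes “elliptical orbits only” (though it only enforces a bound two-body orbit, not long-term stability of the full system), while a prior concentrating on small eccentricities is the softer second option.

```stan
parameters {
  real<lower=0> a;                      // semi-major axis (AU)
  real<lower=0, upper=1> e;             // eccentricity: e < 1 => elliptical (bound) orbit only
  real<lower=0, upper=2 * pi()> omega;  // argument of perihelion
  real<lower=0, upper=2 * pi()> M0;     // mean anomaly at a reference epoch
}
model {
  // Option 1 is the hard constraint e < 1 above.
  // Option 2: additionally concentrate the prior on nearly circular
  // orbits rather than treating everything up to e = 1 as equally plausible:
  e ~ beta(2, 10);
  a ~ lognormal(log(30), 0.5);          // centered near 30 AU, purely illustrative
}
```

Roughly speaking, the hard bound is what keeps the solver away from escape trajectories entirely, while the beta prior is the “concentrates around stable orbits but does not entirely exclude” version; the residual mass near e = 1 is where the remaining solver headaches would come from.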
Setting max_num_steps=1e12 results in an integer overflow and:
Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: ode_rk45_tol: max_num_steps is -727379968, but must be positive!
Edit: realized that this is because Stan can only work with 32-bit integers; 2^31 - 1 is the maximum value.
Edit: I wonder why max_num_steps=1e11 works, then? I mean, 10^11 > 2^31 - 1, right?
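Edit: my guess (not verified) is that the value silently wraps modulo 2^32 before the positivity check: 10^12 mod 2^32 = 3,567,587,328, which read as a signed 32-bit integer is -727,379,968 (exactly the number in the error above), while 10^11 mod 2^32 = 1,215,752,192, which stays positive and below 2^31 - 1, so it passes the check, although the solver then gets a cap of about 1.2e9 steps rather than the 1e11 that was requested.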