tl;dr It’s not possible to estimate that directly. All one can do is increase adapt_delta until one recovers the same step size as before, in which case the sampler behavior will be the same (up to expected MCMC variation). Even that, however, might not give the behavior that people have come to expect, due to confusion about what exactly one should expect in the first place.
High Fidelity
One of the conflicts here is exactly what to expect from an algorithm or, perhaps better, what guarantees our algorithms provide.
The HMC sampler in Stan makes only one guarantee, and it’s a soft one at that: if there are no diagnostic failures then the MCMC estimators of the expectation values of square integrable functions maybe follow a central limit theorem with standard error given by \sqrt{ \frac{ \text{Var}[f] }{ \text{ESS}[f] } }. For now let’s be positive, elevate that maybe to a will, and see what the consequences are.
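To make that standard error concrete, here is a minimal Python sketch. The function name is mine, and it assumes you already have an effective sample size estimate from somewhere (for example Stan’s summary output); nothing here is part of any Stan interface:

```python
import numpy as np

def mcmc_standard_error(f_draws, ess):
    """Standard error of the MCMC estimator of E[f].

    f_draws: array of f evaluated at each post-warmup draw.
    ess:     effective sample size for f, from your favorite estimator
             (e.g. Stan's summary output).
    """
    # sqrt( Var[f] / ESS[f] ), exactly the expression above.
    return np.sqrt(np.var(f_draws, ddof=1) / ess)
```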
Even under ideal circumstances the sampler does not provide bitwise reproducibility. Seemingly irrelevant or insignificant changes will alter the exact states generated at each iteration of the Markov chain, yielding completely different outputs. Still, the hope is that the MCMC estimators themselves remain reasonable.
Even under ideal circumstances the MCMC estimators follow a central limit theorem, which means the estimator +/- standard error intervals cover the true value only with some probability. Run enough chains and eventually some of those intervals will fail to capture the true values.
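A quick simulation makes the point. This is a toy example of my own, with independent draws standing in for an ideal chain so that the effective sample size is just the number of draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, n_draws, truth = 1000, 500, 0.0

# Idealized setting: independent draws, so ESS is just the number of draws.
covered = 0
for _ in range(n_reps):
    draws = rng.normal(truth, 1.0, size=n_draws)
    est = draws.mean()
    se = draws.std(ddof=1) / np.sqrt(n_draws)  # sqrt( Var[f] / ESS[f] )
    covered += (est - se <= truth <= est + se)

print(f"+/- 1 standard error coverage: {covered / n_reps:.2f}")  # about 0.68
```

Roughly a third of the perfectly healthy intervals miss the truth, exactly as the central limit theorem says they should.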
When the diagnostics fail? Well then all bets are off. The exact consequences of a model failure depend on the intimate details of the model and the mode of failure, and without being able to accurately estimate expectation values we don’t have any way to faithfully quantify what the consequences might be.
These points are all extremely important for both testing and setting user expectations. We can only test what the algorithms guarantee and we want users to expect only what the algorithms can guarantee. Let’s go deeper into that with regard to the step size adaptation and divergences.
You Better Watch Your Step
It is a great miracle in MCMC that for some algorithms we can construct (approximately) universal optimality criteria. In other words there is a function of some tuning parameters for which a specific value should yield nearly optimal performance. For Metropolis algorithms, or Metropolis-adjacent algorithms, these often take the form of expected acceptance probabilities.
So even if we don’t know what the optimal step size will be, we can find the step size that approximately optimizes performance by tuning until that optimality criterion achieves the desired value. The relationship between the optimality criterion and the step size? That is highly nonlinear and depends on the specific structure of the model; about the only thing we know is that it’s monotonic. So even under ideal circumstances there’s no way to predict what will happen to the step size when the adaptation targets a different value of the optimality criterion. In Stan parlance, even if you know what epsilon is needed to achieve adapt_delta = 0.8 you can’t say anything about what step size is needed to achieve adapt_delta = 0.9 other than that it will be smaller.
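For concreteness, here is a sketch of the dual averaging scheme from Hoffman and Gelman (2014) that Stan-style samplers use to drive the average acceptance statistic toward the target. The function name and streaming interface are mine; the constants are the paper’s defaults:

```python
import numpy as np

def adapt_step_size(accept_stats, delta=0.8, eps0=1.0,
                    gamma=0.05, t0=10.0, kappa=0.75):
    """Dual averaging: nudge the step size until the running average
    of the acceptance statistic matches the target delta.

    accept_stats: iterable yielding one acceptance statistic per
    warmup iteration (each in [0, 1]).
    """
    mu = np.log(10.0 * eps0)   # shrinkage point for the log step size
    log_eps_bar, h_bar = 0.0, 0.0
    for m, alpha in enumerate(accept_stats, start=1):
        # Running average of how far we are from the target.
        h_bar += (delta - alpha - h_bar) / (m + t0)
        log_eps = mu - np.sqrt(m) / gamma * h_bar
        # Polyak-style averaging of the log step size.
        eta = m ** (-kappa)
        log_eps_bar = eta * log_eps + (1.0 - eta) * log_eps_bar
    return np.exp(log_eps_bar)
```

Note that nothing here inverts the acceptance/step-size relationship analytically; the adaptation just feels its way toward the target, which is exactly why the adapted step size for a new target cannot be predicted in advance.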
One caveat is that these criteria presume the same ideal circumstances that we presume to get those central limit theorems. In other words, any adaptation algorithm that targets these criteria will be valid only under those ideal circumstances, and hence only if we don’t see any diagnostic failures. Lots of things go wrong when the sampling breaks down.
First and foremost the sampler is no longer (asymptotically) unbiased, so the definition of performance has to change, which means that the nominal target value, like adapt_delta(epsilon) = 0.8, will no longer be relevant. Secondly the exact functional dependence of the optimality criterion on the step size will change, so even if you could map out the relationship under ideal circumstances that relationship wouldn’t persist to non-ideal circumstances. And finally, even if none of those were a concern, we have to recognize that the optimality criterion, again adapt_delta in Stan, is an expectation value that we estimate. When the diagnostics start to fail we know we can’t trust our estimates, so even if the optimal value were still adapt_delta = 0.8 we wouldn’t be able to achieve it in practice because we wouldn’t have an accurate estimate of adapt_delta itself.
Phew. Again, once the diagnostics fail the behaviors can be…beyond belief.
(What’s So Funny 'Bout) Peace, Love, and Understanding
Let’s bring this all together and discuss the consequences for divergences and the adaptation procedure.
When you see divergences you know something weird is going on, and you have no way to quantify what the consequences of that weirdness will be on your MCMC estimation. As discussed above, once divergences arise the adaptation target itself changes meaning, the optimal value is no longer well-posed, and we can’t estimate the adaptation statistics all that well anyways.
At this point everything about the adaptation criteria changes interpretation; we are no longer trying to achieve a certain value of the acceptance statistic. At this point we just want to reduce the step size in a desperate attempt to maybe, sort of, give the sampler enough resolution to explore without diverging, and for that we just need to rely on the monotonic relationship between the optimality criterion and the step size, which fortunately is about the only thing that persists when divergences arise. Again, once divergences arise the meaning of adapt_delta changes, and the goal in setting epsilon changes with it.
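In practice that knob is turned through your interface’s adapt_delta argument. Here is a minimal sketch with CmdStanPy, where the model and data file names are placeholders of mine:

```python
from cmdstanpy import CmdStanModel

# Placeholder files; the point is the adapt_delta knob.
model = CmdStanModel(stan_file="model.stan")
fit = model.sample(data="data.json", adapt_delta=0.8)

divergent = fit.method_variables()["divergent__"]  # 0/1 per draw, per chain
print("divergences:", int(divergent.sum()))
print("adapted step sizes per chain:", fit.step_size)

# If divergences appear, raise the target to force a smaller step size.
# There is no way to predict how much smaller the step size will be.
if divergent.sum() > 0:
    fit = model.sample(data="data.json", adapt_delta=0.95)
```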
Will this help? Sometimes. Maybe.
By reducing the step size we make the numerical integration more accurate, which might bring the integrator into a region of stability where divergences no longer occur. Or it might let the sampler explore a pathological region even better and increase the number of divergences. Everything will depend on the exact structure of your model.
Now how can we tell what is happening and whether we can trust the results? Well, the only signal we have is the residual divergences, but divergences are an empirical measure: the sampler diverges only if it happens to explore near a pathological region. The smaller the pathological region the less likely we are to sample near it, and hence the less likely we are to see a divergence even if one is lurking there. Reducing the step size might shrink a pathological region, but at the same time it might make it harder to see.
Yes you might have fewer divergences for the default 4000 iterations with the same seed, but that’s no guarantee that you won’t see divergences if you run longer or use a different seed (it’s always fun to see a divergence pop up in a course exercise, but only on one person’s computer).
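One cheap way to probe that empirical sensitivity is to rerun with different seeds and watch the divergence counts bounce around; again a CmdStanPy sketch with placeholder file names:

```python
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")  # placeholder file

# Zero divergences under one seed is no guarantee for another seed.
for seed in (1, 2, 3, 4, 5):
    fit = model.sample(data="data.json", seed=seed, adapt_delta=0.9)
    n_div = int(fit.method_variables()["divergent__"].sum())
    print(f"seed {seed}: {n_div} divergences")
```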
Ultimately this is why the absence of divergences will never be a sufficient condition for identifying ideal circumstances for HMC estimation. Empirically they are very sensitive, but at some point you can no longer tell whether you’re not seeing any because the pathological regions are tiny or because the pathological regions don’t exist (and at the moment we don’t have good enough theory to say whether small enough pathological regions lead to negligible error).
Anyways, if you see divergences, increase adapt_delta a little bit to achieve a smaller epsilon, and the divergences quickly go away, then you’re probably okay, especially if you don’t need more effective samples than what you recovered in the fit. That general behavior will persist with the new adaptation, but it’s impossible to quantify how the numbers will change.
If instead you push adapt_delta up higher and keep getting a few divergences, so you go even higher, and then higher, and so on, well, in that case you should be more worried. Not only will the sampling become more expensive, it’s less likely that you’re actually fixing anything in a robust way.
Watching the Detectives
Let me try to summarize the state of things to put the discussion of the pull request into context.
The sampler in Stan makes only a few actual guarantees, and weak guarantees at that. We basically just have that

- If there are no diagnostic failures then you have some evidence that the MCMC CLT holds and the MCMC standard errors will have the expected statistical coverage.

- If you see divergences then all bets are off, but you might be able to fix things by decreasing the step size. The step size can be decreased by increasing the step size adaptation target above 0.8.

That’s it. In particular, how much the step size adaptation target has to be increased is not guaranteed to be any given number or even constant. The only contract we provide to the client (you, the user) is that adapt_delta and epsilon have some monotonic relationship, so you can shift one systematically by moving the other in the opposite direction.
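That contract is easy to see empirically: sweep the target and watch the adapted step sizes fall. One more CmdStanPy sketch with placeholder file names:

```python
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")  # placeholder file

# The adapted step size should decrease monotonically in adapt_delta,
# but by an amount that depends entirely on the model.
for delta in (0.8, 0.9, 0.95, 0.99):
    fit = model.sample(data="data.json", adapt_delta=delta)
    print(f"adapt_delta = {delta}: step sizes = {fit.step_size}")
```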
None of those guarantees change with this pull request. Under ideal circumstances the MCMC estimators and standard errors will still have the expected coverage, and if divergences appear then they might be moderated by decreasing epsilon, which can be done by increasing adapt_delta.
I know that these guarantees are not particularly comforting, but that’s what we have (and we should consider ourselves lucky to even have those!). At the same time I know that users are prone to building up empirical expectations; if we accommodate those expectations then we freeze the sampler in an arbitrary state. We have to be clear about what we guarantee and try to temper any inaccurate expectations that might arise.
Anyways, I hope that clears things up a bit.