As I understand things, numerical imprecision leads to problems when integrating the Hamiltonian flow. This is exactly why a Metropolis-Hastings step is applied to the generated proposals; the MH step is there precisely to get rid of the problems implied by numerical imprecision. Apparently this is not enough to get rid of the bias, as Mike has now repeatedly reported.
As discussed during the Stan meeting, this is a very hard problem to solve rigorously. However, I am wondering if there are some case studies around that would shed some light on this? My main motivation is to understand whether the bias is practically relevant or not. This probably cannot be answered in general, but having strategies to detect a real bias would be great.
If we could collect some ideas on this, it would help us all a lot to get some clarity on the matter.
There’s no MH step in NUTS. There’s an MH step in static HMC, but that’s not what we’re usually doing. The original NUTS paper used slice sampling to draw a state from the Hamiltonian trajectory (biased toward the last doubling of steps); the way it works now is with a categorical draw from the path, with probabilities proportional to density (again with a bias toward the last doubling).
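For concreteness, here’s a minimal sketch of that categorical draw (Python, with illustrative names; the real implementation samples progressively along the trajectory and biases toward the last doubling, which this omits):

```python
import numpy as np

def draw_trajectory_state(energies, rng=None):
    """Pick an index along the trajectory with probability
    proportional to the density exp(-H), stabilized against
    overflow by subtracting the minimum energy."""
    if rng is None:
        rng = np.random.default_rng()
    w = np.exp(-(np.asarray(energies) - np.min(energies)))
    return rng.choice(len(w), p=w / np.sum(w))
```

Note that if the energies were all equal (a perfect simulation), the draw would be uniform over the whole trajectory.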
If the Hamiltonian were simulated perfectly, MH would always accept for static HMC and we’d always choose a new state from the second half of a trajectory in NUTS.
What happens with the leapfrog integrator, absent divergences, is that the computed Hamiltonian wobbles up and down around the true value (you can see that plotted, for example, in Radford Neal’s chapter in the MCMC handbook or in some of the books and papers on symplectic integrators).
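To see the wobble concretely, here is a small sketch (assuming a one-dimensional standard-normal target with unit mass, so H(q, p) = (q² + p²)/2) that runs the leapfrog integrator and prints the Hamiltonian along the trajectory; it oscillates around the starting value without drifting away:

```python
import numpy as np

def grad_U(q):                    # potential U(q) = q^2 / 2 (standard normal)
    return q

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

q, p, eps = 1.0, 1.0, 0.3
H0 = hamiltonian(q, p)
for step in range(20):
    p -= 0.5 * eps * grad_U(q)    # half step in momentum
    q += eps * p                  # full step in position
    p -= 0.5 * eps * grad_U(q)    # half step in momentum
    print(f"step {step:2d}: H = {hamiltonian(q, p):.6f} (start: {H0:.6f})")
```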
When we hit a divergence, we spin out and never come back.
Now technically, doing what we do with NUTS would still work in theory—if we ran the thing forever, we’d get there. What happens in practice is that we wind up with a random walk that has a very very hard time getting into the neck of a funnel, for example. And when it does, it stays there a very long time.
So it’s not that there’s asymptotic bias so much as bias in any sample drawn in the time we have.
This is the same problem for Gibbs and Metropolis-Hastings itself. Sure, they converge asymptotically. But we don’t have asymptotically long to wait! (I think that’d make a good blog title, by the way.)
With the numerical imprecision we see, we don’t get divergences, just really low acceptance rates, so the tuning algorithm runs the step size into the ground. I have some survival models that demonstrate this, and I agree it might be handy to use them to characterize this issue. We should at least be able to give people some pointers on what to do when they run into this.
Let me clear up some points that continue to be confused.
1. In MCMC/HMC there is a sampling/proposal step that uses the log target density to determine the next point in the chain. This ensures that the Markov chain yields accurate estimates.
2. If we cannot evaluate the log target density to high precision, then this sampling/proposal step does not ensure accurate estimates.
3. In HMC we want to accurately simulate the Hamiltonian flow, which allows us to sample/accept states very far away from the initial state. If the simulation of the flow is inaccurate, however, then the acceptance probabilities plummet.
4. The accuracy of the simulation of the Hamiltonian flow can suffer either because the symplectic integrator is poorly tuned or because the gradients are not being evaluated correctly. In Stan’s adaptation routine we assume that the gradients are exact, so only the tuning of the integrator can lead to poor simulation of the Hamiltonian flow. We then use the accuracy of the simulation as feedback to tune the integrator, i.e. we use the average Metropolis acceptance probability to tune the integrator step size.
5. If the gradients are incorrect, however, then this adaptation routine will fail. The adaptation will keep dropping the step size, as that’s the only thing it assumes can improve the accuracy of the integrator.
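A small numerical sketch of why shrinking the step size cannot compensate for wrong gradients (assuming a one-dimensional standard-normal target and a hypothetical 20% multiplicative gradient error, exaggerated for visibility): over a trajectory of fixed length, the energy error with the exact gradient vanishes as the step size shrinks, while with the biased gradient it levels off at a nonzero value, so the adaptation keeps lowering the step size to no avail.

```python
import numpy as np

def leapfrog(q, p, eps, n_steps, grad_U):
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(q)
        q += eps * p
        p -= 0.5 * eps * grad_U(q)
    return q, p

H = lambda q, p: 0.5 * q**2 + 0.5 * p**2  # standard normal, unit mass
exact_grad = lambda q: q                  # true dU/dq
biased_grad = lambda q: 1.2 * q           # hypothetical 20% gradient error

T = 3.0                                   # fixed trajectory length
for eps in (0.2, 0.1, 0.05, 0.025):
    n = int(T / eps)
    for name, g in (("exact ", exact_grad), ("biased", biased_grad)):
        qT, pT = leapfrog(1.0, 1.0, eps, n, g)
        dH = H(qT, pT) - H(1.0, 1.0)
        print(f"eps = {eps:<6} {name}: |dH| = {abs(dH):.6f}")
```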
The technical details are important here as they separate out the various cases under consideration. In @sakrejda’s case, we know that the log target density calculation is sufficiently precise but the gradients are questionable. That puts us into (5). In @wds15’s case, however, inaccurate ODE solving can affect both the log target density and the gradients, so we run into (2) + (5) and the interpretation of what is going on becomes far more complicated.
Thanks Bob… this is a great text to put on the blog, indeed.
I take it that the MH step done for HMC corrects for the wobbling around the total energy value. I need to read the papers in more detail (once again), but MH unfortunately does not correct for imprecise numerics.
As far as I understand, you are touching on the concept of geometric ergodicity, which is stronger than plain ergodicity. The geometric bit is what ensures we get bias-free estimates from finite MCMC samples, which is what we need. Now, what counts as finite remains to be shown, as does the order of the bias when geometric ergodicity does not hold (as a function of “finiteness”, or rather the number of samples).
This raises the question of diagnostics for the properties of a finite chain. Would it make sense to define running-window quantities, like a running-window Rhat, mean, or whatever? As the window size grows large we end up with the usual quantities, but on the way there we could see directly whether we have trouble with finiteness, and, best of all, hopefully also to what extent.
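As a concrete (and admittedly naive) sketch of that running-window idea, assuming draws stored as an array of shape (n_draws, n_chains), illustrative helper names, and the classic between/within-variance Rhat:

```python
import numpy as np

def rhat(draws):
    """Classic potential scale reduction factor.
    draws: array of shape (n_draws, n_chains)."""
    n = draws.shape[0]
    chain_means = draws.mean(axis=0)
    W = draws.var(axis=0, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)        # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

def running_rhat(draws, window=500, stride=100):
    """Rhat on a window sliding along the chains."""
    n = draws.shape[0]
    return [(start, rhat(draws[start:start + window]))
            for start in range(0, n - window + 1, stride)]

# toy usage: four well-mixed chains should stay near Rhat = 1
rng = np.random.default_rng(0)
chains = rng.normal(size=(2000, 4))
for start, r in running_rhat(chains):
    print(f"window [{start}, {start + 500}): Rhat = {r:.3f}")
```

If some window’s Rhat drifts well above 1 while the full-chain value looks fine, that would flag exactly the kind of finite-sample trouble described above.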
Just some thoughts. My apologies if they are naive.
Would it make sense to run the warmup with high accuracy of the ODE integrator and then relax the accuracy? Then we could be sure of getting the tuning right during warmup, and then kick off a few chains from different seeds and sample much faster. This procedure would be doable once the mass matrix can be saved and reused.
Regarding (3): How come the acceptance probabilities plummet whenever the flow is inaccurate? Do we run into random-walk behavior, and is that how things get nasty?
Would you consider it worthwhile for the ODEs to calculate the states which define the log-prob with higher accuracy than the gradients themselves? We could do that with the CVODES integrator.
As we accumulate many terms going into the log-prob, I wonder how that plays out with the inaccurate ODE solutions. So one needs to look at how such accumulated quantities behave with respect to error propagation.
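On the error-propagation question, here is a rough sketch of how one could measure it directly (assuming a toy exponential-decay ODE solved with scipy’s solve_ivp and a normal log-likelihood accumulated over 200 observation times; all names are illustrative): compare the accumulated log-prob at loose versus very tight solver tolerances.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import norm

# toy ODE: dy/dt = -k * y, observed with noise at many time points
k, y0, sigma = 0.5, 10.0, 0.1
t_obs = np.linspace(0.1, 10.0, 200)
rng = np.random.default_rng(1)
y_obs = y0 * np.exp(-k * t_obs) + rng.normal(0.0, sigma, t_obs.size)

def log_prob(rtol, atol):
    sol = solve_ivp(lambda t, y: -k * y, (0.0, 10.0), [y0],
                    t_eval=t_obs, rtol=rtol, atol=atol)
    # accumulate 200 log-likelihood terms based on the ODE solution
    return norm.logpdf(y_obs, loc=sol.y[0], scale=sigma).sum()

lp_ref = log_prob(rtol=1e-10, atol=1e-10)  # high-accuracy reference
for tol in (1e-4, 1e-6, 1e-8):
    err = log_prob(tol, tol) - lp_ref
    print(f"rtol = atol = {tol:.0e}: log-prob error = {err:+.6f}")
```

The same comparison run at a few points of the posterior would show whether the accumulated error is large relative to the energy scale the sampler cares about.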
Regarding the acceptance probabilities: the ratio used to calculate the acceptance probability is the ratio in energy from start to finish (see the top of pg. 4 and the bottom of pg. 16). If the calculation is inaccurate you sometimes get very low acceptance rates, and the tuning adjusts by lowering the step size in both cases. In my model you just get much noisier acceptance rates, but they’re obviously on [0, 1] and generally > 0.6, so it’s the very low ones that bring it down.
Krzysztof
If the Hamiltonian simulation were perfect for static HMC, the MH accept probability would be 1. When it’s imperfect, we move away from that. When the Hamiltonian simulation is really bad, we spin off into a tail and the MH accept probability goes to zero.
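For reference, the static-HMC accept probability for a trajectory from (q, p) to (q*, p*) is

$$ \alpha = \min\left(1,\ \exp\bigl(H(q, p) - H(q^*, p^*)\bigr)\right) $$

so exact energy conservation gives \alpha = 1, while a trajectory that spins off to large H(q^*, p^*) drives \alpha toward zero.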
Having said that, we do NUTS, where we’re not quite doing an accept step, but we always have the initial state as a candidate, and it will dominate in the same way so that no other state is acceptable.
Everything we’re doing is a tradeoff between accuracy and time. We want a large step size for speed and we want a small step size for accuracy.