Details of initial convergence towards typical set in Hamiltonian Monte Carlo

After reading the introductory HMC paper, I’m confused about the initial warm-up phase. What exactly is happening there?

I’m especially interested in the mathematical details behind

a) The chain rapidly approaching the typical set, and

b) warm-up stopping criteria, i.e., how the chain knows it has gotten close enough to the typical set and can start leapfrogging.

Thank you

There are a lot of mathematical details here.

a) You want to look at convergence of MCMC to the stationary (aka equilibrium) distribution. There are many ways to measure that. In cases where you can prove geometric ergodicity, you get exponential decay in total variation distance (i.e., the maximum error in calculating an event probability). Another approach is to look at convergence of the squared error of expectations. You can also do this empirically by monitoring the trace plot of the log density: when it asymptotes and then starts bouncing around, you're there. For a good high-level and not too mathematically intense intro, I really like this paper from Roberts and Rosenthal:
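To make the trace-plot heuristic concrete, here is a minimal toy sketch (my own example, not Stan's implementation): random-walk Metropolis on a standard normal target, initialized far out in the tail. The log-density trace climbs steeply at first and then plateaus and bounces around once the chain reaches the typical set.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_density(theta):
    return -0.5 * theta**2  # standard normal, up to an additive constant

theta = 20.0      # deliberately bad initialization, far out in the tail
lp_trace = []
for _ in range(2000):
    proposal = theta + rng.normal(scale=1.0)
    # Metropolis accept/reject: accept with probability min(1, ratio)
    if np.log(rng.uniform()) < log_density(proposal) - log_density(theta):
        theta = proposal
    lp_trace.append(log_density(theta))

# Early iterations climb steeply; later ones plateau near the typical set,
# where log density for a standard normal averages about -0.5.
print(lp_trace[0], np.mean(lp_trace[-500:]))
```

Plotting `lp_trace` shows exactly the pattern described above: a sharp initial ascent, then noisy fluctuation around a stable level.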

Section 3.4 covers geometric ergodicity. Since you asked for math, here are two papers about geometric ergodicity for HMC:

b) We don’t have warmup stopping criteria in Stan. We run HMC/NUTS, and hence leapfrog, during warmup. By default, Phase I of warmup runs 75 iterations (when there are 1000 warmup iterations), because we’ve found that works well empirically. Sometimes it takes a little longer. It’d be better to control this somehow, but we haven’t tried yet.
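Based on the defaults described in Stan's documentation (a 75-iteration initial fast phase, a 50-iteration terminal fast phase, and doubling "slow" adaptation windows starting at size 25), here's a sketch of how 1000 warmup iterations get partitioned. The function name and the exact overshoot handling are my own; consult Stan's source for the authoritative logic.

```python
def warmup_windows(num_warmup=1000, init_buffer=75, term_buffer=50, base_window=25):
    """Sketch of Stan's default warmup phasing.

    Phase I (init_buffer) adapts step size only; Phase II runs doubling
    "slow" windows that estimate the metric; Phase III (term_buffer)
    re-adapts the step size to the final metric.
    """
    slow = num_warmup - init_buffer - term_buffer
    windows, w, used = [], base_window, 0
    while used < slow:
        if used + 3 * w > slow:  # doubling again would overshoot: take the rest
            w = slow - used
        windows.append(w)
        used += w
        w *= 2
    return windows

print(warmup_windows())  # with Stan's defaults: [25, 50, 100, 200, 500]
```

So with the defaults, iterations 1–75 are Phase I, 76–950 are the slow windows, and 951–1000 are Phase III.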

Excellent, thank you!

Since you mentioned that NUTS is run in warm-up, may I ask you to correct an intuition of mine about that? It goes as follows:

Assuming a high-dimensional Gaussian target distribution, we’ll initialize in the tails with high probability, so the level sets are big and of low density. The NUTS sampler terminates once the trajectory endpoints start moving toward each other, so in the first couple of iterations it runs across half of those huge, low-density level sets.

This seems to be slow and therefore undesirable, but in reality I’m sure it isn’t. What am I missing?

You’re right that random initialization tends to initialize in the tails. This is a good thing. With multiple initializations, we can find modes more easily (not that we can sample from a multimodal density; it’s just good to diagnose those problems) and also diagnose convergence more easily.

  1. The acceptance probability for Metropolis is min(1, p(\theta^*) / p(\theta)), where \theta^* is the proposed state and \theta is the current state. The worst place to start is at the mode, where acceptance probabilities are lowest and a small step size is required. And while we don’t technically use Metropolis this way in NUTS, it’s still relevant because we choose the next state with probability proportional to its density.

  2. Momentum and NUTS. We’re doing Hamiltonian Monte Carlo until we hit a U-turn. That includes a momentum term. Coupled with acceptance being 100% when moving to a higher-density state, this gets us to the bulk of probability mass quickly. The level sets have a fixed Hamiltonian, i.e., a fixed density p(\theta, \rho) \propto e^{-H(\theta, \rho)}, where \theta is position and \rho is momentum.

  3. Adaptive step sizes. During warmup, we adapt the step size to target an 80% average acceptance rate across a trajectory (by default; you can change the target acceptance rate). So we can start with a large step size because of (1) and then tone it down as we get closer to where we’ll be sampling. NUTS does this automatically during warmup.
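To illustrate point (2), here’s a minimal leapfrog integrator for a one-dimensional standard-normal target (a simplifying assumption on my part; Stan works with multivariate targets and an adapted metric). Starting in the tail with momentum pointed at the mode, the trajectory stays on a level set of nearly constant Hamiltonian while the position moves quickly toward the bulk.

```python
def grad_neg_log_p(theta):
    return theta  # gradient of -log N(theta | 0, 1), up to a constant

def hamiltonian(theta, rho):
    # H(theta, rho) = -log p(theta) + rho^2 / 2, dropping constants
    return 0.5 * theta**2 + 0.5 * rho**2

def leapfrog(theta, rho, eps, n_steps):
    rho = rho - 0.5 * eps * grad_neg_log_p(theta)    # initial half step for momentum
    for _ in range(n_steps - 1):
        theta = theta + eps * rho                    # full step for position
        rho = rho - eps * grad_neg_log_p(theta)      # full step for momentum
    theta = theta + eps * rho
    rho = rho - 0.5 * eps * grad_neg_log_p(theta)    # final half step for momentum
    return theta, rho

theta0, rho0 = 10.0, -1.0   # start in the tail, momentum toward the mode
theta1, rho1 = leapfrog(theta0, rho0, eps=0.1, n_steps=50)

# The Hamiltonian is nearly conserved along the trajectory (the level set),
# even though the position has moved a long way toward the typical set.
print(hamiltonian(theta0, rho0), hamiltonian(theta1, rho1))
```

The small Hamiltonian error is what makes the Metropolis-style correction in HMC accept so often, and it’s exactly this momentum-carried motion along level sets that moves the chain out of the tails fast.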
