NUTS vs HMC

I understand HMC and its shortcomings that led to NUTS. However, I do not understand a few things about NUTS:

  1. Why does NUTS need to go both forward and backwards in time to figure out if there is a u-turn
  2. Is the momentum vector r sampled at each substep or do we have only one r for each NUTS ‘big’ step
  3. Stan’s version of NUTS chooses the final parameter value based on multinominal sampling with a bias towards later substeps. How does this relate to the target acceptance probability?
  4. Are the substeps that are considered in multinomial sampling only the 2nd, 4th, 8th etc. or is every substep considered?

I have read various sources on this, however, they all are either too technical or do not go in detail.

Thanks!

Just want to make sure that this is up to date. @betanalpha’s comment with regarding his 2016 paper indicates that for at least a few years now, Stan’s algorithm is different enough from NUTS (different criterion used for when to stop integrating) that it has been designated with a new name (XMC/XHMC).

2 Likes

For detailed balance. It’s explained in the original paper. It also has to throw out the last doubling if there’s an internal U-turn.

Not quite. I’m pretty sure we never used XHMC in Stan. Stan has progressed beyond the original form of NUTS used in Hoffman and Gelman, but the key component of the no-U-turn condition formulated for detailed balance remains. If we were in the ML world, we’d be on something like our fifth named sampler at this point.

No. There’s a very clear form of the algorithm in McKay’s info theory book, which is avalable free online with the publisher’s blessing.

I believe the acceptance probability is averaged over all the states compared to the initial state. The later doubling bias is super important here for directed exploration.

Every substep, but with bias toward the latter ones.

The NUTS algorithm’s written up in gory detail in the the paper. The main difference is that we now do multinomial sampling instead of slice sampling and compute acceptance rate a bit differently.

2 Likes

First let’s clear up the terminology.

“Hamiltonian Monte Carlo” is a technique for designing algorithms that utilize numerical Hamiltonian trajectories.

The original implementation of Hamiltonian Monte Carlo was “Hybrid Monte Carlo” which used static integration times (fixed number of steps) to construct a Metropolis proposal; this algorithm is also unfortunately commonly referred to as “Hamiltonian Monte Carlo” which only confused everything.

The “No-U-Turn Sampler” was the first practical implementation of dynamic Hamiltonian Monte Carlo which utilizes adaptive integration times, where the number of leapfrog steps can vary. In particular the No-U-Turn sampler is not a Metropolis algorithm. The implementation also included specific adaptation algorithms for the sampler configurations that remained fixed, namely the integrator step size and the inverse Euclidean metric that defines the kinetic energy.

“Exhaustive Hamiltonian Monte Carlo”, or “XHMC”, is another dynamic Hamiltonian Monte Carlo algorithm that utilizes a different criterion for determining how long the numerical Hamiltonian trajectories should grow at each iteration.

Currently Stan uses an advanced dynamic Hamiltonian Monte Carlo algorithm that is not the No-U-Turn sampler. Just about every element of the algorithm has been improved based on theoretical guidance supported by empirical studies.

Because the current algorithm has been in continuous development over the past few years I never branded it with a new name, which has unfortunately caused confusion due to the common presumption that Stan simply implements the original No-U-Turn sampler. Long story short there was a decision to rename the algorithm at one point but governmental issues prevented that from moving forwards.

Hamiltonian Monte Carlo methods need to correct correct for the error introduced by the numerical integrator which requires careful statistical methods.

Hybrid Monte Carlo, for example, uses a Metropolis correction to select between the initial point and the final point, but in order for that to work the numerical Hamiltonian trajectory has to be modified into an involution by flipping the momenta of the final state.

Dynamic Hamiltonian Monte Carlo methods typically use more sophisticated corrections. Really the best way to think about these corrections is via a two step sequence – sample an entire numerical trajectory from the set of numerical trajectories that include the initial point, and then sample a point from that sampled trajectory using the right weights. Randomly integrating backwards and forwards is just a clever way of sampling from the set of numerical trajectories that include the initial point.

Each iteration of dynamic Hamiltonian Monte Carlo first samples a momenta, then builds a trajectory, then samples a point from that trajectory. Within an iteration momenta are influenced only by the deterministic Hamiltonian dynamics.

Because Stan’s dynamic Hamiltonian Monte Carlo algorithm, which again is not the No-U-Turn sampler, is not a Metropolis algorithm there is no acceptance probability.

Stan does include an acceptance statistic which is a sort of proxy of an average Metropolis acceptance to guide the step size adaptation.

All of the points in the numerical Hamiltonian trajectory are considered.

The challenge is that the mathematics of Hamiltonian Monte Carlo are fundamentally difficult, so it’s nearly impossible to provide principled motivation for each step without going into a decent amount of detail. For example my review [1701.02434] A Conceptual Introduction to Hamiltonian Monte Carlo is my best attempt at going only into detail relevant for practical implementations, and even that has plenty of technical detail.

5 Likes

You can name it whatever you want in a paper.

I don’t remember any discussion of renaming, but I haven’t been paying a lot of attention to governance lately. The main issue it brings up in the interfaces is backward compatibility.

As far as I know, the U-turn criterion is the same as the original paper’s.

To add to your reading list, @betanalpha generalized the no-U-turn criterion in a neat way to the Riemannian setting in another paper. And his XHMC paper introduces an alternative with similar goals.

I’m not sure why you would say these sentences right next to each other. The current U-turn criterion is the generalized criterion, not the original one. Even in Euclidean space they coincide only in the limit of continuous trajectories. The discreteness of the leapfrog integrator creates a small difference.

1 Like

For the usual reason—I belived them to be true.

It sounds like you’re saying they’ve changed:

I missed the change and don’t know the difference. Where’s the doc on the new condition so I can compare?

2 Likes

I don’t know if there’s documentation besides @betanalpha’s original paper. For a quick visual comparison I’m quite proud of the pictures I made in this post.

1 Like

https://arxiv.org/abs/1304.1920 and Section A.4.2 of https://arxiv.org/abs/1701.02434.