Heuristics for the advantages of NUTS/HMC/Stan vs RWMH

I have been talking to a colleague about why Stan (NUTS) is better than, say, RWMH, and found that I don’t understand the issue coherently, so I thought I would ask here. Broadly, the question is how to think about the break-even point in the dimension of the problem where NUTS/HMC becomes more efficient than derivative-free methods.

While NUTS/Stan takes more time per sample, the ESS is better, so if things work well, the time per effective sample will be lower. To make things concrete, let the likelihood be ℓ:ℝᵈ→ℝ (for simplicity, it could be a multivariate normal).

Let f(d) denote the time per effective sample for a perfectly tuned RWMH, and g(d) the same for NUTS/HMC. From Neal’s chapter in the MCMC handbook (5.4.4.2), one can heuristically argue that f(d) ≈ F d² and g(d) ≈ G d^(5/4), for constants F and G. Is this a good way to think about this?

Can we say anything about F and G? E.g., suppose we neglect the implementation overhead for RWMH and NUTS: how does the per-step cost (in particular, the gradient evaluation that NUTS needs) scale with d, and does that change the asymptotics above?
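
Setting the two heuristic costs equal gives a break-even dimension: F d² = G d^(5/4) implies d* = (G/F)^(4/3). Here is a minimal sketch of that calculation with made-up constants (the 50× cost ratio below is purely illustrative, not a measured Stan overhead):

```python
# Sketch: break-even dimension under the heuristic costs
#   RWMH:      f(d) = F * d**2       (time per effective sample)
#   NUTS/HMC:  g(d) = G * d**(5/4)
# F and G below are made up for illustration only.

def breakeven_dimension(F: float, G: float) -> float:
    """Solve F * d**2 == G * d**1.25 for d, i.e. d* = (G / F)**(4 / 3)."""
    return (G / F) ** (4.0 / 3.0)

# Suppose one NUTS transition costs 50x one RWMH proposal (gradients, tree building).
F, G = 1.0, 50.0
print(f"break-even dimension ~ {breakeven_dimension(F, G):.0f}")
for d in (5, 20, 100, 500):
    print(d, "RWMH:", F * d**2, "NUTS:", G * d**1.25)
```

So anything that shrinks G (cheap gradients, good adaptation) pushes the break-even dimension down.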

The problem with this is that the HMC scaling doesn’t hold when the integration length is chosen dynamically (as in NUTS and the super-NUTS thing that Stan actually implements), where you’d expect them to do better (I think).

I think the standard answer is that random walks, by their very nature, spend a lot of time going backwards and forwards on their way to their target, which means that they can take a long time to explore a distribution, especially in high dimensions.

HMC, on the other hand, does its best to take giant steps and not stagger around drunkenly. When you actually look at the maths, it turns out that it doesn’t totally succeed. It takes giant steps within each “energy level” (i.e. each level set of the Hamiltonian), but it does a 1D random walk in energy. (See Betancourt’s conceptual introduction paper.)

So a different way of thinking of it is that:

  • Random walks get less efficient at exploring the space as the dimension increases
  • RWMH does a d-dimensional random walk.
  • NUTS does a 1-dimensional random walk, and something much much more efficient in the other 2d-1 dimensions (it’s 2d-1 because you need to add the momentum variables). [NB: Energy level is a weird function of all 2d variables, so it’s not like you’ve got one coordinate that does worse than the other ones.]
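
To see why random-walk exploration is the expensive part, here’s a crude numerical cartoon (not an MCMC implementation, just diffusive vs. coherent motion with an arbitrary step size): after n steps of size ε, a random walk typically ends up only about ε√n from where it started, whereas coherent motion covers about εn.

```python
# Cartoon of diffusive vs. coherent exploration (not an MCMC implementation):
# a random walk of n steps of size eps typically moves ~ eps * sqrt(n),
# while coherent (ballistic) motion moves ~ eps * n.
import numpy as np

rng = np.random.default_rng(0)
n, eps = 10_000, 0.1

diffusive = abs(rng.choice([-eps, eps], size=n).sum())
print(f"random-walk displacement after {n} steps : ~{diffusive:.1f}")
print(f"typical diffusive scale eps*sqrt(n)      : {eps * np.sqrt(n):.1f}")
print(f"coherent displacement eps*n              : {eps * n:.1f}")
```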

Edit to say: Lots of information about this is here


Just to finish out the answer, HMC moves a certain fixed distance along the curve. (Actually, it moves along the curve for a fixed amount of travel time.) NUTS (and its super-fancy-not-really-NUTS-anymore Stan variant) does its best to fling itself as far from the starting point as possible, so it really is trying to take quite large, non-wobbly steps in the non-random-walk directions. That’s why NUTS/Stan is usually much more efficient than vanilla HMC.

Derivatives are necessary to do anything in high dimensions; otherwise you’re just guessing and checking, which becomes exponentially hard as you increase the dimensionality. That leaves you with diffusive gradient methods, like Langevin methods, or coherent gradient methods like HMC. Coherent beats diffusive every time, hence you’re naturally led to HMC.
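
To make “diffusive vs. coherent gradient method” concrete, here is a minimal sketch on a d-dimensional standard normal (hand-picked step sizes, plain leapfrog, nothing like Stan’s adaptive implementation): MALA takes one gradient-guided, jittered step per proposal, while HMC strings many leapfrog steps into a single coherent trajectory before accepting or rejecting.

```python
# Minimal sketch: diffusive (MALA) vs. coherent (HMC) gradient-based proposals
# on a d-dimensional standard normal. Step sizes are hand-picked for
# illustration; this is not Stan's adaptive implementation.
import numpy as np

rng = np.random.default_rng(1)
d = 50

def logp(x):                      # log density of N(0, I), up to a constant
    return -0.5 * np.dot(x, x)

def grad_logp(x):
    return -x

def mala_step(x, eps):
    """One Metropolis-adjusted Langevin step: a single gradient-guided jitter."""
    prop = x + 0.5 * eps**2 * grad_logp(x) + eps * rng.normal(size=d)

    def log_q(a, b):              # log density of proposing a when starting at b
        mean = b + 0.5 * eps**2 * grad_logp(b)
        return -np.dot(a - mean, a - mean) / (2 * eps**2)

    log_alpha = logp(prop) - logp(x) + log_q(x, prop) - log_q(prop, x)
    return prop if np.log(rng.uniform()) < log_alpha else x

def hmc_step(x, eps, n_leapfrog):
    """One HMC transition: n_leapfrog coherent leapfrog steps, then accept/reject."""
    p = rng.normal(size=d)                      # fresh momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * eps * grad_logp(x_new)       # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += eps * p_new
        p_new += eps * grad_logp(x_new)
    x_new += eps * p_new                        # last full position step
    p_new += 0.5 * eps * grad_logp(x_new)       # final half step for momentum
    h_old = -logp(x) + 0.5 * np.dot(p, p)
    h_new = -logp(x_new) + 0.5 * np.dot(p_new, p_new)
    return x_new if np.log(rng.uniform()) < h_old - h_new else x

x_mala = np.zeros(d)
x_hmc = np.zeros(d)
for _ in range(1000):
    x_mala = mala_step(x_mala, eps=0.3)
    x_hmc = hmc_step(x_hmc, eps=0.2, n_leapfrog=20)
# For a standard normal, both squared radii should end up near d.
print("squared radius, MALA:", np.dot(x_mala, x_mala), " HMC:", np.dot(x_hmc, x_hmc))
```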

The “Conceptual Introduction” paper goes into much, much more detail.


Yes! Cole Monnahan’s paper provides some practical comparisons. Or you can see the original NUTS paper, which compares to Metropolis and Gibbs on a very simple 250-dimensional multivariate normal (where Gibbs and optimally tuned symmetric random-walk Metropolis fail to converge after 1M iterations and NUTS nails the posterior in 1K iterations).

The problem with all this O(N^(5/4)) vs. O(N^2) stuff is that it doesn’t take into account the constant factor, which can be huge for Gibbs or Metropolis if there are highly correlated parameters in the posterior.
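
For a feel of how big that constant can get, take a d-dimensional Gaussian in which every pair of coordinates has correlation ρ: the covariance eigenvalues are 1 + (d−1)ρ (once) and 1 − ρ (repeated d−1 times), and their ratio is a rough proxy for how much a random-walk or Gibbs sampler has to shrink its moves relative to the widest direction. A cartoon calculation, not a mixing-time bound:

```python
# Cartoon: how correlation inflates the "constant factor" for RWM/Gibbs.
# Equicorrelated d-dimensional Gaussian: eigenvalues are 1 + (d-1)*rho (once)
# and 1 - rho (d-1 times); their ratio is a rough proxy for the slow-down.
import numpy as np

def eigenvalue_ratio(d: int, rho: float) -> float:
    cov = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
    eig = np.linalg.eigvalsh(cov)            # ascending eigenvalues
    return eig[-1] / eig[0]

for d in (2, 10, 250):
    for rho in (0.0, 0.5, 0.9):
        print(f"d={d:4d}  rho={rho:.1f}  ratio={eigenvalue_ratio(d, rho):8.1f}")
```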

But forget all the theory. What does it do for problems you care about?

The main thing we worry about is that RWM and Gibbs can both fail in ways that won’t be diagnosed by R-hat or n_eff estimates, because they can do a good job of exploring a subset of the posterior, so it looks like everything’s mixing well.

The question arises as to when it becomes a new algorithm. We’d probably be on our tenth named algorithm by now if we were ML researchers.

The current “NUTS” implementation differs from the original in two important ways:

  • warmup adaptation: Michael substantially changed this for the better in Stan 2.0 and has tweaked it a bit since then

  • replacing the post-warmup NUTS slice sampling with discrete (multinomial, size-1 draw) sampling along the path (sketched below)

Also the termination criterion was generalized to admit dynamic Riemannian HMC implementations. And that ignores all of the diagnostics that have been added. Basically the skeleton of NUTS is still there, but the flesh around each bone has been replaced with more principled and better performing muscle. Maybe not the best analogy.
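
A bare-bones sketch of what that multinomial size-1 draw means (the states and energies below are placeholders, and this omits the progressive weighting toward later sub-trees that the actual implementation uses, as I understand it): given the Hamiltonian values at the points a trajectory visits, pick one point with probability proportional to exp(−H).

```python
# Sketch of a multinomial (size-1) draw along a trajectory, replacing the
# original NUTS slice-sampling step. States and energies are placeholders;
# the real implementation also biases the draw toward later sub-trees.
import numpy as np

rng = np.random.default_rng(2)

def select_point(states, hamiltonians):
    """Draw one state with probability proportional to exp(-H)."""
    h = np.asarray(hamiltonians, dtype=float)
    w = np.exp(-(h - h.min()))          # subtract the min for numerical stability
    return states[rng.choice(len(states), p=w / w.sum())]

# Toy usage: 8 trajectory points with made-up energies.
states = [np.array([float(i)]) for i in range(8)]
hams = [3.0, 2.5, 2.2, 2.1, 2.4, 2.8, 3.5, 4.0]
print(select_point(states, hams))
```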

Have you considered versioning NUTS? NUTS2.02

NUTS2: Statistics Boogaloo


That was super cool and is what we were calling “GeoNUTS” informally. (If you write it like Andrew would, as “Geonuts”, then it looks like it could rhyme with “donuts”.)

Michael weighed in as being opposed to both calling the current thing “NUTS” and to renaming every algorithm modification machine-learning style.

Like many nice ideas, when NUTS was implemented it wasn’t understood theoretically nearly as well as it is now after Michael’s exhaustive HMC paper (on arXiv). Word on the street is that Michael has some more theoretical analyses of well known algorithms in progress.


Nice! We haven’t had a @betanalpha head-scratcher for a while.

Or we could follow the pronunciation of geoduck so that it would be pronounced “gooey NUTS”. Fun to say, probably not all that appropriate for a general tool. Same reason why “Geometric Optimization via Neighborhood Adaptation Design” was canned.

It would give my papers more citations and possibly myself more credit, but it just confuses the users of Stan unnecessarily. I think that we really want to reinforce the ideal where the computation is abstracted away and made as opaque as possible to the users. They should know about tree depths and divergences, but not the details.

Are you referring to the optimization stuff?

Indeed, as it sounded like a similar kind of research to me.

That may be because today, I’m a lumper, not a splitter*. I just finished my research statement for promotion where I tried to make 30 years of research history sound coherent (made doubly challenging because I think of myself as more of an engineer than a researcher these days and because Stan is such a group effort).


* I was recently surprised that a bunch of people I talked to didn’t know this distinction by name, so here it is: lumpers prefer to group things into broad categories, while splitters prefer to divide them into many finely distinguished ones.

Sort of – some new foundational understanding of existing algorithms in optimization that may help clarify what works and why.

I’m looking forward to that. I really liked your concentration of measure/ring video. It made sense of why optimization doesn’t work in high dimensions.

I still wonder how people get away with optimizing on big neural network models though. Is it just that they have so much data/regularization that it doesn’t matter?

Yes, they have a lot of data and regularization. The state of the art is stochastic optimization with step sizes large enough that it can’t converge to narrow modes. It does matter. There are some approaches combining variational methods and optimization which show that even approximately integrating over some of the parameters helps, but these approaches still need regularization of the optimization. There are some smaller examples showing the benefits of MCMC, and there are people working on improving approximate integration for neural networks, but it’s very likely that the strong coupling between the model and the algorithm, which affects the predictions, will remain.
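
A toy illustration of the “step sizes large enough that it can’t converge to narrow modes” point (a made-up 1-D loss with one narrow and one wide minimum, nothing to do with a real network): with a small step, noisy gradient descent sits happily in the narrow well, while with a large step it gets kicked out and settles in the wide one.

```python
# Toy: noisy gradient descent with a large step size escapes a narrow minimum
# but can settle in a wide one. The loss below is made up for illustration:
# a narrow quadratic well at x = 0 (curvature 100) spliced with a wide
# quadratic well at x = 5 (curvature 1); we follow the gradient of the nearer well.
import numpy as np

rng = np.random.default_rng(3)

def grad_loss(x):
    return 100.0 * x if abs(x) < abs(x - 5.0) else (x - 5.0)

def run(step, x0=0.0, iters=2000, noise=0.5):
    x = x0
    for _ in range(iters):
        x -= step * (grad_loss(x) + noise * rng.normal())
    return x

print("small step, started in the narrow well:", run(step=0.001))  # stays near 0
print("large step, started in the narrow well:", run(step=0.05))   # ends near 5
```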

It’s that they don’t care about uncertainties – when you compare models based on point predictions, all you need to do is find one good parameter setting, not all of them. Unfortunately that doesn’t carry over to real-world problems where accurate quantification of uncertainty is critical to robust decision making.


Ah, interesting, I hadn’t thought about using big step sizes on purpose to stay out of narrow modes that don’t have much posterior mass. For the variational methods, do they use VI to get posteriors on all the weights? That seems hard with some of these models having millions of weights.

Do you remember the authors/links to the approaches integrating over the parameters?

Heuristically, with careful inits (random, semi-supervised, etc.). These algorithms are only guaranteed to find local optima from their starting points, and even that’s suspect given the complicated geometry. But the point isn’t so much to optimize some training objective as to optimize a test objective. Finding the true maximum isn’t important by that criterion, as there’s no guarantee the (penalized) MLE is going to give you good performance. Hence all the ad hoc stopping criteria (approximate penalties/regularizations) and training regimes that can be made to work in practice.
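
A minimal sketch of “optimize a test objective, not the training objective” (toy linear-regression data and a plain gradient loop, not any particular deep-learning setup): descend the training loss, but keep whichever iterate did best on held-out data.

```python
# Toy early stopping: descend the training loss, keep the parameters that
# score best on held-out data. Data and model are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 40                                   # few data points, many parameters
X = rng.normal(size=(n, p))
true_w = np.concatenate([np.ones(5), np.zeros(p - 5)])
y = X @ true_w + rng.normal(scale=1.0, size=n)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

w = np.zeros(p)
best_w, best_val = w.copy(), np.inf
for t in range(2000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= 0.01 * grad
    val = np.mean((X_val @ w - y_val) ** 2)
    if val < best_val:                          # early-stopping bookkeeping
        best_val, best_w = val, w.copy()

print("training MSE of final iterate  :", np.mean((X_tr @ w - y_tr) ** 2))
print("validation MSE of final iterate:", np.mean((X_val @ w - y_val) ** 2))
print("validation MSE at early stop   :", best_val)
```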

I don’t understand this line of argument. As Andrew likes to say, they can make self-driving cars and play Go. I think that satisfies what most people think of as robust “real world” decision making.

What doesn’t carry over are the theoretical guarantees along the lines of Savage. And note that those only hold if the model is correct. When the model’s misspecified, as it almost always is, you get accurate quantification, but with respect to the wrong model. So all the guarantees of robust decision making go away and it all comes back to things like empirical posterior predictive checks for calibration.


I think the real issue is one of ductility of failure. You can make a self-driving car, but it’s even better if you can make a self-driving car that knows when it isn’t sure what is going on and can choose safer actions when it is more uncertain, etc.
