Arguments for HMC

A few years ago, I was at a Bayesian conference when one of the speakers said, somewhat apologetically, that they had started work on their project two years earlier, when “everybody was into Stan, but of course now everybody uses the Bouncy Particle Sampler”. At the time I laughed at the absurd fashion-chasing, but now I’m less sure: this kind of concern comes up in training courses, coaching, and so on.

I’m interested in what people say in such a discussion. What are your arguments in favour of HMC? There’s of course a good reason to use a probabilistic programming language, but setting that aside…

3 Likes

I believe there have been several discussions on this topic. IIRC this is the most recent one.

1 Like

Nice, thanks for pointing me to that. I especially appreciated contributions there from @maxbiostat and @spinkney. I was less interested in defending Stan the platform (that seems straightforward) than in defending HMC/NUTS.

2 Likes

TL;DR: Dynamic HMC is robust enough and has good diagnostics. You might find an algorithm that outperforms it, but the implementation overhead better be worth it.

I honestly find this a bit weird. The way I see it, algorithms are ways of solving problems. What matters is the problem, not the algorithm. So whenever someone says “HMC >> RWMH” or “Bouncy >> HMC”, my first question is: “for which target(s)?” and the second is “what kind of gains are we talking about?”. As an applied statistician, I care deeply about having something dependable I can write my models in so I can focus on the modelling. I love the fact I can write the model I have in mind in Stan and get posterior expectations in a fast and robust fashion.

So Stan is great because it’s a well-maintained language with a mature implementation of a sampling method (dynamic HMC, dHMC) which is robust enough to work well in many cases. More importantly, perhaps, there are built-in diagnostics that tell you when things go awry.

As regards dHMC specifically [which I think is really the core of your question], I think the particular implementation in Stan is robust enough to outperform naive Gibbs and MH for most targets people care about, which, coupled with the fact that it’s easy to use thanks to Stan, makes it really attractive as a first stab at solving a particular sampling problem. The diagnostics, which are not exclusive to Stan and thus should be credited to dHMC itself, are also a big plus.

There are models that are not very easy to fit with HMC (in Stan), such as those with discrete latent structure, but the silver lining is that marginalising over the discrete parameters actually leads to more efficient estimators of the other quantities in the model.
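
To make that concrete, here is a minimal sketch (in Python, with purely illustrative parameter values) of what marginalising a discrete latent variable looks like for a two-component Gaussian mixture: the per-observation indicator is summed out with a log-sum-exp, leaving a smooth log-density in the continuous parameters only, which is exactly the form you would write in a Stan model block.

```python
# Sketch only: two-component Gaussian mixture with the discrete
# component indicators z_n marginalised out via log-sum-exp.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def mixture_loglik(y, theta, mu, sigma):
    """log p(y | theta, mu, sigma) with the indicators summed out."""
    # log of: theta * N(y | mu[0], sigma[0]) + (1 - theta) * N(y | mu[1], sigma[1])
    comp = np.stack([
        np.log(theta) + norm.logpdf(y, mu[0], sigma[0]),
        np.log1p(-theta) + norm.logpdf(y, mu[1], sigma[1]),
    ])
    return logsumexp(comp, axis=0).sum()

# Example: evaluate the marginal log-likelihood at some parameter values.
y = np.array([-1.2, 0.3, 4.1, 5.0])
print(mixture_loglik(y, theta=0.4, mu=[0.0, 4.5], sigma=[1.0, 0.8]))
```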

One of the things I work on is phylogenetics, where HMC is slowly coming through, but I’m sure other (non-reversible, mainly) algorithms could still provide substantial gains. So, in that realm, HMC could well not be the best choice. If someone comes up with Stan for trees, though, the arguments above might apply.

5 Likes

Good point about diagnostics. Anyone writing their own code for newer sampling algorithms will have to pick over outputs forensically.

1 Like

I feel confident saying that no one ever really used the bouncy particle sampler when it came out, and definitely no one uses it today. It was popular in the more theoretical computational statistics literature because it is more amenable than something like Hamiltonian Monte Carlo to many of the mathematical analysis techniques that are popular in the field.

One problem is that much of the computational statistics literature falls prey to the same temptations as the machine learning literature – contributions have to be groundbreaking to be appreciated, and so everyone says that their new methods are groundbreaking. Anyone judging the current algorithm situation by attending conferences will be easily misled about which methods are state of the art and which methods are actually used in applied work. There’s not much that anyone can do about that other than have educational material ready for those who are interested.

Hamiltonian Monte Carlo is special because it’s built around the right problem – constructing and implementing measure-preserving flows to rapidly explore externally-specified target distributions. Measure-preservation ensures that the exploration stays within the typical set even with increasing dimension, and the coherency of the flow ensures that the exploration is as computationally efficient as possible, even when the typical set is complex. Hamiltonian Monte Carlo is essentially* the unique general method for constructing and implementing a measure-preserving flow for smooth target distributions defined over continuous spaces, making it uniquely suited for probabilistic programming applications where target distributions are bespoke to each application and can’t be guaranteed to have any nice mathematical structure (other than smoothness).

*long geometry paper coming at some point
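
To make the mechanics concrete, here is a minimal sketch (in Python, and emphatically not Stan’s implementation) of a single Hamiltonian Monte Carlo transition: a leapfrog integrator approximates the measure-preserving Hamiltonian flow, and a Metropolis correction accounts for the discretisation error.

```python
# Sketch only: one static-HMC transition with a leapfrog integrator.
# Stan's dynamic HMC adds adaptive trajectory lengths, step-size and
# metric adaptation, and multinomial sampling along the trajectory.
import numpy as np

def leapfrog(grad_logp, q, p, step_size, n_steps):
    """Approximate the Hamiltonian flow with the leapfrog scheme."""
    q, p = q.copy(), p.copy()
    p += 0.5 * step_size * grad_logp(q)       # half step for momentum
    for _ in range(n_steps - 1):
        q += step_size * p                    # full step for position
        p += step_size * grad_logp(q)         # full step for momentum
    q += step_size * p
    p += 0.5 * step_size * grad_logp(q)       # final half step
    return q, -p                              # negate momentum for reversibility

def hmc_step(logp, grad_logp, q, step_size, n_steps, rng):
    """One HMC transition with a Metropolis accept/reject correction."""
    p = rng.standard_normal(q.size)           # resample momentum
    q_new, p_new = leapfrog(grad_logp, q, p, step_size, n_steps)
    h_old = -logp(q) + 0.5 * p @ p            # Hamiltonian before
    h_new = -logp(q_new) + 0.5 * p_new @ p_new  # Hamiltonian after
    return q_new if np.log(rng.uniform()) < h_old - h_new else q
```

Because the underlying flow preserves the target measure, even long trajectories land in regions of high probability and are accepted often, which is where the efficiency in point 1 below comes from.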

Once you frame everything in the scope of that problem it’s straightforward to motivate Hamiltonian Monte Carlo.

  1. When Hamiltonian Monte Carlo is well-suited to a given target distribution (in the sense that a Markov chain Monte Carlo central limit theorem exists) it generates precise Markov chain Monte Carlo estimators very efficiently.

  2. Dynamic Hamiltonian Monte Carlo implementations are well-suited to an extremely large class of target distributions, making it a robust choice for a default computational method. This can be verified both theoretically and empirically.

  3. When Hamiltonian Monte Carlo is not well-suited to a given target distribution (again in the sense that a Markov chain Monte Carlo central limit theorem does not exist) it fails in very distinct ways that let us know that it’s not working. This kind of sensitive self-diagnostic is extremely rare in computational statistical methods – indeed much of my work involves using Stan to find problems in popular modeling techniques that have been ignored and to develop practical resolutions.

In other words, when Hamiltonian Monte Carlo works it works really fast, it works for a larger class of target models than most other algorithms, and when it fails it lets us know.
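
In practice those failure modes surface through the standard diagnostics: divergent transitions, saturated tree depths, low E-BFMI, and split R-hat. As a rough sketch of how cheaply they can be checked (assuming CmdStanPy; the file names here are just placeholders):

```python
# Sketch only: fit a model with CmdStanPy and print the built-in checks
# (divergences, tree-depth saturation, E-BFMI, split R-hat, ESS).
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="my_model.stan")     # placeholder model file
fit = model.sample(data="my_data.json", chains=4)   # placeholder data file
print(fit.diagnose())   # wraps CmdStan's diagnose utility
print(fit.summary())    # per-quantity R-hat and effective sample sizes
```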

7 Likes

Thanks Mike, great points there that we can all use when we encounter clients/students/whoever with stars in their eyes about the latest fashionable algorithm.

Personally, I think the best education is getting people to try sampling some nasty density, like a high-dimensional banana or a multimodal mixture, with Stan and with something else like SIR or PDMPs. You soon realise (as I did) that initial results showing high speed on compliant densities do not matter a hoot compared to robustness across a wide class of problems.
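
For anyone who wants to try that exercise, here’s a minimal sketch (in Python; the density, step size, and iteration count are purely illustrative) of a 2-D Rosenbrock-style “banana” log-density together with a naive random-walk Metropolis sampler to put up against Stan’s dynamic HMC on the same target:

```python
# Sketch only: a curved-ridge "banana" target and a plain random-walk
# Metropolis sampler to compare against dynamic HMC on the same target.
import numpy as np

def banana_logpdf(x, a=1.0, b=5.0):
    """Rosenbrock-style target: a narrow curved ridge."""
    return -(a - x[0])**2 - b * (x[1] - x[0]**2)**2

def rwmh(logpdf, x0, n_iter=50_000, step=0.25, seed=0):
    """Plain random-walk Metropolis with Gaussian proposals."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = logpdf(x)
    draws = np.empty((n_iter, x.size))
    accepted = 0
    for i in range(n_iter):
        prop = x + step * rng.standard_normal(x.size)
        lp_prop = logpdf(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
            accepted += 1
        draws[i] = x
    return draws, accepted / n_iter

draws, acc_rate = rwmh(banana_logpdf, x0=[0.0, 0.0])
print(f"acceptance rate: {acc_rate:.2f}")
```

The same log-density is only a couple of `target +=` lines in a Stan model block, so the side-by-side comparison described above is easy to set up.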

1 Like