Developing a Deeper Intuition for HMC: Connecting the Underlying Concepts with the Hamiltonian Equations

I’m trying to develop a deeper understanding of HMC and how the sampling in Stan works, but it has proven difficult given the two main sources (Neal, Betancourt) that are out there. Most other tutorials are just copies of these two resources without providing any extra intuition. I believe many of the questions I ask below are shared by other HMC beginners. I also believe that someone with a thorough understanding of this algorithm will be able to follow my line of reasoning and quickly identify where it goes wrong. Hopefully, the answers to my questions will aid other practitioners starting out with HMC.

I will start by describing the algorithm and some intuitions, then present two physical analogies, and finally pose my questions.

Algorithm and Intuition

HMC is typically used to sample from the posterior distribution of parameters \mathbf{x} given data \mathbf{D}. According to Bayes’ theorem, the posterior is:

p(\mathbf{x} \mid \mathbf{D}) = \frac{p(\mathbf{D} \mid \mathbf{x}) p(\mathbf{x})}{p(\mathbf{D})},

where:

  • p(\mathbf{D} \mid \mathbf{x}) is the likelihood of the data,
  • p(\mathbf{x}) is the prior,
  • p(\mathbf{D}) = \int p(\mathbf{D} \mid \mathbf{x}) p(\mathbf{x}) \, d\mathbf{x} = E_{\mathbf{x}}[p(\mathbf{D} \mid \mathbf{x})] is the marginal likelihood (normalization constant).
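As a concrete (and entirely hypothetical) illustration of these pieces in code, here is how an unnormalized log posterior could be written for a toy model with a normal likelihood and a normal prior on its mean; the function names and the model are my own, not anything from Stan. Note that p(\mathbf{D}) never appears, which is relevant to question 3 below.

```python
import numpy as np

def log_prior(mu):
    # standard normal prior on mu: log N(mu | 0, 1), dropping additive constants
    return -0.5 * mu**2

def log_likelihood(mu, data):
    # normal likelihood with known unit variance, dropping additive constants
    return -0.5 * np.sum((data - mu)**2)

def log_posterior_unnorm(mu, data):
    # log p(mu | D) up to the additive constant -log p(D),
    # which does not depend on mu and is never needed by MCMC
    return log_prior(mu) + log_likelihood(mu, data)
```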

We aim to sample from p(\mathbf{x} \mid \mathbf{D}). I like to explicitly state the normalization constants. Betancourt’s focus is on expectations, likely because almost any analysis we want to conduct can be operationalized through the computation of an expectation. Related to this is the typical set, which is where the probability mass p(x) \, dx concentrates. This is an abstract concept that can be hard to understand. The main idea is that as the number of dimensions increases, the growth of the volume element dx away from the mode starts to outweigh the decay of the density p(x), so most of the mass lies outside the neighborhood of the mode.

The idea behind HMC is to construct a vector field that aligns with this typical set. To do this, we utilize the gradients of our target distribution. We then need momentum to move around this landscape so that we don’t crash into the mode (since gradients point toward the mode). We also need to carefully add this momentum to stay within the typical set without drifting away into regions of low p(x) \, dx .

In HMC, the idea is to define a Hamiltonian and its corresponding equations for how the system evolves. The Hamiltonian is given by:

H(\mathbf{x}, \mathbf{p}) = U(\mathbf{x}) + K(\mathbf{p}),

where U(\mathbf{x}) = -\log p(\mathbf{x} \mid \mathbf{D}) is the potential energy, and K(\mathbf{p}) = \frac{1}{2} \mathbf{p}^\top M^{-1} \mathbf{p} is the kinetic energy, with momentum \mathbf{p} and mass matrix M.

Hamilton’s equations are:

\frac{d\mathbf{x}}{dt} = \frac{\partial H}{\partial \mathbf{p}}, \quad \frac{d\mathbf{p}}{dt} = -\frac{\partial H}{\partial \mathbf{x}}.

Explicitly, these become:

\frac{d\mathbf{x}}{dt} = M^{-1} \mathbf{p}, \quad \frac{d\mathbf{p}}{dt} = -\nabla U(\mathbf{x}),

where \nabla U(\mathbf{x}) is the gradient of the potential energy.

The Hamiltonian represents the total energy in a system, and its equations describe a system that preserves energy. If the kinetic energy increases, the potential energy decreases, and vice versa.
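To make the energy-conservation statement explicit, differentiate H along a trajectory and substitute Hamilton’s equations:

\frac{dH}{dt} = \frac{\partial H}{\partial \mathbf{x}} \cdot \frac{d\mathbf{x}}{dt} + \frac{\partial H}{\partial \mathbf{p}} \cdot \frac{d\mathbf{p}}{dt} = \frac{\partial H}{\partial \mathbf{x}} \cdot \frac{\partial H}{\partial \mathbf{p}} - \frac{\partial H}{\partial \mathbf{p}} \cdot \frac{\partial H}{\partial \mathbf{x}} = 0.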

Canonical Distribution

Another important distribution is the canonical distribution, which describes the probability of being in a specific energy state. It is given by:

p(x) = \frac{1}{Z} e^{\frac{-E(x)}{T}},

where E(x) is the energy and T is the temperature. Setting T = 1 and substituting the Hamiltonian for the energy, the canonical distribution over phase space becomes:

p(\mathbf{x}, \mathbf{p}) = \frac{1}{Z} e^{-U(\mathbf{x})} e^{-K(\mathbf{p})}.

Key points to note:

  1. The Hamiltonian is conserved along its trajectories, so during a trajectory the probability density in the canonical distribution remains constant.
  2. The two factors, e^{-U(\mathbf{x})} and e^{-K(\mathbf{p})}, are independent, so position and momentum can be sampled separately.

The process in HMC starts with:

  1. Sampling momentum from a normal distribution.
  2. Solving the Hamiltonian equations for a while in phase space.
  3. Discarding the momentum and keeping the position as a sample from the target distribution (why this works is sketched just below).
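Why step 3 is valid (a one-line check using the factorization of the canonical distribution above): integrating the momentum out of the joint leaves exactly the target,

\int p(\mathbf{x}, \mathbf{p}) \, d\mathbf{p} = \frac{1}{Z} e^{-U(\mathbf{x})} \int e^{-K(\mathbf{p})} \, d\mathbf{p} \propto e^{-U(\mathbf{x})} = e^{\log p(\mathbf{x} \mid \mathbf{D})} = p(\mathbf{x} \mid \mathbf{D}),

so dropping \mathbf{p} from a draw of (\mathbf{x}, \mathbf{p}) leaves a draw of \mathbf{x} from the posterior.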

Since Hamilton’s equations cannot in general be solved analytically, symplectic integrators (in Stan, the leapfrog integrator) are used. These introduce some discretization error, but they are volume-preserving and their energy error stays bounded, so the trajectory does not drift arbitrarily far from its energy level. A Metropolis acceptance step then corrects exactly for the remaining error.
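Here is a minimal, self-contained sketch of one HMC transition, leapfrog plus Metropolis correction, for a toy standard-normal target. This is only an illustration of the scheme just described, not Stan’s implementation (Stan additionally adapts the step size and mass matrix and chooses trajectory lengths dynamically).

```python
import numpy as np

rng = np.random.default_rng(0)

def U(x):
    # potential energy: minus the (unnormalized) log target.
    # Toy target: standard normal, so U(x) = 0.5 * sum(x^2).
    return 0.5 * np.dot(x, x)

def grad_U(x):
    return x

def hmc_step(x0, eps=0.1, n_leapfrog=20):
    # 1. sample momentum (identity mass matrix assumed)
    p0 = rng.standard_normal(len(x0))
    x, p = x0.copy(), p0.copy()

    # 2. leapfrog integration of Hamilton's equations
    p -= 0.5 * eps * grad_U(x)          # half step for momentum
    for _ in range(n_leapfrog - 1):
        x += eps * p                    # full step for position
        p -= eps * grad_U(x)            # full step for momentum
    x += eps * p                        # last full position step
    p -= 0.5 * eps * grad_U(x)          # final half step for momentum

    # 3. Metropolis correction for the discretization error
    h0 = U(x0) + 0.5 * np.dot(p0, p0)
    h1 = U(x) + 0.5 * np.dot(p, p)
    if np.log(rng.uniform()) < h0 - h1:
        return x          # accept the end of the trajectory
    return x0             # reject: keep the old point

# usage: run a short chain from an arbitrary starting point
x = np.full(10, 5.0)
samples = []
for _ in range(1000):
    x = hmc_step(x)
    samples.append(x.copy())
```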

Physical Analogies

Neal’s Hockey Puck Analogy

In two dimensions, we can visualize the dynamics as that of a frictionless puck sliding over a surface of varying height. The state consists of the position of the puck, q, and the momentum of the puck, p. Potential energy U(q) is proportional to the surface height, and kinetic energy K(p) is proportional to the square of the momentum. On a level surface, the puck moves at a constant velocity. On a slope, the puck’s momentum decreases as it climbs (potential energy increases) until it stops, then slides back down (potential energy decreases and kinetic energy increases).

Betancourt’s Orbit Analogy

Exploring the typical set is like placing a satellite in a stable orbit around a planet. If the satellite starts at rest, it crashes into the planet. Adding the right amount of momentum allows the satellite to balance gravitational attraction, maintaining a stable orbit. Too little momentum causes a crash; too much causes the satellite to escape.


Questions

  1. Question 1: Why do we find the typical set given that we start at an arbitrary point, and once we have found it, why do we stay in it?

    I am trying to intuitively understand this and connect it to the equations stated previously. This question really ties in with the two different physical analogies, and here is where my understanding goes off.

    If we take Betancourt’s analogy with the orbit: a trajectory stays on its energy level whether or not that level passes through the typical set. When we randomly sample the momentum, we move to another energy level. This sample is drawn independently, not depending on our position, so how can we assume that the new level will be in the typical set? Furthermore, when we start out, why do we find the typical set at all? What in the equations determines this?

    If we view it from Neal’s perspective, things feel a bit easier. If we start at a position of high potential energy (thus low probability density in our target distribution), the momentum will also be low. If we tie this together with Betancourt’s orbit, this also means that we will start falling toward the mode and thus toward higher probability density. When we are at high probability density, the momentum will be higher, so we will not get stuck.

    I am trying to tie this together with the equations. What disturbs me about the equations is that, as I have understood it, we have the gradients of the target distribution, which point towards the different local modes. We then flip them so that they are perpendicular to these gradients. Why would a random point in space then be pushed towards regions of higher probability density? Why do we then fall into the typical set, and why exactly the typical set? I don’t see any terms involving p(x)\,dx.

  2. Why do we even need the canonical distribution, and why these energy levels? Why do we want to have a system that is energy-preserving?

    This is maybe the largest hurdle to get your head around, and I feel that it ties in with the previous question. The energy function feels extremely ad hoc. Why do we even define it? Why do we need it? I feel it ties together with the fact that if the kinetic energy goes up, the potential energy has to go down in some sense… but this is already built into the Hamiltonian system, so why do we need the canonical distribution?

  3. Why does the normalization constant in our target distribution disappear, and what does this mean intuitively?

    • Potential answer: During the derivation of the Hamiltonian equations it cancels out, since it does not depend on our parameters (the derivative of U with respect to \mathbf{x} of something not depending on \mathbf{x} is zero).
    • What this means intuitively: We are traversing actual posterior space when simulating the system, i.e., the target distribution itself rather than the target distribution up to a constant.
  4. The idea of the typical set and why Betancourt is so focused on it:

    • Potential answer: The idea of the typical set is really only relevant in high dimensions. We can’t traverse the whole posterior space in high-dimensional problems, so we need to be smart about where we spend our time. What happens in high dimensions is that the growth in volume away from the neighborhood of the mode starts to outweigh the drop in probability density, but only up to a point. Somewhere in between the mode and the tails is the typical set, where p(x)\,dx is highest (see the small numerical sketch after this list).

    • The question: How can we be sure that we actually traverse the typical set and not the region around the mode? What in the equations says that we will focus on the space where p(x)\,dx is highest?

  5. Why is Betancourt so focused on expectations?

    • Potential answer: I thought that what we really want in Bayesian inference are posteriors. Estimating a posterior involves estimating an expectation (the evidence/normalization constant). Furthermore, expectations cover more applications: for example, we can compute the median of a posterior via expectations (of indicator functions), and so on. We can reformulate almost anything of interest as an expectation.
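To make the point in question 4 concrete, here is a small numerical sketch (my own toy check, nothing from Betancourt): for a standard normal in d dimensions, samples concentrate at distance roughly \sqrt{d} from the mode, so almost none of the draws land close to the point of highest density once d is moderately large.

```python
import numpy as np

rng = np.random.default_rng(1)

for d in [1, 10, 100, 1000]:
    x = rng.standard_normal((10000, d))   # 10k samples from N(0, I_d)
    r = np.linalg.norm(x, axis=1)          # distance from the mode
    near_mode = np.mean(r < 1.0)           # fraction within unit distance of the mode
    print(f"d={d:5d}  mean radius={r.mean():7.2f}  "
          f"(~sqrt(d)={np.sqrt(d):6.2f})  P(|x|<1)={near_mode:.4f}")
```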

I’m relatively amateur with this, so let me co-explore. Some impressions:

Metropolis-Hastings has a kernel for moving in posterior space; it’s essentially diffusion. That’s almost by definition local, in the sense that small random fluctuations are summed up and the average movement over time T is proportional to \sqrt{T}, so it is slow.
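A tiny simulation of that \sqrt{T} scaling (a plain symmetric random walk as a stand-in for a local kernel; just my own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

for T in [100, 1000, 10000]:
    # 1000 independent walks of T unit steps each
    steps = rng.choice([-1.0, 1.0], size=(1000, T))
    displacement = np.abs(steps.sum(axis=1))
    print(f"T={T:6d}  mean |displacement| = {displacement.mean():7.1f}  "
          f"sqrt(T) = {np.sqrt(T):6.1f}")
```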

HMC meanwhile creates an artificial Hamiltonian flow, which allows faster transfer. Incidentally (or maybe not), gradients are of good use there.

For that, it augments the original space with the momentum “dual”. If the original probability was q, we now have momentum p too, so a p-q space.

By preservation of energy, if a place is low in q (probability), you move fast (high p), and end up in a place where q is higher and you move slowly. So on average you are in a high-probability region. Nothing guarantees you stay there, but nothing guarantees your sample wouldn’t sometimes be from low q(x) either. (This was your question 1.)

Now due to our artificial construction, we can later just throw p away and keep q, and you get a representative sample of q. (There are densities and measures and coordinates, I don’t even try to get my notation right.)

Because p is artificial to start with, you can choose your momentum coordinates freely? (They come from a normal distribution, but I’m missing all the details here. The mass matrix defines the metric of the space, though.) For efficient sampling, you want different trajectories, so different energy levels. (This feels artificial to me too, and inefficient in some sense.)

The trajectory is not for climbing to the mode but for sampling. But it tends to, in terms of time, mostly lie in high-q(x) regions, just as q(x) “mostly lies in high q(x)-regions”. And when potential energy (low probability q) dominates momentum, then the system moves fast. (At the limit, does it approach the gradient of q?)

One energy level defines one trajectory, conditional on your coordinates. Many energy levels define many trajectories and allow escape from single trajectory. (Here I’m lost too, in terms of guarantees that this leads to a complete enough exploration, and otherwise too there’s a sense of artificiality here. Better not to say more as I haven’t read the papers properly.)

I’m not an expert, but energy preservation must be at the heart of Hamiltonian mechanics. I guess it is necessary for Liouville’s theorem to hold, which in turn guarantees the preservation of our probability measure somehow? (And an AI says that it guarantees reversibility, which is important for the detailed balance of the sampler; not sure that makes any sense.)

Yes the normalization comes from where the system spends its time in a sense. For potential energy exp(q) the original normalization is just “plus const”, equivalent to a constant change in energy levels.

Expectations and Betancourt: if you are abstract and physical enough, the posterior is a pretty ephemeral object. It’s not even a density as the parameters are just an artificial coordinate system. Expectations over the posterior are more real though, and of course, invariant to reparametrizations of the model.

> By preservation of energy, if a place is low in q (probability), you move fast (high p), and end up in a place where q is higher and you move slowly. So on average you are in a high-probability region. Nothing guarantees you stay there, but nothing guarantees your sample wouldn’t sometimes be from low q(x) either. (This was your question 1.)

I believe it is the complete opposite… if we consider the standard way of writing U(\theta) = -\log p(\theta \mid x), this entails that when we are in regions of high probability density we will increase momentum (see the minus sign); otherwise we would constantly go towards the mode and crash into it from every newly sampled energy level?

> One energy level defines one trajectory, conditional on your coordinates. Many energy levels define many trajectories and allow escape from single trajectory. (Here I’m lost too, in terms of guarantees that this leads to a complete enough exploration, and otherwise too there’s a sense of artificiality here. Better not to say more as I haven’t read the papers properly.)

I wonder why we choose exactly this distribution. I mean, it has some properties that are useful (always > 0, integrates to one, etc., for arbitrary E), but other than that I don’t see its usage… why not directly define the joint distribution as \frac{1}{Z}\log(p(\theta \mid x)) \cdot N(0, M)? It feels even more strange when viewing this YouTube tutorial on the “intuition”: The intuiton behind the hamiltonian monte carlo, where there is a strong focus on what the canonical distribution represents… trajectories with lower energy get sampled more frequently.

> Yes the normalization comes from where the system spends its time in a sense. For potential energy exp(q) the original normalization is just “plus const”, equivalent to a constant change in energy levels.

I don’t understand this.

Should be \log q(x); the normalizer becomes a constant additive term in the total energy.

(I used q for probability because p is typically momentum, somewhat reserved.)

Yes, you are right of course, I should have stayed out of this topic. ;)

Maybe I’ll try to look at one of those “intuitive” videos.

There is definitely a need for proper tutorials on HMC… every single tutorial out there is just a copy of Betancourt’s conceptual introduction. It feels like everyone is just using it without proper knowledge of what is actually going on under the hood.

We also need to understand, to a certain extent, what the differential equations actually mean and how they affect the system. The flip of the derivatives and the minus sign must mean that we want to rotate the gradients by 90 degrees, a typical rotation in linear algebra.

It will be really interesting to see whether someone who really understands these things will answer here; I think we will have a lot of “aha” moments if that happens…

You seem to be referring to Hamilton’s equations? For that part there’s lots of introductory material from physics. In the context of HMC they appear in such an abstract form that understanding would have to be at the level of Hamiltonian mechanics in general. I don’t understand it, but I think Liouville’s theorem, or volume preservation of the phase space, is relevant and one of the more understandable aspects there.

The rotation-like nature of the equations is indeed curious. It’s btw one of my favorite tests for new LLMs, to ask them why Hamiltonian equations look like something is rotating. ;)

Expectations

Betancourt is focused on expectations because they are the most general mathematically valid tool for working with probability distributions. It’s not just that almost anything of interest can be reformulated as an expectation; it is anything of interest. Bayesian decision theory is entirely built on expected loss functions. If you’re making decisions that can’t be phrased as minimizing some expected loss function, your strategy is “incoherent” by some “Dutch book” argument.

Typical set

I think “typical set” is more of a metaphor. The point is that in high dimensional space, the mode is a poor representative of the distribution.

Sources rarely give a mathematically precise definition. The trouble is that p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x} is a volume form, not a real number, and forms at different locations cannot be compared in a coordinate-independent way. Mathematically speaking, there is no such thing as “the space where p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x} is the highest.”

Figure 3 in Betancourt’s Conceptual introduction suggests something like:

Compute the (1-dimensional) distribution of f\left(\mathbf{x}\right) = \log p\left(\mathbf{x}\right) and find the highest density interval with probability level \alpha. The inverse image of that interval f^{-1}\left[I_\alpha\right] is then the typical set.

Note that:

  1. There is not one typical set but an infinite family parametrized by the probability \alpha (the typical set in the limit \alpha\to 1 is just the support–it is not interesting)
  2. The typical set depends on the (coordinate-dependent) choice of the volume term \mathrm{d}\mathbf{x} (since p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x} is invariant and p\left(\mathbf{x}\right) = \frac{p\left(\mathbf{x}\right)\mathrm{d}\mathbf{x}}{\mathrm{d}\mathbf{x}})
  3. A fraction 1-\alpha of the samples should be outside the typical set

(1) and (3) mean it cannot be literally true that HMC trajectories stay in the typical set.
(2) is not relevant since HMC depends on parameterization anyway.
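To make that construction concrete, here is a small sketch for a standard normal target in d dimensions (the highest-density interval is found with a crude sliding window over sorted log densities; toy code, my own shorthand, not anything from Betancourt):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, alpha = 50, 100000, 0.9

x = rng.standard_normal((n, d))
logp = -0.5 * np.sum(x**2, axis=1)        # log density up to a constant

# highest-density interval of the 1-d distribution of log p(x):
# shortest window containing a fraction alpha of the sorted values
s = np.sort(logp)
k = int(alpha * n)
widths = s[k:] - s[:n - k]
i = np.argmin(widths)
lo, hi = s[i], s[i + k]

inside = np.mean((logp >= lo) & (logp <= hi))
logp_mode = 0.0                            # log density at the mode (same dropped constant)
print(f"fraction of samples inside the alpha={alpha} typical set: {inside:.3f}")
print(f"mode inside the typical set? {lo <= logp_mode <= hi}")
```

For d = 50 the mode’s log density lands far outside that interval, which is the usual “the mode is not in the typical set” picture, while a fraction of about 1 - \alpha of the samples also falls outside, matching point (3) above.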

Normalization constant

The normalization constant disappears from the equations because Markov Chain Monte Carlo (not just HMC but also Metropolis-Hastings, etc) solves the difficult problem of computing p\left(\mathbf{D}\right) using the elegant method of assuming that someone else has already solved it.

What I mean is: a single step of MCMC requires a random sample from the target distribution as an input and samples from near that point. You don’t sample from the target distribution but from some distorted version of the target distribution restricted to a neighborhood of the input sample. Quantities that depend on behavior outside the neighborhood, such as p\left(\mathbf{D}\right), are irrelevant.

The only global constraint is that you have to ensure the full target distribution is recovered when averaging over all input points drawn from the target distribution.

This pulling-yourself-up-by-your-own-bootstraps logic is most obvious in the limit where the Metropolis acceptance probability goes to zero: the output sample is the same as the input sample; the output is a good sample because the input was already good.

Of course, perfectly correlated samples aren’t useful. The point of MCMC is in injecting fresh randomness (creating uncorrelated samples) while preserving the distribution.
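As a minimal illustration of why the normalizer never enters, here is a single random-walk Metropolis step on an unnormalized log density (toy code, not how Stan implements anything): the acceptance probability depends only on a difference of log densities, so a constant \log p(\mathbf{D}) would cancel.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_target_unnorm(x):
    # unnormalized log density; any constant -log Z could be added here
    # without changing the acceptance ratio below
    return -0.5 * np.dot(x, x)

def metropolis_step(x, scale=0.5):
    proposal = x + scale * rng.standard_normal(len(x))
    # acceptance depends only on a *difference* of log densities,
    # so the unknown constant log p(D) cancels out
    log_ratio = log_target_unnorm(proposal) - log_target_unnorm(x)
    if np.log(rng.uniform()) < log_ratio:
        return proposal
    return x
```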

Canonical distribution

Designing a nontrivial transition kernel that preserves a given distribution isn’t so easy.
Metropolis correction can force any transition kernel to preserve the target distribution but if the kernel didn’t already approximately preserve the distribution, it just collapses into an almost-trivial kernel.

However, there is one situation where it becomes much easier: when the target space is a symplectic manifold. There Hamiltonian flow generates long distance transitions that successfully preserve the distribution, and a symplectic integrator can implement it with high Metropolis acceptance probability.

The great thing is that any space can be lifted into a symplectic space simply by adding momentum variables. The uplifted version of the distribution of interest augmented with IID normal momenta is called the canonical distribution. (The choice of distribution on momentum variables is arbitrary but you need to be able to sample it separately from position.)

Ergodicity

None of that answers the all-important question of what happens when we start at an arbitrary point outside of the “typical set.”

The iterated transition kernel converges to a stationary distribution, i.e., a distribution that no longer changes under application of the kernel. By construction, our target distribution is a stationary distribution, but the question is whether it is the same as the one our starting point converges to.

Alas, there’s no deep theory here. The hope is that the kernel has only one stationary distribution (Betancourt calls this condition “geometric ergodicity”). It is essentially impossible to prove that a specific kernel is ergodic, but it should hold generically.

I interpret “different local modes” to mean you’re thinking of a multimodal distribution? HMC is pretty bad at sampling from those. The typical set of a multimodal distribution is (at least if the modes are far apart) disconnected and HMC easily gets stuck in a single connected component.

Keep in mind that “flipping them perpendicular” happens in phase space, not configuration space. The force vector still points towards (configuration space) mode and high probability density but the force acts on momentum rather than position. If the gradient is larger than the initial momentum the particle will soon move into the high density region anyway, whizz past the mode at high speed and encounter an equally-strong gradient in the opposite direction that finally stops it. Then the sampler picks a random point along the trajectory, a point which often has a lot of kinetic energy.

That kinetic energy disappears on momentum resampling, leading to an overall energy loss. HMC approximates a randomized-step-size gradient descent, until the average kinetic energy gained from falling becomes less than the average kinetic energy of resampled momentum.

If the gradient is weak, the particle moves in a straight line until it finds a strong enough gradient. Randomly sampled momentum compels the particle to spread out into all available volume.

“Typical set” is the region where the “energetic force” of gradient pulling towards the high probability density is in balance with the “entropic pressure” of momentum resampling pushing out towards the maximal volume.


Betancourt was using this informally to mean the region of high posterior probability mass. The typical set in information theory is formally the set of all parameter values whose log densities are within \pm \epsilon of the expected log density, i.e.,

T_\epsilon = \{ \theta : \log p(\theta \mid y) \in \mathbb{E}[\log p(\theta \mid y) \mid y] \pm \epsilon \}.

See: Typical set - Wikipedia

For our purposes here, I would stay away from informal use of a technical term and informally talk about the region of high posterior probability mass, and formally talk about something like central intervals of log density rather than fixed intervals. This is what Betancourt’s more formal definition is doing with highest posterior density intervals, which doesn’t match the formal definition of “typical set” from information theory.

There is a really cool visualization of HMC… I think McElreath shared it once… showing the trajectories in a 2-d space. Any idea what I’m talking about?

I like this one: GitHub - chi-feng/mcmc-demo: Interactive Markov-chain Monte Carlo Javascript demos


nice, thanks for the link!