I want to build a small stand-alone model to make sure I have some basic, algorithmic understanding of Hamiltonian MC, on something like a simple linear regression with normally distributed residuals (y − Xβ ~ N(0, σ²)). I picked up somewhere that HMC is a variant of the Metropolis algorithm where we additionally leverage gradient information.
So geometrically, with the gradient of the negative log likelihood, we can follow it downhill to areas of higher probability mass (some local mode/local max). I have heard of (and used/implemented) the Metropolis-adjusted Langevin algorithm, and that is exactly this concept.
However, in Metropolis-adjusted Langevin (if I remember correctly, which is often not the case) we are only looking at the gradients of the likelihood function (unlike in the Riemann manifold Langevin). In Hamilton's equations we have two derivatives with respect to time, and I believe Michael says somewhere that the potential is completely determined by the target distribution. So in this case: is the gradient (of the likelihood function) that we use in Metropolis-adjusted Langevin to move to high-density areas the potential energy?
OK, then what is the kinetic energy? If I'm on the right track so far, and I have MALA (Metropolis-adjusted Langevin), I still have to come up with a kinetic energy. Can I think of the kinetic energy as the proposal distribution?
Say I used MALA and then explored that area with a multivariate Gaussian random walk: is MALA, then, a special case of Hamiltonian Monte Carlo with a naive choice of kinetic energy? (I think if we're using a multivariate Gaussian proposal distribution, this is the "Euclidean-Gaussian" kinetic energy, correct?)
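For concreteness, here is a minimal sketch of the target I have in mind (the prior, the fixed σ, and all the names are just placeholders of mine):

```python
import numpy as np

# Negative log posterior ("potential energy") for linear regression,
# y ~ N(X @ beta, sigma^2 I) with a beta ~ N(0, tau^2 I) prior;
# additive constants that don't depend on beta are dropped.
def neg_log_posterior(beta, X, y, sigma=1.0, tau=10.0):
    resid = y - X @ beta
    return 0.5 * resid @ resid / sigma**2 + 0.5 * beta @ beta / tau**2

# Its gradient, the only extra ingredient gradient-based samplers
# (MALA, HMC) need beyond what random-walk Metropolis uses.
def grad_neg_log_posterior(beta, X, y, sigma=1.0, tau=10.0):
    resid = y - X @ beta
    return -X.T @ resid / sigma**2 + beta / tau**2
```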
Have you seen the Betancourt intro paper (https://arxiv.org/abs/1701.02434)? That has good descriptions of the math. It’ll answer your questions.
The Radford Neal intro paper (https://arxiv.org/pdf/1206.1901.pdf) comes with copy-paste R code and a big section on tips-n-tricks to make it fast.
I think it might be somewhat misleading to look for similarities between HMC and MALA. They are similar in that they both employ flows that fix particular stationary distributions of interest, but while MALA pushes towards local modes, the Hamiltonian flow follows the constant-value contours of the Hamiltonian function, essentially pushing the probability mass around in a closed loop. The article @bbbales2 linked to does a pretty great job of illustrating this, although it doesn’t compare it to MALA.
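Here's a toy numerical check of my own that shows the closed loop, for a 1-D standard normal target where the exact Hamiltonian flow is a circle in (q, p) phase space:

```python
import numpy as np

# Leapfrog integration for H(q, p) = q^2/2 + p^2/2, i.e. a 1-D
# standard normal target with unit mass. The exact Hamiltonian
# flow is a circle in (q, p) phase space.
def leapfrog(q, p, eps, n_steps, grad_U):
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(q)  # half-step momentum
        q += eps * p                # full-step position
        p -= 0.5 * eps * grad_U(q)  # half-step momentum
    return q, p

grad_U = lambda q: q                        # U(q) = q^2/2
H = lambda q, p: 0.5 * q**2 + 0.5 * p**2    # total energy

q, p = 1.0, 0.0
print(H(q, p))                              # 0.5
q, p = leapfrog(q, p, eps=0.1, n_steps=63)  # ~one full loop (2*pi/0.1)
print(H(q, p), (q, p))                      # H ~ 0.5, (q, p) back near (1, 0)
```

The value of H is (nearly) conserved along the trajectory, so the sampler moves along a contour instead of climbing to the mode.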
I think this blog post and associated paper/comments might be helpful, but it quickly gets very… geometrical.
I’m pretty sure MALA is just HMC restricted to a single leapfrog step.
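You can check this directly: one leapfrog step of size $\epsilon$ with identity mass matrix and a freshly sampled momentum $p \sim N(0, I)$ gives

$$
p_{1/2} = p + \frac{\epsilon}{2} \nabla \log \pi(q), \qquad
q' = q + \epsilon\, p_{1/2} = q + \frac{\epsilon^2}{2} \nabla \log \pi(q) + \epsilon\, p,
$$

which is exactly the MALA proposal $q' = q + \frac{\epsilon^2}{2} \nabla \log \pi(q) + \epsilon\, \xi$ with $\xi \sim N(0, I)$.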
The potential is just the negative log density, so the gradient of the potential used to update the momentum also pulls HMC toward lower-potential, higher-log-density states. It's just balanced by the kinetic energy, which is why it follows the Hamiltonian flows. Betancourt's paper does a good job describing all this.
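Concretely, with the Euclidean-Gaussian kinetic energy the Hamiltonian splits as

$$
H(q, p) = U(q) + K(p), \qquad U(q) = -\log \pi(q), \qquad K(p) = \tfrac{1}{2}\, p^\top M^{-1} p,
$$

and Hamilton's equations read

$$
\frac{dq}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad
\frac{dp}{dt} = -\frac{\partial H}{\partial q} = \nabla \log \pi(q),
$$

so the momentum is pulled by the gradient of the log density while the kinetic energy carries the position along, trading potential for kinetic energy at constant H.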
HMC is a form of Metropolis, but NUTS, as used by default for Stan’s sampling, is not.
I agree that implementation-wise they are similar in that they both follow flows based on the gradient, but MALA follows an actual gradient flow (plus diffusion) which is different from a symplectic Hamiltonian flow. The latter is volume-preserving, for one, which is the reason for using a leapfrog integrator which I don’t even think is applicable for MALA. As a final argument, you wouldn’t get HMC by running MALA on the Hamiltonian.
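As a quick sanity check of the volume preservation (my own toy code): the Jacobian determinant of a leapfrog step is exactly 1, whatever the step size or the potential.

```python
import numpy as np

# One leapfrog step of the phase-space map (q, p) -> (q', p')
# for a Hamiltonian H(q, p) = U(q) + p^2/2.
def leapfrog_step(z, eps, grad_U):
    q, p = z
    p = p - 0.5 * eps * grad_U(q)  # half-step momentum
    q = q + eps * p                # full-step position
    p = p - 0.5 * eps * grad_U(q)  # half-step momentum
    return np.array([q, p])

grad_U = lambda q: q**3            # e.g. U(q) = q^4/4
z0, eps, h = np.array([1.0, 0.5]), 0.2, 1e-6

# Finite-difference Jacobian of the map at z0.
J = np.column_stack([
    (leapfrog_step(z0 + h * e, eps, grad_U)
     - leapfrog_step(z0 - h * e, eps, grad_U)) / (2 * h)
    for e in np.eye(2)
])
print(np.linalg.det(J))  # ~1.0: phase-space volume is preserved
```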
Running continuous-time Langevin dynamics has a well-defined asymptotic stationary distribution, so you are just kind of hoping that running the discretized dynamics for long enough will yield adequate proposals. Something similar seems to be behind HMC, although it's still a bit obscure to me, but the blog + article I linked to (by Betancourt) details that HMC leaves a family of distributions invariant.
This is correct. MALA exploits a Langevin diffusion on the same extended phase space that Hamiltonian methods use; it's just easier to hide the auxiliary parameters when you make one step at a time.
Hmm, I guess there must be something I am not getting. When you say you are running the diffusion on the extended space, I assume we’re using the Hamiltonian from HMC as our target potential to calculate the MALA steps? So moving a small step orthogonal to the contour, and then applying random diffusion, is equal to jumping to another contour via momentum resampling and then following the new contour?
Run HMC for one step: randomly sample momenta, then numerically evolve the Hamiltonian dynamics forwards for one step, then throw the momenta away. For an Euler numerical integrator this immediately recovers the Euler-Maruyama discretization of a Langevin diffusion.
If you add a Metropolis correction to the new state then you recover the Metropolis Adjusted Langevin Algorithm (MALA). This can be considered as a Metropolis correction on the full phase space or a Metropolis-Hastings correction on the position space alone.
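A minimal sketch of that recipe (my own code, assuming an identity-mass Gaussian kinetic energy and a single leapfrog step as the integrator, as in the one-leapfrog-step view above):

```python
import numpy as np

# One transition of "single-leapfrog-step HMC", which is the same
# kernel as MALA. log_pi / grad_log_pi are the target log density
# and its gradient.
def one_step_hmc(q, log_pi, grad_log_pi, eps, rng):
    p = rng.standard_normal(q.shape)        # sample fresh momenta
    H0 = -log_pi(q) + 0.5 * p @ p           # Hamiltonian at the current state

    # a single leapfrog step
    p_half = p + 0.5 * eps * grad_log_pi(q)
    q_new = q + eps * p_half
    p_new = p_half + 0.5 * eps * grad_log_pi(q_new)

    H1 = -log_pi(q_new) + 0.5 * p_new @ p_new
    # Metropolis correction on the joint (q, p) space
    if np.log(rng.uniform()) < H0 - H1:
        return q_new                        # accept; momenta are thrown away
    return q
```

Iterating this kernel is MALA; letting the leapfrog run for many steps before the correction turns it into standard HMC.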
In fact, all diffusions benefit from the dynamics perspective. The action of a Brownian motion is to sample a random element of the cotangent space at the current point, or to perturb an existing element. Given this element you have a point in phase space that you can then use to evolve the Hamiltonian dynamics. If you run for only an infinitesimal amount of time before sampling a new momentum you get a Langevin diffusion; if you run for longer you get Hamiltonian methods.