I wasn’t sure which category to tag this with, so feel free to move this to another category.

I’ve been wondering about a few things regarding how Stan works. In particular, I’m curious what the motivation is behind using HMC, and also why Hamilton’s equations are solved forward in time as opposed to backwards.

Also, I’m curious why the leapfrog integrator is used to solve the system. I’m aware that this integrator is time-reversible and also preserves energy, but I don’t understand why these properties are important for solving Hamilton’s equations.

I’m not an expert on this, so maybe some others can jump in with a bite-sized response. I’d recommend reading through https://arxiv.org/pdf/1701.02434.pdf, which probably has the answers you’re looking for.

The paper linked by @js592 is definitely the definitive place to start!

But super quickly in response to your questions:

Stan integrates both forwards and backwards, choosing the direction at random at each expansion of the tree.

Preserving the energy is important in part because it keeps the sampler from flying out of the bulk of the posterior mass into regions of negligible probability density. Additionally, divergences in the trajectories provide an extremely important diagnostic for unreliable computation. See Diagnosing Biased Inference with Divergences

If I may ask a follow-up question: why does Stan integrate both forwards and backwards in that way? From what I understand, the leapfrog integrator is time-reversible, and Hamilton’s equations are symmetric under time reversal. Hence, what impact does the choice of integrating forwards vs. backwards have?

Due to the dynamic stopping criterion, the actual transitions themselves are not guaranteed to be time reversible in dynamic HMC unless the tree is grown in a particular way. For more, see https://arxiv.org/pdf/1111.4246.pdf

Just wanted to add some additional context and intuition. I can provide technical references if there’s interest – just ask.

Integrating Hamilton’s equations from an initial state either forwards or backwards in time generates equally useful exploration. Integrating forward in time is the default simply because it avoids a minus sign.

That said, no matter the direction of time, a Hamiltonian trajectory on its own will not define a valid Markov transition. The problem is that the deterministic update is too singular in some sense. Indeed, deterministic updates define valid Markov transitions only if they are involutions, so that applying them twice returns the initial point. Actual Hamiltonian Markov transitions are a bit more sophisticated than this, but at some level they all have to use both forward and backward integration times (or, equivalently, flipped momenta when the kinetic energy is an even function) for similar reasons.
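To make the involution point concrete, here’s a minimal Python sketch (nothing to do with Stan’s actual implementation) of the leapfrog integrator for a standard Gaussian target, H(q, p) = q²/2 + p²/2: integrating, flipping the momentum, and integrating again with the same step size returns you to the initial point, up to floating-point roundoff.

```python
# Toy example: leapfrog for H(q, p) = q**2 / 2 + p**2 / 2, so grad U(q) = q.
def leapfrog(q, p, eps, n_steps):
    p -= 0.5 * eps * q            # initial half step for the momentum
    for _ in range(n_steps - 1):
        q += eps * p              # full position step
        p -= eps * q              # full momentum step
    q += eps * p                  # final position step
    p -= 0.5 * eps * q            # final half step for the momentum
    return q, p

q0, p0 = 1.3, -0.7
q1, p1 = leapfrog(q0, p0, eps=0.1, n_steps=25)
# Flip the momentum and integrate again with the same step size:
# we land back on the initial point, so "integrate, then flip" is
# an involution -- its own inverse.
q2, p2 = leapfrog(q1, -p1, eps=0.1, n_steps=25)
print(abs(q2 - q0) < 1e-10, abs(-p2 - p0) < 1e-10)  # True True
```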

In practice we can’t solve Hamilton’s equations analytically and instead have to approximate solutions numerically. Most numerical integrators suffer from drift that pulls the numerical trajectory away from the exact trajectory. Even worse, this drift grows incredibly fast with dimension, so drifting numerical integrators can be run for only very short times before they diverge too far away from the meaningful regions of the ambient space. For example, if they were used to generate Metropolis proposals then the acceptance probability would rapidly decay with integration time and we wouldn’t move very far.
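As a toy illustration of that drift, take a one-dimensional harmonic oscillator, H(q, p) = (q² + p²)/2, whose exact trajectories are circles of constant energy. Explicit Euler (a stand-in here for a generic non-symplectic integrator) multiplies the energy by exactly (1 + eps²) at every step, so the error compounds exponentially and any Metropolis acceptance probability of the form exp(−ΔH) collapses to zero.

```python
# Harmonic oscillator H(q, p) = (q**2 + p**2) / 2; exact trajectories
# are circles. Explicit Euler spirals outward: one step multiplies
# q**2 + p**2 by exactly (1 + eps**2), so the drift is exponential.
def euler_step(q, p, eps):
    return q + eps * p, p - eps * q

eps = 0.1
q, p = 1.0, 0.0
H0 = 0.5 * (q * q + p * p)
for _ in range(1000):
    q, p = euler_step(q, p, eps)
H = 0.5 * (q * q + p * p)
print(H / H0)  # (1 + eps**2)**1000, roughly 2e4 -- the energy has blown up
```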

We can also think about the error introduced by numerical integrators as warping the invariant distribution away from the desired target distribution. To correct for this error we need to quantify the strength of this warping; mathematically this requires evaluating something called a Radon-Nikodym derivative, or sometimes simply the “Jacobian”, of the numerical evolution. For example, that object is needed to construct well-behaved Metropolis-Hastings acceptance probabilities. Unfortunately, for most numerical integrators this quantity is really annoying to evaluate, especially when the warping is strong.

The leapfrog integrator is an example of a symplectic integrator, and symplectic integrators are outright mathematical magic. When they are stable, symplectic integrators stay confined very closely to the same Hamiltonian level set to which the exact trajectory is confined, and hence explore similar neighborhoods even after very long integration times. Moreover, the warping that does exist can be completely characterized by straightforward evaluations of the Hamiltonian function, making their use in Metropolis and similar methods undemanding. Unstable symplectic integrators behave completely differently, which makes instabilities straightforward to diagnose empirically – these are the “divergences” that @jsocolar mentioned.
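To see that level-set confinement in the same toy setting (a 1-D harmonic oscillator, not anything Stan-specific), here is leapfrog run for a hundred thousand steps: the energy error stays bounded at O(eps²) rather than drifting, so a Metropolis correction exp(min(0, −ΔH)) would accept essentially every proposal.

```python
# Harmonic oscillator H(q, p) = (q**2 + p**2) / 2 again, now integrated
# with leapfrog. The energy error oscillates but stays O(eps**2) no
# matter how long we integrate -- it never drifts away.
def leapfrog(q, p, eps, n_steps):
    p -= 0.5 * eps * q            # initial half momentum step
    for _ in range(n_steps - 1):
        q += eps * p
        p -= eps * q
    q += eps * p
    p -= 0.5 * eps * q            # final half momentum step
    return q, p

eps = 0.1
q, p = 1.0, 0.0
H0 = 0.5 * (q * q + p * p)
q, p = leapfrog(q, p, eps, 100_000)    # a very long trajectory
dH = 0.5 * (q * q + p * p) - H0
print(abs(dH))  # bounded at O(eps**2) even after 100,000 steps
```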

The analysis of realistic Hamiltonian Markov transitions built from numerical Hamiltonian trajectories is a lot easier when we think not about deterministically integrating from an initial point but rather about randomly sampling a numerical trajectory that contains the initial point. For a fixed integration time, each possible trajectory can be considered as integrating forward and backward for a different number of steps. See, for example, the appendix in [1701.02434] A Conceptual Introduction to Hamiltonian Monte Carlo.