Definition of the energy__ term

I’ve been reading @betanalpha’s publication (https://arxiv.org/abs/1604.00695) on diagnosing inefficient exploration of the marginal energy distribution and would like to ensure I understand its implications in the context of Stan’s implementation.

Let’s assume we use a Euclidean-Gaussian HMC sampler with potential U(q)=-\log P(q), where q is the D-dimensional parameter vector and P(q) is the target density. Because we use a Euclidean-Gaussian sampler, the momentum resampling step generates kinetic energy terms T(p)|q\sim \frac12 \chi^2(D), where p is the vector of momenta. The total energy is E(q,p)=U(q)+T(p).

So far so good; let’s consider a particular example as shown in the figure below: the sampler starts with a given parameter and freshly sampled momentum at (1), evolves the state under the Hamiltonian flow to end up at (2), resamples the momentum to obtain (3), followed by another evolution step to end up at (4).

We know that T(p_1) and T(p_3) follow the \frac 12\chi^2(D) distribution discussed above by construction. However, the kinetic energy term at (2) (and similarly for (4)) is given by

\begin{align} T(p_2)&=E(q_1,p_1) - U(q_2)\\ &= U(q_1) + T(p_1) - U(f_q(q_1, p_1)), \end{align}

where f_q(q_1, p_1) is the evolution of q under the flow; the distribution of T(p_2) seems non-trivial and depends on q_1.

Stan only provides one energy__ value for each sample (which is fine because energy is conserved), but it is not obviously clear to me what the corresponding state is. I’ve run some toy models and evaluated kinetic_energy = energy__ + lp__ as provided by Stan. Indeed, the kinetic energy terms follow the \chi^2 distribution discussed above which leads me to the conclusion that Stan reports the state after momentum resampling but before evolution (i.e. states (1) or (2) rather than (3) or (4)). Is that interpretation correct?

The relevant code is here

The energy__ is calculated after evolution and before momentum resampling.

Keep in mind that the Hamiltonian flow preserves not only P(q) but also the entire phase space target distribution. That means T(p_2)|q_2 (conditioning on q_2, integrating out q_1,p_1) still follows \chi^2 distribution assuming q_1 is drawn from P(q). That is, once the Markov chain has converged the distribution of the evolved momentum equals the resampling distribution.

Recall how each Hamiltonian transition evolves:

  1. we start with an initial position, q0.
  2. an initial momenta p0 is sampled from the conditional density pi(p | q0).
  3. a numerical Hamiltonian trajectory expands around the initial point (p0, q0) to give the points (qn, pn).
  4. a final state is sampled from the entire numerical trajectory with the relative probabilities P[q_n, p_n] \propto \exp(-H(q_n, p_n)).
  5. The energy diagnostic is saved E_n = H(q_n, p_n).
  6. The state is projected back down to the unconstrained parameter space (qn, pn) -> qn
  7. The new state is saved, x_n = \text{constrain}(q_n).

Consequently the energy__ variable is precisely the value of the Hamiltonian function at the state sampled for the given iteration. It is meant to probe how well the sampler is exploring between Hamiltonian level sets and nothing about the variation of the kinetic or potential energy individually.

One has to be careful with the expected behavior of the individual kinetic and potential energy terms both within a transition and between transitions. Within a transition the initial kinetic energy follows a \chi^2 but as the state evolves forward and backwards in time the kinetic energy mixes with the potential energy in generally complex ways. For Euclidean-Gaussian disintegrations/kinetic energy functions the distribution of kinetic energy along the trajectory will also follow a chi^{2} provided that the sampler is in equilibrium.

Because the total Hamiltonian is conserved, the variation in the potential energy will also follow a \chi^{2} once in equilibrium. Incidentally this is why Hamiltonian Monte Carlo struggles on funnel like target densities – the typical set spans large changes in potential energy but the change of potential energy within each trajectory is bounded by these equilibrium behaviors.

Between trajectories one has to consider both the distribution of the old momenta that is thrown away from the previous state and the new momenta that is freshly sampled. In equilibrium the discarded kinetic energy will follow a \chi^{2} as above and the difference in kinetic energies will follow a difference of chi^{2}, which itself is approximately chi^{2}. Out of equilibrium, however, the variation in the discarded kinetic energy will be much more complicated as the sampler sheds energy on its way into the typical set.

I advise a bit of precision in the language here. If (q_{2}, p_{2}) is a point sampled uniformly from a sufficiently long Hamiltonian trajectory that includes (q_{1}, p_{1}) then yes T(p_{2}) will follow a \chi^{2} in equilibrium. If (q_{2}, p_{2}) is defined as a fixed evolution from (q_{1}, p_{1}), however, then the conditional behavior of T(p_{2}) given (q_{1}, p_{1}) will be much more complicated, even in equilibrium. The behavior of T(p_{2}) given q_{1} after marginalizing out the resampled p_{1} will tend towards a \chi^{2} in both cases.