NUTS misses U-turns, runs in circles until max_treedepth

Let me establish that the computation of \rho is not in question here, and that q_{N} - q_{1} is not relevant to the discussion at all.

The current sampler is not trying to estimate or otherwise compute q_{N} - q_{1} for exact trajectories or numerical trajectories. The difference in positions appeared in the original No-U-Turn sampler as a heuristic based on the special geometry of the real numbers. The difference between the end points of exact trajectories doesn’t generalize beyond the real numbers. Moreover the difference between the end points of a numerical trajectory isn’t well-defined because of its dependence on a particular discretization.

The integrated momenta,

\rho = \int_{0}^{T} \mathrm{d} t \, p(z(t)),

however, is a proper geometric object. Take care as this is not a regular integral, both because of the implicit z(t) dependence and because p and \rho are not real numbers but rather more general geometric objects. On Euclidean manifolds, where the No-U-Turn heuristic is well-defined, and for exact trajectories, the two definitions give the same termination criterion. The integrated momenta, however, generalizes to any manifold and hence is better suited for Hamiltonian Monte Carlo. In other words the original No-U-Turn criterion always approximated the new one, not the other way around.

For any function over phase space the natural estimator of its temporal expectation is the empirical average of evaluations over each discrete state,

\int_{0}^{T} \mathrm{d} t \, f(z(t)) = \frac{1}{N} \sum_{n = 1}^{N} f(z_{n}).

These estimators are unbiased as N \rightarrow \infty and enjoy nice convergence properties for finite N due to the nature of symplectic integrators.
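As a concrete illustration, here is a minimal Python/NumPy sketch (not Stan code; the `leapfrog` helper and the standard-Gaussian target are assumed purely for the example) of this estimator applied to f(z) = p:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    # Returns every state visited, including the initial one.
    states = [(q.copy(), p.copy())]
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_U(q)   # half step in momentum
        q = q + eps * p                 # full step in position
        p = p - 0.5 * eps * grad_U(q)   # half step in momentum
        states.append((q.copy(), p.copy()))
    return states

grad_U = lambda q: q                    # standard Gaussian target: U(q) = q.q / 2

rng = np.random.default_rng(0)
q0, p0 = rng.normal(size=2), rng.normal(size=2)
states = leapfrog(q0, p0, grad_U, eps=0.1, n_steps=50)

# Empirical average of f over the discrete states; with f(z) = p this
# gives the estimator of the integrated momenta rho (up to normalization).
rho_hat = np.mean([p for _, p in states], axis=0)
print(rho_hat)
```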

To approximate the integrated momenta we just use the usual estimators for temporal expectations, giving the current algorithm. For the very technically minded: when computing \rho in practice we compute its components in a coordinate system, which are given by D functions defined as the compositions of the canonical momenta operator p: T^{*}Q \rightarrow T_{q}^{*} Q with the local cotangent space coordinate functions, p_{i} : T_{q}^{*} Q \rightarrow \mathbb{R}.

The difference between the estimator for \rho and the value of q_{N} - q_{0} for a numerical trajectory is irrelevant because we are not trying to estimate the distance between the end points anymore. The difference between q_{N} - q_{0} for exact and numerical trajectories also doesn’t matter, nor does the difference between \rho or estimates of \rho and those two Euclidean distances. The distances are an artifact of the original heuristic, which we have long since moved past.

The current dynamic termination criterion is based on the geometric calculation p^{\sharp} \cdot \rho, and we approximate that second term using a numerical trajectory generated by a symplectic integrator per standard practice.
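For the Euclidean case with an identity metric this works out to dot products of \rho with the momenta at the two ends of the trajectory. A hedged sketch, with a hypothetical `no_uturn` helper rather than Stan's actual interface:

```python
import numpy as np

def no_uturn(rho, p_minus, p_plus, inv_metric=None):
    # Keep expanding the trajectory while the momenta at both ends still
    # point "along" the integrated momenta rho.  With an identity metric
    # p-sharp is just p itself; inv_metric is an assumed hook for the
    # general Euclidean case, where p-sharp = inv_metric @ p.
    if inv_metric is not None:
        p_minus, p_plus = inv_metric @ p_minus, inv_metric @ p_plus
    return np.dot(p_plus, rho) > 0 and np.dot(p_minus, rho) > 0
```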


Thanks so much for a very detailed answer! I understand the integral in your equation as

\frac{1}{T} \int_{0}^{T} \mathrm{d} t \, f(z(t)) = \frac{1}{N} \sum_{n = 1}^{N} f(z_{n})

while I thought that the current implementation uses

\frac{1}{T}\int_{0}^{T} \mathrm{d} t \, f(z(t)) = \frac{1}{N} \sum_{n = 0}^{N} f(z_{n})

where f(z_0) = f(z(0)) and f(z_N) = f(z(T)). If Stan implements rho using the former, which is

\sum_{n = 1}^{N} f(z_{n})

(instead of \sum_{n = 0}^{N} f(z_{n})), then that is correct and my assumption about the issue in the current implementation is wrong. Indeed, the formula

\sum_{n = 1}^{N} f(z_{n})

is implemented in numpyro as

\sum_{n = 0}^{N} f(z_{n}) - f(z_0).

In summary, I think that the definition of rho in Stan is a bit different from rho in numpyro (which is \sum_{n = 0}^{N} f(z_{n})). Maybe Stan works in float64 while the test I did in numpyro was in float32 (though I still can’t figure out how to get symmetry with the former formula).

Stan sums rho from all states in each numerical trajectory, including the initial point. In particular, once a trajectory has been built the initial point has no exceptional status and is incorporated in the same way as any other point. This is necessary for the symmetry of the checks, which have to consider not only evolving forwards from the initial point to the final point but also evolving backwards from the final point to the initial point.

If a numerical trajectory includes states \{z_{0}, z_{1}, z_{2}, z_{3}\} then rho will sum over all four of the momenta associated with those states, \rho = p_{0} + p_{1} + p_{2} + p_{3}. This is the proper estimator of \rho.
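For instance, with made-up two-dimensional momenta this is nothing more than:

```python
import numpy as np

# Momenta attached to the four states {z_0, z_1, z_2, z_3} of a
# hypothetical numerical trajectory.
p0, p1, p2, p3 = (np.array([1.0, 0.2]), np.array([0.8, 0.1]),
                  np.array([0.5, -0.1]), np.array([0.1, -0.3]))

# rho sums the momenta of *all* states, the initial point included;
# no state gets special treatment once the trajectory is built.
rho = p0 + p1 + p2 + p3
print(rho)
```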

The issue in this thread does not concern how \rho is calculated – it concerns the gap between the old and new subtrajectories that form each expanded trajectory. Alternative checks, such as the one you use, can interact very differently with this gap, but only at the expense of losing the connection to what the numerical checks are meant to be approximating and of propagating unexpected behavior elsewhere. The resolution of the issue is adding additional checks that cross the gap without compromising the core of the algorithm, which has been demonstrated with thorough empirical tests.


Understood. I also hope that the new checks will be helpful. I will stop the discussion about whether the integral \int_0^T dt f(z(t)) is computed correctly here, because it is not suitable for this thread, it would take extra effort to discuss, explain, and verify the theory, and the current tests already show that the new checks are good enough. Anyway, I think the discussion above will be beneficial to someone who loves the math of the HMC papers but is skeptical about how it is interpreted. I personally still don’t understand why the sum of all momenta, which includes the two boundary ones, is a proper estimate of \rho, when by definition \rho is an integral with respect to time t, which is approximated by a formula that does not include both boundary momenta. But that is off-topic, so I hope that one day I will figure it out.


(I really should stop fighting with @betanalpha, especially since @fehiepsi has already made peace with him, but here we go, one more time…)

The integrator is ostensibly accurate to second order in the step size but the computation in Stan is only first-order accurate. NumPyro uses an estimate that is also first order but has the error in the opposite direction, and apparently NumPyro avoids the issue at hand without any extra checks. Unlike @fehiepsi I don’t think there’s any deep reason why NumPyro’s integration is “better”. I think it just makes the window for looping narrower and consequently the tests miss it.

I know. I’m just using the trivial metric because it makes the math easier. An approximation that fails in the trivial case cannot work in the general case either.

Yes, and the accuracy of that approximation is O(\epsilon^2) if both are implemented optimally, but only O(\epsilon) as it currently stands.

Here’s the simplest proof I can think of. We obviously have

\int_{0}^{T_1} \mathrm{d} t \, p + \int_{T_1}^{T_2} \mathrm{d} t \, p = \int_{0}^{T_2} \mathrm{d} t \, p

But the estimators are

\sum_{n = 1}^{M} p_n + \sum_{n = M}^{N} p_n = p_M + \sum_{n = 1}^{N} p_n

The state in the middle is counted twice if the integration is done in two pieces. The estimator is inconsistent with itself and the error this causes is one step. Stan accumulates the values as \rho = \rho_- + \rho_+ but that works only because the integrator moves one step before starting the next tree and we have

\rho_+ = \int_{T_1+\epsilon}^{T_2} \mathrm{d} t \, p \neq \int_{T_1}^{T_2} \mathrm{d} t \, p
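A small NumPy sketch of this bookkeeping, with arbitrary momenta standing in for a real trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.normal(size=(9, 2))        # momenta at states z_0 ... z_8
M = 4                              # split point

# Summing the two pieces with the shared state z_M in both of them
# counts p_M twice relative to the single-piece sum ...
two_piece = p[:M + 1].sum(axis=0) + p[M:].sum(axis=0)
whole = p.sum(axis=0)
print(np.allclose(two_piece, whole + p[M]))     # True

# ... whereas starting the second piece one state later, as happens when
# the integrator takes a step before the next subtree is built, keeps the
# pieces disjoint and consistent with the single-piece sum.
disjoint = p[:M + 1].sum(axis=0) + p[M + 1:].sum(axis=0)
print(np.allclose(disjoint, whole))             # True
```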

(You have an integral on the left-hand side but an average on the right-hand side. I think you meant \frac{T}{N} instead of \frac{1}{N}.) For any function, you say? Let’s consider f(z) = 1:

\int_{0}^{T} \mathrm{d} t = \sum_{n = 1}^{N} f(z_{n})\epsilon = \sum_{n = 1}^{N} \epsilon = N\epsilon

Now, what’s the relation between T and N\epsilon? Let’s be careful not to make the classic “fencepost error”. N is the number of states but the number of steps is one less. So in fact \epsilon = \frac{T}{N-1} and therefore N\epsilon = T + \epsilon. This is not the best possible estimator. Your “standard practice” is wrong. The initial state and the final state of the trajectory should be counted with only half-weight.
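The fencepost count is easy to check numerically; a trivial sketch using the same convention (N states joined by N - 1 steps of size \epsilon = T / (N - 1)):

```python
# N states joined by N - 1 steps of size eps.
N, T = 11, 1.0
eps = T / (N - 1)

riemann = eps * N                        # all N states with full weight: T + eps
trapezoid = eps * (0.5 + (N - 2) + 0.5)  # endpoints half-weighted: exactly T
print(riemann, trapezoid, T)
```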

I now see that others have consistently called the initial state’s momentum p_0 while I chose the more confusing name p_1. Of course the initial state should be called state zero, and state one is the state after the first step.


Also, I’m embarrassed that I overlooked something important in the previous post.

Ditto. I really appreciate @betanalpha taking the time to participate in this fascinating discussion.


I’m going to try one more time and then bow out, as the issue originally introduced in this thread has been resolved and I’m traveling for the next week.

I should be more careful and note that what we’re computing here are time averages, which just requires adding a normalization constant to the integral,

\left< f \right> = \frac{1}{T} \int_{0}^{T} \mathrm{d} t \, f(z(t))

Because the normalization is just a constant and doesn’t affect the termination criterion we ignore it.

When estimating these averages we compute empirical averages over states,

\hat{f} = \frac{1}{| \mathfrak{t} | } \sum_{z_n \in \mathfrak{t} } f(z_n),

where \mathfrak{t} is the set of states in a discrete trajectory evolved forwards T / \epsilon steps.

The subtlety here is that the states in a discrete trajectory don’t have well defined “times” because of the approximation inherent in the numerical evolution. These discrete states aren’t points on the exact trajectory with approximate times; they’re points on an entirely different trajectory that approximates the true one. One can see this with a symplectic integrator applied to a harmonic oscillator: the harmonic oscillator has a well-defined period, but step sizes that divide into that period do not give periodic numerical trajectories!
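A quick way to see this (a small standalone Python sketch, not part of any sampler) is to run leapfrog on the harmonic oscillator H = (q^2 + p^2) / 2, whose exact period is 2\pi, with a step size that divides that period exactly:

```python
import numpy as np

def leapfrog_step(q, p, eps):
    p = p - 0.5 * eps * q    # dU/dq = q for the harmonic oscillator
    q = q + eps * p
    p = p - 0.5 * eps * q
    return q, p

q, p = 1.0, 0.0
period = 2.0 * np.pi
eps = period / 64            # step size dividing the exact period evenly

for _ in range(64):
    q, p = leapfrog_step(q, p, eps)

# The numerical trajectory does not close up after one exact period: it is
# a (near-)periodic orbit of a slightly different "modified" system with a
# slightly different period, so the return error is small but nonzero.
print(q - 1.0, p - 0.0)
```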

For those interested this is formalized in the theory of backward error analysis. The book “Geometric Numerical Integration” goes into extensive detail.

One consequence of this is that one has to be careful when talking about the consistency of averages over trajectory segments. One can’t split up the exact trajectory and then try to approximate each time segment with separate numerical trajectories, because you don’t know where to start any of those segments except the first one. Instead one has to construct a numerical trajectory spanning the full integration time and then split those numerical states into segments.

Indeed this is how \rho is computed in the current Stan sampler. The current numerical trajectory is recursively split into binary segments and a \rho is estimated for each of those segmented numerical trajectories.

This has the very important feature that the empirical averages are symmetric in time just like the true averages. Because the evolution is reversible the time averages satisfy

\frac{1}{T} \int_{0}^{T} \mathrm{d} t \, f(z(t)) = \frac{1}{T} \int_{T}^{0} \mathrm{d} t \, f(z(t)),

and with the standard dynamic estimator we get

\hat{f} = \frac{1}{| \mathfrak{t} | } \sum_{z_n \in \mathfrak{t} } f(z_n) = \frac{1}{| - \mathfrak{t} | } \sum_{z_n \in -\mathfrak{t} } f(z_n)

simply because \mathfrak{t} = - \mathfrak{t}. In particular this symmetry allows us to prove that the dynamic sampler exactly preserves the target distribution (up to floating point considerations) even when using the numerical integrator.
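This symmetry is almost trivial to verify in code (a toy sketch with placeholder states): the forward and backward trajectories contain exactly the same set of states, so the empirical average cannot depend on the direction.

```python
import numpy as np

rng = np.random.default_rng(2)
states = [rng.normal(size=2) for _ in range(8)]   # stand-ins for z_0, ..., z_7
f = lambda z: z                                   # any phase-space function

forward = np.mean([f(z) for z in states], axis=0)
backward = np.mean([f(z) for z in reversed(states)], axis=0)
print(np.allclose(forward, backward))             # True: same set of states
```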

The ultimate source of the behavior originally pointed out in this thread is that we compute the termination criterion only between numerical segments. Because we don’t have any states between the left and right subtrajectories there’s no way for us to “interpolate” checks in that gap, and sometimes we miss the U-turn. The fix is to check not just the left and right subtrajectories internally but also to make a check across subtrajectories, with one additional state in them to span that gap.
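Schematically, and only as a reading of the description above rather than Stan's actual code (identity metric, made-up momenta, and a hypothetical `uturn` helper): if a trajectory with states 0..7 is built from a left subtrajectory 0..3 and a right subtrajectory 4..7, the extra checks also evaluate the criterion on spans extended by one state across the gap, e.g. 0..4 and 3..7.

```python
import numpy as np

def uturn(p_states, i, j):
    # U-turn check over the span of states i..j (inclusive), identity metric:
    # rho is the sum of the momenta in the span and the criterion is
    # evaluated at both ends of that span.
    rho = p_states[i:j + 1].sum(axis=0)
    return not (np.dot(p_states[j], rho) > 0 and np.dot(p_states[i], rho) > 0)

# Made-up momenta for eight states, just to exercise the function.
p_states = np.array([[1.0, 0.0], [0.9, 0.3], [0.5, 0.6], [0.0, 0.8],
                     [-0.5, 0.6], [-0.9, 0.3], [-1.0, 0.0], [-0.9, -0.3]])

checks = [uturn(p_states, 0, 3),   # within the left subtrajectory
          uturn(p_states, 4, 7),   # within the right subtrajectory
          uturn(p_states, 0, 7),   # across the whole trajectory
          uturn(p_states, 0, 4),   # left subtrajectory plus one extra state
          uturn(p_states, 3, 7)]   # right subtrajectory plus one extra state
print(checks)
```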

One last point to keep in mind. Because the numerical trajectory only approximates the exact trajectory, the Euclidean distance between the initial and final states in a numerical trajectory is only an estimator of the Euclidean distance between the end points of the exact trajectory. In the Euclidean case \rho is not estimating the distance between the numerical endpoints but rather the distance between the exact endpoints. In other words \mathcal{O}(\epsilon) errors between the two are entirely expected! There’s no reason to correct \rho to the distance between the numerical endpoints because that’s not what the estimator is trying to calculate.
