NUTS misses U-turns, runs in circles until max_treedepth

This is what keeps tripping me up.

In my experiments I wasn’t trying to do a proper u-turn check but only to detect the wraparounds. In that case compute_criterion(p_sharp_bck_fwd, p_sharp_fwd_bck, rho) is enough to guarantee that the usual no-u-turn check won’t miss the u-turn.

I agree the overhead is negligible either way.

I don’t think anyone can intuit this, which is why I’m careful to thoroughly check the resulting sampling behavior.

@nhuurre I am not sure if it helps but I think the criterion is not to check the angles of p_1 vs sum(p_1,..., p_n) or p_n vs sum(p_1,..., p_n) but to check the angles of p_1 vs sum(p_2,..., p_n) and pn vs sum(p1,..., p_{n-1}). My reason is q_n - q_1 ~ sum(p_2,..., p_n) * step_size, not q_n - q_1 ~ sum(p_1,..., p_n) * step_size. I use this criterion and do not observe your topic’s issue. Regardless, I hope that the discussions here will lead to a better criterion for Stan. :)

The sum is just a first order approximation to the integral. Omitting one momentum amounts to a difference that is proportional to the step size. So it’s still a first order approximation. I don’t have intuition which one is more accurate or if there even is a consistent difference. However, q_n-q_1 should be the same whichever direction you’re going so I think if you’re going to do that you should instead use the average of forward and backward sums which is

q_n - q_1 \approx \frac{1}{2}p_1 + \sum_{k=2}^{n-1}p_k + \frac{1}{2}p_n

Again, no idea if that \frac{1}{2}\left(p_1+p_2\right) difference improves the approximation. Maybe looking at how it works when the trajectory is just a single step will clarify this?

So we start at q_1 with momentum p_1 . The first momentum half-step takes us to

p^\prime = p_1 + \frac{1}{2}f_1

where f_1 is the force at q_1 and I have chosen units such that the stepsize is one. Next position full-step moves us to

q_2 = q_1 + p^\prime = q_1 + p_1 + \frac{1}{2}f_1

and finally the second momentum half-step completes the transition

p_2 = p^\prime + \frac{1}{2}f_2 = p_1 + \frac{f_1+f_2}{2}

Rearranging these gives

\begin{eqnarray*} p_2 - p_1 & = & \frac{f_1+f_2}{2}\\ q_2 - q_1 & = & p_1 + \frac{f_1}{2} = p_2 - \frac{f_2}{2}\\ & = & \frac{p_1+p_2}{2} + \frac{f_1 - f_2}{4}\\ \end{eqnarray*}

Actually, I now see that this can be summed over a trajectory of arbitrary length.

\begin{eqnarray*} q_{n}-q_{1} & = & \left(q_{n}-q_{n-1}\right)+\left(q_{n-1}-q_{n-2}\right)+\dots+\left(q_{2}-q_{1}\right) \\ & = & \frac{1}{2}p_{1}+\sum_{k=2}^{n-1}p_{k}+\frac{1}{2}p_{n}+\frac{f_{1}-f_{n}}{4} \end{eqnarray*}

Huh, that’s surprisingly simple. Can it really be this easy?

2 Likes

Yeah, I think that your approximation might be more accurate. But assume that there are only two points, then to me, computing p1.p2 to decide U-turn is better than computing p1.(p1+p2)/2 and p2.(p1+p2)/2. Anyway, using your approximation might also solve the topic issue but I haven’t tested it yet. The point I want to make here is there is an additional term, whether it is p1 or pn or your proposal (p1+pn)/2, in the approximation of current Stan criterion, which causes your issue.

The difference between the two is actually proportional to the square of the step size, which you can see if you restore the step sizes in your calculation.

But you have to be careful about what is the approximating what. The original no-u-turn condition, p_{N} \cdot (q_{N} - q_{0}) > 0, (-p_{0}) \cdot (q_{0} - q_{N}) > 0, is well-defined only on Euclidean manifolds and hence it’s particularly well-defined in terms of the Hamiltonian system itself. On the other hand the integral of the canonical form (sum over momenta) introduced in https://arxiv.org/abs/1304.1920 is well-defined with respect to the Hamiltonian system and doesn’t rely on the particular structure of a given Euclidean space.

The two are equal in the \epsilon \rightarrow 0 limit, but ultimately the original no-u-turn criterion is what approximates this sum generalized criterion for a finite step size integrator, and not the other way around.

One immediate benefit of this more general condition is that it is immediately symmetric between states, so the same sum can be used for both the forward and reverse checks. Removing the boundary momenta results in an asymmetric sum that will break the requisite symmetry is used naively. To ensure the right invariant distribution one would have to either keep all of the trajectory states around to compute the asymmetry sums on the fly or keep boundary states around to correct the total sum, which isn’t much different to what the current PR does.

The advantage of the current PR, and the current behavior in develop, is that it is more immediately connected to the proper theory and hence easier to demonstrate the desired invariant distribution. That connection to theory also helps to elucidate what effects the discrete trajectories are introducing and motivate principled understanding for how to manage them.

1 Like

The dot product becomes negative as soon as the angle between the vectors exceeds 90 degrees which is not a u-turn. To detect when the turn is near 180 degrees you need to compare the momenta not to each other but to some kind of a “pivot vector” that lies between them. I don’t think p1.(p1+p2)/2 is too strict. When sampling from a two dimensional IID normal Stan does limit about 15% of the transitions to a single leapfrog step so the current algorithm has no problem stopping at short u-turns. Limiting max_treedepth to 1 or even 2 seriously lowers effective sample size so I think even 15% is quite a lot.

If I restore the step sizes I get

\begin{eqnarray*} q_{n}-q_{1} & = & \frac{1}{2}\epsilon p_{1}+\sum_{k=2}^{n-1}\epsilon p_{k}+\frac{1}{2}\epsilon p_{n}+\epsilon^2\frac{f_{1}-f_{n}}{4}\\ & = & \sum_{k=1}^n\epsilon p_{k}-\epsilon\left(\frac{1}{2}p_{1}+\frac{1}{2}p_{n}\right)+\epsilon^2\frac{f_{1}-f_{n}}{4}\\ & = & \epsilon\sum_{k=1}^n p_{k}+O\left(\epsilon\right) \end{eqnarray*}

So no, the rho used by the u-turn criterion is a first order approximation to q_n-q_1 (or vice versa). It is the symmetrically shortened sum that is a second order approximation.

I can’t help but digress here. The above calculation suggests a nice visualization for the leapfrog trajectories. Positions q_k are represented by black dots in the obvious way. The force at position q_k is drawn as a vector that starts from q_k and has length \frac{1}{2}\epsilon^2 f_k . The momentum is then drawn as a vector \epsilon p_k placed such that its center coincides with the center of the corresponding force vector. It then happens that the momenta line up end-to-end and create a contiguous “spine” for the trajectory.

Circles are boring so I’ll use inverse square potential for the picture. (this potential is of course pathological due to the infinite peak in the middle and the improper flat tail but let’s not worry about those regions.)
Here is a trajectory for a two dimensional \frac{-1}{2\left|q\right|^2} potential starting at q=(1.5, 0) and p=(0, 0.66).

Symplectic geometry is more than just the position space. To see what the trajectory looks like in the momentum space move all the momenta to the origin while keeping the force vectors attached to their respective momenta. Now the forces create a contiguous path!

With all the relevant quantities in the same picture it’s possible to visualize different u-turn conditions. The original u-turn condition stops the sampler when the line connecting the first and last position makes an acute angle with at least one endpoint momentum.

uturn1

The condition in Stan 2.20 uses a line connecting the ends of the momentum spine.

uturn2

@fehiepsi proposed something like

uturn3

Indeed. I’m not even sure anymore if you can say it is more “complicated.” It has couple of variables more but their purpose is absolutely clear. Any proposed alternative needs a solid theoretical justification first.

4 Likes

@nhuurre your visualization is pretty nice! Thanks for making it! I don’t have much opinion on whether we approximate an integral by using starting points of each bin, ending points of each bin, or your proposal middle points of each bin. I just want to point out that there is an extra term in the approximation of the integral of current Stan implementation (which seems incorrect to me), and it might cause your topic issue because I didn’t observe the issue in my implementation.

@betanalpha I don’t know how complicated it is in Stan implementation, but in my implementation, I just compute (p_sum - p_left) * p_left and similar for the right point (I ignored mass matrix here for simplicity). For @nhuurre proposal, we just need to compute (p_sum - p_left/2 - p_right/2) * p_left and similar for the right point. No need to store any thing else, and these criterions are symmetric. Maybe I misunderstand your point?

Check out the pull request @betanalpha made for the details. I think Stan’s binary recursion means your change has to be implemented in similar fashion. What is your implementation anyway, can you share the code?

@nhuurre you can find the code here (NUTS is implemented in iterative fashion there but I think that the turning condition would be the same with recursive style). I made this gist to print out n_leapfrog but you can explore more if you want. The functional pseudo-random number generator mechanism allows the result be reproduced (even across multi-chains) in your system.

Thanks! I can confirm that NumPyro does not suffer from the problem. Based on some quick testing it looks like the shorter criterion does not impact effective sample sizes in two dimensions (contrary to my claim) and in general makes the chains somewhat less antithetic much like Stan#PR2800.

NumPyro stores p_left and p_right and then multiplies those by the inverse mass matrix every time it wants to check for a u-turn. Stan#develop first multiples ps by the inv-mass matrix and then stores only p_sharps. Stan#PR2800 stores ps (like NumPyro) but also has the precomputed p_sharps. I think all these checks are so fast compared to gradient calculation that arguing over micro-optimizations isn’t worth it. Unlike Stan, NumPyro’s shorter u-turn criterion does not need to be augmented by additional internal checks.

What I’ve had in mind is to keep Stan#develop’s u-turn criterion as it is and add an extra “convexity” check that looks like this:

convex

The green triangle is rho_fwd, rho_bck and rho.
The red line is orthogonal to rho and the criterion is that p_fwd_bkc must lie on the same side of the red line as rho_fwd. (and symmetrically for p_bck_fwd and rho_bck) I believe this is equivalent to

p_sharp_fwd_bck.dot(rho) > 0 && p_sharp_bck_fwd.dot(rho) > 0

The above trajectory continues growing but after the doubling the convexity condition triggers (just barely):

In this picture the u-turn criterion also triggers but notice that the convexity criterion doesn’t care if the ends of the spine overlap and thus can resolve the potential pathology.

1 Like

Hi all. Perhaps relevant is this paper by Joonha Park and Yves Atchadé: https://arxiv.org/abs/1907.06544 which states:
“We explore a general framework in Markov chain Monte Carlo (MCMC) sampling where sequential proposals are tried as a candidate for the next state of the Markov chain. . . . We develop two novel methods in which the trajectories leading to proposals in HMC are automatically tuned to avoid doubling back, as in the No-U-Turn sampler (NUTS). The numerical efficiency of these new methods compare favorably to the NUTS. We additionally show that the sequential-proposal bouncy particle sampler enables the constructed Markov chain to pass through regions of low target density and thus facilitates better mixing of the chain when the target density is multimodal.”

1 Like

@nhuurre if you’re feeling energetic there’s a problem with the U-turn stuff: https://github.com/stan-dev/stan/pull/2800 .

Either there’s problem with the new checks we’re proposing or there’s a bug in the code.

I could go for @fehiepsi 's shortened checks too, but I’d still like to figure out the problem with these additional checks as posed (cause it sure seems like they should work from my understanding of the problem).

2 Likes

I can help testing the checks in numpyro too (will do it this weekend). To be more precise about the issue I wanted to point out, we can take a look at @betanalpha paper, section A.4.2. We have

rho_t = integral_t=0->T p(t) dt

then it is approximated by

rho_t = sum_t=0->T p(t) dt

at the last formula. But using Riemann sums means that the sum should be taken from t=0->T-1 or from t=1->T (or even better: to go with @nhuurre proposal). That’s my reasoning from theoretical point of view. For T large, this is not a problem but for T small, an incorrect approximation might lead to a big problem.

@nhuurre @fehiepsi, looks like @betanalpha found the issue, so ignore me :D.

Thanks for the offering to reimplement it though. I don’t think it’s necessary now. Esp. since you’re already solving the problem with less code haha.

Unfortunately this is not the relevant theory. Riemann sums are first-order quadrature techniques that are applicable when integrating functions over a real line but that is not the circumstance here. Here we are integration a function along the evolution of a dynamical system, that evolution being approximated by a symplectic integrator. When using a symplectic integrator the best approximation for one of these temporal integrals is integrating over all of the states in the trajectory.

Didn’t we already go over this? My conclusion above was that

q_n - q_1 = \epsilon\left(\sum_{i=1}^{n} p_i\right) - \epsilon\left( \frac{p_1+p_n}{2} \right) + O\left(\epsilon^2\right)

Assuming trivial metric we have

\rho=\int_0^T p dt = \int_0^T \dot{q} dt = q_n - q_1 + O\left(\epsilon^2\right)

where last equality follows because the Störmer-Verlet leapfrog integrator is second order. The correct second-order approximation is not \rho \approx \sum_{i=1}^n p_i but rather

\rho \approx \frac{1}{2}p_1 + \left(\sum_{i=2}^{n-1} p_i\right) + \frac{1}{2}p_n

@betanalpha I am with @nhuurre on this. Your paper https://arxiv.org/abs/1304.1920 mentioned that In practice, the integral can be approximated by aggregating the momenta along discrete time intervals, tk−t{k-1}=δtk, such as those naturally introduced in any discrete integrator. That statement matches my understanding of the situation. Now I read your comment as: there will be an extra term in computing an integral when using symplectic integrator. I think I lack background here (I couldn’t figure the math out, maybe something like biased/unbiased estimation in computing variance is happening under the scene) so I leave the justification of current implementation to other people.

Updated performance tests with the recent bug fix show slightly different behavior but the general story is the same – the modification stabilizes the numerical trajectory lengths and improves effective sample size per leapfrog step for all of the second moments and most of the first moments.

extra_checks_report2.pdf (488.5 KB)

1 Like

Let me establish that the computation of \rho is not in question here, and q_{N} - q_{1} and is not relevant to discussion at all.

The current sampler is not trying to estimate or otherwise compute q_{N} - q_{1} for exact trajectories or numerical trajectories. The difference in positions appeared in the original No-U-Turn sampler as a heuristic based on the special geometry of the real numbers. The difference between the end points of exact trajectories doesn’t generalize beyond the real numbers. Moreover the difference between the end points of a numerical trajectory isn’t well-defined because of its dependence on a particular discretization.

The integrated momenta,

\rho = \int_{0}^{T} \mathrm{d} t \, p(z(t)),

however, is a proper geometric object. Take care as this is not a regular integral, both because of the implicit z(t) dependence but also because p and \rho are not real numbers but rather more general geometric objects. On Euclidean manifolds, where the No-U-Turn heuristic is well-defined, and exact trajectories these two definitions give the same termination criteria. The integrated momenta, however, generalizes to any manifold and hence is better suited for Hamiltonian Monte Carlo. In other words the original No-U-Turn criterion always approximated the new one, not the other way around.

For any function over phase space the natural estimator of its temporal expectation is the empirical average of evaluations over each discrete state,

\int_{0}^{T} \mathrm{d} t \, f(z(t)) = \frac{1}{N} \sum_{n = 1}^{N} f(z_{n}).

These estimators are unbiased as N \rightarrow \infty and enjoy nice convergence properties for finite N due to the nature of symplectic integrators.

To approximate the integrated momenta we just utilize the usual estimators for temporal expectations giving the current algorithm. For the very technically minded when computing \rho in practice we compute its components in a coordinate system, which are given by D functions defined as the compositions of the canonical momenta operator p: T^{*}Q \rightarrow T_{q}^{*} Q with local cotangent space coordinate functions, p_{i} : T_{q}^{*} Q \rightarrow \mathbb{R}^{i}.

The difference between the estimator for \rho and the value of q_{N} - q_{0} for a numerical trajectory is irrelevant because we are not trying to estimate the distance between the end points anymore. The difference between q_{N} - q_{0} for exact and numerical trajectories also doesn’t matter, nor does the difference between \rho or estimates of \rho and those two Euclidean distances. The distances are an artifact of a the original heuristic from which we have long since moved passed.

The current dynamic termination criterion is based on the geometric calculation p^{\sharp} \cdot \rho, and we approximate that second term using a numerical trajectory generated by a symplectic integrator per standard practice.

2 Likes