NUTS misses U-turns, runs in circles until max_treedepth

I have a nice test harness in place and have results for a reasonably optimal transition that adds in the red checks below.

The results are consistent with expectations, but sadly are not a complete improvement and so will require making some nontrivial choices.
extra_checks_report.pdf (307.7 KB)

For the IID normal model we see the “resonant” behavior where the trajectory misses the window in which the no-u-turn criterion turns on and gets stuck precessing around the level set for a long time, until it gets lucky enough to terminate. The variant code adds the aforementioned checks, which indeed prevent this problem and significantly stabilize the performance.

Unfortunately these additional checks have adverse effects on linearly correlated models. For these models the no-u-turn criterion is already too aggressive, terminating at high-frequency wiggles long before the trajectories can stretch out across the full level set. As the dimension increases there is more room and less likelihood of an early termination, which is why the effective sample size surprisingly increases with dimension at first. Adding the additional checks, however, only makes the premature termination worse and uniformly decreases performance, which drops to between 75% and 50% of the nominal values.

Similarly the heavy-tailed models (for smaller nu) require long trajectories to explore the tails, and the extra checks cut off these trajectories before they have adequately explored. Again the performance decrease is pretty stable at around 50%.

I have not yet convinced myself which way we want to proceed. Making this change brings more stability and improved performance for the IID normal model at the expense of performance for the non-normal models. Leaving the code as is leaves us with instability for the IID normal model (as a function of dimension) while maintaining the better performance for the non-normal models.


Cool!

Can you add the corresponding ESS results for x^2 and a horizontal line at the number of post-warmup iterations? (Or provide the test harness, and I can run that if you are busy.)

My point is that maybe we should not focus too much on trying to get antithetic behavior for x and maybe a 1.5-2.5x boost in ESS for x, if that means much lower ESS/second for x^2. Your test harness would be great for demonstrating the size of that effect, too.

Good point – the number of samples here is 10,000 for each run, so that is some wicked antithetic behavior. I’ll have to do a little bit of work to add the squares and separate them out into their own data for separate plots, but I should be able to get something by tomorrow.

extra_checks_report.pdf (488.7 KB)

This looks pretty conclusive. In addition to looking at the effective sample sizes for the x^{2}'s as well as the x's, I turned off metric adaptation (for each of these models the optimal scales are the default ones) to stabilize the runs (like the step size, the metric adaptation can be a bit noisy and obscure the differences between the variants).

The additional checks are uniformly better than or equal to the nominal results for the x^{2}'s in effective samples per gradient calculation, even for the correlated and heavy-tailed targets, and in the few instances where the nominal is better for the x's the difference is small.

I’ll clean up the code and work through updating all of the tests before submitting a PR in the next few days. Then we can update the step size adaptation criterion to the new acceptance statistic and we’ll have a shiny new sampler variant for 2.21.


@nhuurre the 2.21 release is October 18th if you want to try experimenting with anything else before then. Not that there’s any rush.

And the pull request is here: https://github.com/stan-dev/stan/pull/2800 if you want to make comments.

I’m just excited to see this moving forward.
The only comment I have is that the PR code is more complicated than what I experimented with because it tracks p_fwd_bck and p_bck_fwd in addition to p_sharp_fwd_bck and p_sharp_bck_fwd. I don’t know how much that matters but it does make it look more like a proper u-turn check.

The p_fwd_bck and p_bck_fwd are needed to make the checks reversible, so as to maintain the invariance of the target distribution. Specifically, the rho used in each check has to span the full sub-trajectory being evaluated, which requires appending one of those two boundary momenta. (I have verified empirically that the current PR has the right invariant distribution to high accuracy.)
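For readers following along, here is a minimal sketch of the idea in Eigen-style C++ (the helper names are illustrative, not the actual PR variables; compute_criterion is essentially the check in Stan's base_nuts):

```cpp
#include <Eigen/Dense>

// The no-u-turn criterion: keep growing the trajectory only while the
// (inverse-mass-rotated) momenta at both ends still point along the
// summed momentum rho.
bool compute_criterion(const Eigen::VectorXd& p_sharp_minus,
                       const Eigen::VectorXd& p_sharp_plus,
                       const Eigen::VectorXd& rho) {
  return p_sharp_plus.dot(rho) > 0 && p_sharp_minus.dot(rho) > 0;
}

// Sketch of one of the extra checks between merged subtrees. The point made
// above: the rho used in the check must span the full sub-trajectory, so the
// boundary momentum from the neighboring subtree is appended before the dot
// products are taken.
bool extra_check(const Eigen::VectorXd& rho_subtree,     // sum over one subtree
                 const Eigen::VectorXd& p_fwd_bck,       // boundary momentum
                 const Eigen::VectorXd& p_sharp_bck,     // M^{-1} p at outer end
                 const Eigen::VectorXd& p_sharp_fwd_bck  // M^{-1} p_fwd_bck
                 ) {
  Eigen::VectorXd rho_full = rho_subtree + p_fwd_bck;
  return compute_criterion(p_sharp_bck, p_sharp_fwd_bck, rho_full);
}
```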

Yeah, I figured it was something like that. Some days I feel like I understand the detailed balance, some days I don’t. I thought it was sufficient that the check treated the subtrees symmetrically.

One has to be careful because the checks need to be symmetric with respect to all possible starting points. In other words, if one can generate a trajectory from one point without failing the no-u-turn criterion then one has to be able to generate trajectories from all other starting points without failing the no-u-turn criterion.

Not including the endpoint momentum in rho for the extra checks might actually end up satisfying the requisite symmetry, although that would still leave the problem of not actually checking the proper no-u-turn criterion, because the summed momenta don’t span the entire trajectory.

In any case there doesn’t seem to be any performance loss for the more proper check, so there isn’t any reason not to do it right.

This is what keeps tripping me up.

In my experiments I wasn’t trying to do a proper u-turn check but only to detect the wraparounds. In that case compute_criterion(p_sharp_bck_fwd, p_sharp_fwd_bck, rho) is enough to guarantee that the usual no-u-turn check won’t miss the u-turn.

I agree the overhead is negligible either way.

I don’t think anyone can intuit this, which is why I’m careful to thoroughly check the resulting sampling behavior.

@nhuurre I am not sure if it helps but I think the criterion is not to check the angles of p_1 vs sum(p_1, ..., p_n) or p_n vs sum(p_1, ..., p_n) but to check the angles of p_1 vs sum(p_2, ..., p_n) and p_n vs sum(p_1, ..., p_{n-1}). My reason is q_n - q_1 ~ sum(p_2, ..., p_n) * step_size, not q_n - q_1 ~ sum(p_1, ..., p_n) * step_size. I use this criterion and do not observe your topic’s issue. Regardless, I hope that the discussions here will lead to a better criterion for Stan. :)

The sum is just a first-order approximation to the integral. Omitting one momentum amounts to a difference that is proportional to the step size, so it’s still a first-order approximation. I don’t have intuition about which one is more accurate or whether there even is a consistent difference. However, q_n - q_1 should be the same whichever direction you’re going, so I think if you’re going to do that you should instead use the average of the forward and backward sums, which is

q_n - q_1 \approx \frac{1}{2}p_1 + \sum_{k=2}^{n-1}p_k + \frac{1}{2}p_n

Again, no idea if that \frac{1}{2}\left(p_1+p_2\right) difference improves the approximation. Maybe looking at how it works when the trajectory is just a single step will clarify this?

So we start at q_1 with momentum p_1. The first momentum half-step takes us to

p^\prime = p_1 + \frac{1}{2}f_1

where f_1 is the force at q_1 and I have chosen units such that the step size is one. Next the position full-step moves us to

q_2 = q_1 + p^\prime = q_1 + p_1 + \frac{1}{2}f_1

and finally the second momentum half-step completes the transition

p_2 = p^\prime + \frac{1}{2}f_2 = p_1 + \frac{f_1+f_2}{2}

Rearranging these gives

\begin{eqnarray*}
p_2 - p_1 & = & \frac{f_1+f_2}{2} \\
q_2 - q_1 & = & p_1 + \frac{f_1}{2} = p_2 - \frac{f_2}{2} \\
& = & \frac{p_1+p_2}{2} + \frac{f_1 - f_2}{4}
\end{eqnarray*}

Actually, I now see that this can be summed over a trajectory of arbitrary length.

\begin{eqnarray*}
q_{n}-q_{1} & = & \left(q_{n}-q_{n-1}\right)+\left(q_{n-1}-q_{n-2}\right)+\dots+\left(q_{2}-q_{1}\right) \\
& = & \frac{1}{2}p_{1}+\sum_{k=2}^{n-1}p_{k}+\frac{1}{2}p_{n}+\frac{f_{1}-f_{n}}{4}
\end{eqnarray*}

Huh, that’s surprisingly simple. Can it really be this easy?
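As a quick sanity check (my own, with an arbitrary test problem and step size): the identity is exact for leapfrog, since each step satisfies q_{k+1} - q_k = \frac{\epsilon}{2}(p_k + p_{k+1}) + \frac{\epsilon^2}{4}(f_k - f_{k+1}) and the sum telescopes, so the two sides should agree to machine precision:

```cpp
#include <cstdio>

// Verify q_n - q_1 = eps*(p_1/2 + sum_{k=2}^{n-1} p_k + p_n/2)
//                    + eps^2*(f_1 - f_n)/4
// for leapfrog applied to a 1-D harmonic oscillator, f(q) = -q.
int main() {
  const double eps = 0.1;
  const int n = 50;             // states q_1, ..., q_n
  double q = 1.3, p = 0.7;      // arbitrary starting state (q_1, p_1)

  const double q1 = q, f1 = -q;
  double sum = 0.5 * p;         // half weight on the first momentum

  for (int k = 1; k < n; ++k) { // n - 1 leapfrog steps
    p += 0.5 * eps * (-q);      // momentum half-step
    q += eps * p;               // position full-step
    p += 0.5 * eps * (-q);      // momentum half-step
    sum += (k < n - 1) ? p : 0.5 * p;  // half weight on the last momentum
  }

  const double lhs = q - q1;
  const double rhs = eps * sum + eps * eps * (f1 - (-q)) / 4;
  std::printf("lhs = %.15f\nrhs = %.15f\n", lhs, rhs);  // agree to ~1e-16
  return 0;
}
```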


Yeah, I think that your approximation might be more accurate. But assume that there are only two points; then, to me, computing p_1 \cdot p_2 to decide a u-turn is better than computing p_1 \cdot (p_1+p_2)/2 and p_2 \cdot (p_1+p_2)/2. Anyway, using your approximation might also solve the topic issue, but I haven’t tested it yet. The point I want to make here is that there is an additional term, whether it is p_1 or p_n or your proposal (p_1+p_n)/2, in the approximation used by the current Stan criterion, which causes your issue.

The difference between the two is actually proportional to the square of the step size, which you can see if you restore the step sizes in your calculation.

But you have to be careful about what is approximating what. The original no-u-turn condition, p_{N} \cdot (q_{N} - q_{0}) > 0, (-p_{0}) \cdot (q_{0} - q_{N}) > 0, is well-defined only on Euclidean manifolds and hence isn’t particularly well-defined in terms of the Hamiltonian system itself. On the other hand the integral of the canonical form (the sum over momenta) introduced in https://arxiv.org/abs/1304.1920 is well-defined with respect to the Hamiltonian system and doesn’t rely on the particular structure of a given Euclidean space.

The two are equal in the \epsilon \rightarrow 0 limit, but ultimately the original no-u-turn criterion is what approximates this generalized sum criterion for a finite step size integrator, and not the other way around.
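(To spell out the generalized condition in the notation used elsewhere in this thread: writing p^{\sharp} = M^{-1} p for the rotated momenta at the two ends of a discrete trajectory (q_1, p_1), \dots, (q_n, p_n), the check is roughly

p^{\sharp}_{1} \cdot \rho > 0 \text{ and } p^{\sharp}_{n} \cdot \rho > 0, \qquad \rho = \sum_{k=1}^{n} p_{k} \approx \int_{0}^{T} p(t) \, dt,

a discretization of the integral of the momentum along the trajectory.)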

One immediate benefit of this more general condition is that it is manifestly symmetric between states, so the same sum can be used for both the forward and reverse checks. Removing the boundary momenta results in an asymmetric sum that will break the requisite symmetry if used naively. To ensure the right invariant distribution one would have to either keep all of the trajectory states around to compute the asymmetric sums on the fly or keep boundary states around to correct the total sum, which isn’t much different from what the current PR does.

The advantage of the current PR, and the current behavior in develop, is that it is more immediately connected to the proper theory and hence it is easier to demonstrate the desired invariant distribution. That connection to theory also helps to elucidate what effects the discrete trajectories are introducing and motivates a principled understanding of how to manage them.

The dot product becomes negative as soon as the angle between the vectors exceeds 90 degrees, which is not a u-turn. To detect when the turn is near 180 degrees you need to compare the momenta not to each other but to some kind of “pivot vector” that lies between them. I don’t think p_1 \cdot (p_1+p_2)/2 is too strict. When sampling from a two-dimensional IID normal, Stan does limit about 15% of the transitions to a single leapfrog step, so the current algorithm has no problem stopping at short u-turns. Limiting max_treedepth to 1 or even 2 seriously lowers the effective sample size, so I think even 15% is quite a lot.

If I restore the step sizes I get

\begin{eqnarray*}
q_{n}-q_{1} & = & \frac{1}{2}\epsilon p_{1}+\sum_{k=2}^{n-1}\epsilon p_{k}+\frac{1}{2}\epsilon p_{n}+\epsilon^{2}\frac{f_{1}-f_{n}}{4} \\
& = & \sum_{k=1}^{n}\epsilon p_{k}-\epsilon\left(\frac{1}{2}p_{1}+\frac{1}{2}p_{n}\right)+\epsilon^{2}\frac{f_{1}-f_{n}}{4} \\
& = & \epsilon\sum_{k=1}^{n} p_{k}+O\left(\epsilon\right)
\end{eqnarray*}

So no: the rho used by the u-turn criterion is a first-order approximation to q_n - q_1 (or vice versa). It is the symmetrically shortened sum that is a second-order approximation.

I can’t help but digress here. The above calculation suggests a nice visualization for leapfrog trajectories. Positions q_k are represented by black dots in the obvious way. The force at position q_k is drawn as the vector \frac{1}{2}\epsilon^2 f_k starting from q_k. The momentum is then drawn as the vector \epsilon p_k placed such that its center coincides with the center of the corresponding force vector. It then happens that the momenta line up end-to-end and create a contiguous “spine” for the trajectory.

Circles are boring so I’ll use an inverse-square potential for the picture. (This potential is of course pathological due to the infinite peak in the middle and the improper flat tail, but let’s not worry about those regions.)
Here is a trajectory for the two-dimensional potential \frac{-1}{2\left|q\right|^2}, starting at q=(1.5, 0) and p=(0, 0.66).
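In case anyone wants to reproduce the picture, here is a minimal sketch that generates the trajectory data (the step size and step count are my guesses; the plotting itself is left out). Each output row gives the position dot q_k, the momentum vector \epsilon p_k, and the half force vector \frac{1}{2}\epsilon^2 f_k:

```cpp
#include <cstdio>

// Leapfrog trajectory in the 2-D potential V(q) = -1/(2|q|^2), whose force is
// f(q) = -grad V(q) = -q/|q|^4. Prints q_k, eps*p_k, and (eps^2/2)*f_k so the
// dots, the momentum "spine", and the force vectors can be plotted.
int main() {
  const double eps = 0.1;   // illustrative step size
  const int n_steps = 100;  // illustrative trajectory length
  double q[2] = {1.5, 0.0};
  double p[2] = {0.0, 0.66};

  auto force = [](const double q[2], double f[2]) {
    const double r2 = q[0] * q[0] + q[1] * q[1];
    f[0] = -q[0] / (r2 * r2);
    f[1] = -q[1] / (r2 * r2);
  };

  double f[2];
  force(q, f);
  for (int k = 0; k < n_steps; ++k) {
    std::printf("%g %g   %g %g   %g %g\n",
                q[0], q[1],
                eps * p[0], eps * p[1],
                0.5 * eps * eps * f[0], 0.5 * eps * eps * f[1]);
    p[0] += 0.5 * eps * f[0];  p[1] += 0.5 * eps * f[1];  // momentum half-step
    q[0] += eps * p[0];        q[1] += eps * p[1];        // position full-step
    force(q, f);
    p[0] += 0.5 * eps * f[0];  p[1] += 0.5 * eps * f[1];  // momentum half-step
  }
  return 0;
}
```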

Symplectic geometry is more than just the position space. To see what the trajectory looks like in momentum space, move all the momenta to the origin while keeping the force vectors attached to their respective momenta. Now the forces create a contiguous path!

With all the relevant quantities in the same picture it’s possible to visualize different u-turn conditions. The original u-turn condition stops the sampler when the line connecting the first and last position makes an acute angle with at least one endpoint momentum.

[figure uturn1: the original u-turn condition]

The condition in Stan 2.20 uses a line connecting the ends of the momentum spine.

[figure uturn2: the Stan 2.20 condition]

@fehiepsi proposed something like

[figure uturn3: @fehiepsi’s proposed condition]

Indeed. I’m not even sure anymore if you can say it is more “complicated.” It has a couple more variables but their purpose is absolutely clear. Any proposed alternative needs a solid theoretical justification first.


@nhuurre your visualization is pretty nice! Thanks for making it! I don’t have much of an opinion on whether we approximate the integral using the starting points of each bin, the ending points of each bin, or, as you propose, the middle points of each bin. I just want to point out that there is an extra term in the integral approximation of the current Stan implementation (which seems incorrect to me), and it might cause your topic issue because I didn’t observe the issue in my implementation.

@betanalpha I don’t know how complicated it is in the Stan implementation, but in my implementation I just compute (p_sum - p_left) * p_left and similarly for the right point (I ignored the mass matrix here for simplicity). For @nhuurre’s proposal, we would just need to compute (p_sum - p_left/2 - p_right/2) * p_left and similarly for the right point. No need to store anything else, and these criteria are symmetric. Maybe I misunderstand your point?
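In Eigen-style code the two variants would look something like this (identifiers illustrative; identity mass matrix assumed, as in the post above):

```cpp
#include <Eigen/Dense>

// NumPyro-style check as described above: compare each endpoint momentum
// against the sum of all the *other* momenta.
bool numpyro_style_ok(const Eigen::VectorXd& p_sum,
                      const Eigen::VectorXd& p_left,
                      const Eigen::VectorXd& p_right) {
  return (p_sum - p_left).dot(p_left) > 0
      && (p_sum - p_right).dot(p_right) > 0;
}

// The half-weight variant proposed above: drop half of each endpoint momentum
// from the sum, matching the second-order approximation of q_n - q_1.
bool half_weight_ok(const Eigen::VectorXd& p_sum,
                    const Eigen::VectorXd& p_left,
                    const Eigen::VectorXd& p_right) {
  const Eigen::VectorXd rho = p_sum - 0.5 * p_left - 0.5 * p_right;
  return rho.dot(p_left) > 0 && rho.dot(p_right) > 0;
}
```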

Check out the pull request @betanalpha made for the details. I think Stan’s binary recursion means your change would have to be implemented in a similar fashion. What is your implementation anyway? Can you share the code?

@nhuurre you can find the code here (NUTS is implemented in an iterative fashion there but I think that the turning condition would be the same with the recursive style). I made this gist to print out n_leapfrog but you can explore more if you want. The functional pseudo-random number generator mechanism allows the result to be reproduced (even across multiple chains) on your system.

Thanks! I can confirm that NumPyro does not suffer from the problem. Based on some quick testing it looks like the shorter criterion does not impact effective sample sizes in two dimensions (contrary to my claim) and in general makes the chains somewhat less antithetic, much like Stan#PR2800.

NumPyro stores p_left and p_right and then multiplies those by the inverse mass matrix every time it wants to check for a u-turn. Stan#develop first multiplies the ps by the inverse mass matrix and then stores only the p_sharps. Stan#PR2800 stores the ps (like NumPyro) but also has the precomputed p_sharps. I think all these checks are so fast compared to the gradient calculation that arguing over micro-optimizations isn’t worth it. Unlike Stan, NumPyro’s shorter u-turn criterion does not need to be augmented with additional internal checks.

What I’ve had in mind is to keep Stan#develop’s u-turn criterion as it is and add an extra “convexity” check that looks like this:

[figure convex: the proposed convexity check]

The green triangle is formed by rho_fwd, rho_bck, and rho.
The red line is orthogonal to rho, and the criterion is that p_fwd_bck must lie on the same side of the red line as rho_fwd (and symmetrically for p_bck_fwd and rho_bck). I believe this is equivalent to

p_sharp_fwd_bck.dot(rho) > 0 && p_sharp_bck_fwd.dot(rho) > 0

The above trajectory continues growing, but after the doubling the convexity condition triggers (just barely):

In this picture the u-turn criterion also triggers, but notice that the convexity criterion doesn’t care whether the ends of the spine overlap and thus can resolve the potential pathology.
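Putting it together, the proposed convexity check would look something like this sketch (identifiers follow the discussion above, not necessarily an eventual implementation):

```cpp
#include <Eigen/Dense>

// Sketch of the proposed "convexity" check layered on top of the usual
// u-turn check. rho_fwd and rho_bck are the momentum sums of the two
// subtrees, and p_sharp_fwd_bck / p_sharp_bck_fwd are the inverse-mass-
// rotated momenta at the inner boundaries of the two subtrees.
bool convexity_ok(const Eigen::VectorXd& rho_fwd,
                  const Eigen::VectorXd& rho_bck,
                  const Eigen::VectorXd& p_sharp_fwd_bck,
                  const Eigen::VectorXd& p_sharp_bck_fwd) {
  const Eigen::VectorXd rho = rho_fwd + rho_bck;
  // Each inner boundary momentum must lie on the same side of the
  // hyperplane orthogonal to rho as its corresponding subtree sum.
  return p_sharp_fwd_bck.dot(rho) > 0 && p_sharp_bck_fwd.dot(rho) > 0;
}
```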