# NUTS misses U-turns, runs in circles until max_treedepth

As I was reading this thread, I wondered if what @nhuurre indicated is true: The step size is the culprit, and it can happen to any model. Also, if the problem is that while 1 tree yields a good step, but the second tree yields a U-turn relative to the first, won’t we eventually encounter a case where it happens over 3 or 4 trees instead of 2 and have what is in essence the same problem (perhaps with a step size of 0.215 in this example to get a 4-tree return to the same location).

Might another solution be to have a random variability on the step size, for instance a uniform distribution around ±10% of the nominal step size (with user control over the 10% value)? That seems like it would move away from the u-turn relatively quickly just based on random variability in the step not allowing loops to form.

If that is a reasonable solution, it would have the benefit that the paired tree tests would not be necessary.

A bit late for the alert. Half the posts have been written by an interested amateur. :)

So far I’ve been looking at IID normal models. These are very symmetrical and I think that’s an important factor here. A more irregular model might have trajectories that don’t loop like that. I don’t know.
But then again the normal distribution is often the default choice and most good models are approximately normal if you have enough data.

Yes, it can happen with longer trajectories too. But note that the u-turn condition is checked at every power-of-two steps. If it’s not half-circle to full-circle transition then an intermediate check will catch the u-turn.

Stan has a stepsize jitter option (which is off by default). It won’t help much because the stepsize still stays constant during a single trajectory. Unfortunately the leapfrog integrator cannot handle within-trajectory variation in stepsize.

1 Like

I think (sorry for hasty typing, but I’ll be away for several days, and wanted to add this before leaving):

The current memory requirements are due to recursive implementation. In non-recursive approach in any moment we need to store (unless I’m mistaken)

• z_0 (previous draw) so that we can compute U-turn criterion
• left and right end points to be able to continue simulation left or right
• the point we would accept if we would decide to stop

The new point in dynamic simulation can be considered as addition to the tree, but after it has added we can collapse the information to the above. At any time we have information which point would be accepted given the simulation so far and the total energy so far, and we can make the probabilistic decision whether the new addition is the new accepted point given the old and new parts. I guess Ben can implement this soonish with his non-recursive R code.

Given that we don’t need to worry about memory requirements (as by 1), we can check the U-turn criterion more often and infinite loops would be much more unlikely. Assuming I’m mistaken in memory requirements, we could instead of adding 100% more, add e.g. 50% more and still have better memory behavior than if adding constant number of steps.

My hypothesis is that if we want to maximize ESS/n_leapforg for E[X] and E[x^2] we should stop earlier than at U-turn. I now realize that part of my previous experiments were ruined by infinite loops, but the last figure in @nhuurre’s first post supports this hypothesis. The figure shows ESS (Neff) for x and lp, but in the case of normal distribution lp is proportional to x^2. Increasing step size makes n_leapfrog to decerase until step size 1, plateaus and then later decreases again. By eye, it seems the highest ESS/n_leapfrog for x^2 is around stepsize 1.0 while the highest ESS/n_leapfrog for x is around stepsize 1.4. Similarly longer simulation without u-turn increase n_leapfrog and increase ESS for x, but decrease ESS/n_leapfrog for x^2.

Nope. That’s all you need to build a tree to a given length, but not enough to make all of the necessary checks to ensure the the sampling of an entire trajectory is sufficiently reversible to admit the target invariant distribution. The beauty of the recursive implementation is that is makes all of the necessary checks with as little memory as possible.

Yeah, so this showed up in a hierarchical model here: Request for Volunteers to Test Adaptation Tweak - #33 by bbbales2

I think that model has 300+ parameters?

@avehtari has been talking about making the uturn criteria less stringent for a while. I think he’s seen suspicious behavior in other models. We were specifically looking at the uturn criteria here since it seemed like something interesting was happening. The clearest plot I think was to plot, for every uturn check done as part of the binary tree-building, the minimum of the two uturn criteria (there are two checks done between every two pairs of points) along with the number of leapfrog steps in that path.

The way things are plotted here going below 0.0 means uturn (the scores are dot products of normalized vectors – so take the inverse cosine to get the angle between them):

These are the uturn criteria checked in one trajectory building. Soon after hitting a doubling to a length 128 trajectory, we found a uturn and jumped out. There are actually two checks here that count as uturns – normally there’d just be one before you break the loop but the way I wrote the debug code you might get two. That’s just an artifact or whatever.

Anyway you see that with 8 leapfrog steps we’re neeeeearly U-turning. In fact, when we keep building and building the tree we eventually do get an 8 leapfrog U-turn. But for 16 leapfrog steps and 32 we’re quite a bit worse off according to the Uturn criteria at least.

So @avehtari was talking about making the uturn criteria something smaller. So maybe not a full 90 degree turn. What we’ve talked about in this thread is adding extra checks of intermediate length (as low as just +1 to our previous checks). I think we could also do something based on the collection of uturn criteria here as well. Like, if we do a tree doubling and our uturn criteria for the combined tree is worse than the criteria for both of the subtrees, bail. And maybe mix in a threshold parameter too.

Code for making that plot coming eventually.

@nhuurre, this is the NUTS implementation that wraps around Stan models so you can experiment with them in R. You think this might be useful to you? If so give it a whirl and see if it works okay. I’m tagging @arya (he started this code) so he can see how it goes.

It’s broken up into two packages. First is this thing, hamwrapper, that wraps up Stan gradients. Install that with:

devtools::install_github("bbbales2/hamwrapper")


This could be done with just the gradients exposed by Rstan, but @arya started this working on a project that needed higher order autodiff which is why the extra code.

The next package is RNUTS. I think it’s better to just check this one out as an RStudio project. The git url is https://github.com/bbbales2/RNUTS.git. So make a project with that, and then control + shift + B to build and install it. The Hadley R book is a quick skim to see how to work with packages: http://r-pkgs.had.co.nz/. This is my first attempt at doing R packages so it’s probly pretty bad. I thought this would be a good way to organize stuff though.

There’s a file in the root of the package (I don’t think files are supposed to be here but whatever), rnuts_example.R, that shows how to use this along with a little bit of what I think might be helpful plotting code.

The main code is in R/nuts.R. Navigate down to “oneSampleNuts” and you can look through how it works. Anyway tear it up and try whatever.

There’s one unit test that makes sure that the algorithm is roughly sampling a single normal random variable correctly and with some tiny bit of efficiency. There are probably bugs here. I didn’t exhaustively test this.

Here are a couple plots produced from the example file:

Here’s another. Check the color scale here. It’s kinda weird, but I wanted to be able to differentiate between something be close to uturning but not uturning and something that just barely uturned (so there’s kinda of a discontinuity in the color gradient near zero, and < 0.0 is the uturn condition).

Hopefully that’s useful

2 Likes

I don’t know how useful it is, but the visualization of angles in the binary tree form is super cool.

1 Like

I have a nice test harness in place and have results for a reasonably optimal transition that adds in the red checks below,

The results are consistent with expectations, but sadly are not a complete improvement and so will require making some nontrivial choices.
extra_checks_report.pdf (307.7 KB)

For the IID normal model we see the “resonant” behavior where the trajectory misses the no-u-turn criterion turning on and gets stuck processing around the level set for a long time until it gets lucky enough to terminate. The variant code adds the aforementioned checks which indeed prevent this problem and significantly stabilize the performance.

Unfortunately these additional checks have adverse affects on linearly correlated models. For these models the no-u-turn criterion is already too aggressive, terminating at high frequency wiggles long before they can stretch out across the full level set. As the dimension increases there is more room and less likelihood of an early termination, which is why the effective sample size surprisingly increases with dimension initially. Adding the additional checks, however, only makes the premature termination worse and uniformly decreases performance. The performance drops to between 75% and 50% of nominal values.

Similarly the heavy tailed models (for smaller nu) require long trajectories to explore the tails, and the extra checks cut off these trajectories before they have adequately explored. Again the performance decreases is pretty stable around 50%.

I have not yet convinced myself which way we want to proceed. Making this change brings more stability and improved performance for the IID normal at the expensive of the performance for non-normal models. Leaving the code as is leaves us with instability for the normal iid model (as a function of dimension) while maintaining the better performance for non-normal models.

3 Likes

Cool!

Can you add corresponding ESS results for x^2 and horizontal line at the number of post-warmup iterations? (Or provide the test harness, and I can run that if you are busy).

My point is that maybe we should not focus too much on trying to get antithetic behavior for x and maybe 1.5-2.5 x boost in ESS for x, if that means much lower ESS/second for x^2. Your test harness would be great for demonstrating the size of that effect, too.

Good point – the number of samples here is 10,000 for each run so that is some wicked antithetical behavior. I’ll have to do a little bit of work to add the squares and separate them out into their own data for separate plots but I should be able to get something by tomorrow.

extra_checks_report.pdf (488.7 KB)

This looks pretty conclusive. In addition to looking at the effective sample sizes for x^{2}'s in addition to x's I turned off metric adaptation (for each of these models the optimal scales are the default ones) to stabilize the runs (like the step size the metric adaptation can be a bit noisy and obscure the differences between the variants).

The additional checks are uniformly better than or equal to the nominal results for the x^{2}'s in effective samples per gradient calculation, even for the correlated and heavy tailed targets, and in the few instances where the nominal is better for the x's the difference is small.

I’ll clean up the code and work through updating all of the tests before submitting a PR in the next few days. Then we can update the step size adaptation criterion to the new acceptance statistic and we’ll have a shiny new sampler variant for 2.21.

6 Likes

@nhuurre the 2.21 release is October 18th if you want to try experimenting with anything else before then. Not that there’s any rush.

And the pull req. is here: https://github.com/stan-dev/stan/pull/2800 if you want to make comments.

I’m just excited to see this moving forward.
The only comment I have is that the PR code is more complicated than what I experimented with because it tracks p_fwd_bck and p_bck_fwd in addition to p_sharp_fwd_bck and p_sharp_bck_fwd. I don’t know how much that matters but it does make it look more like a proper u-turn check.

The p_fwd_back and p_bck_fwd are needed to make the checks reversible to maintain the invariance of the target distribution. Specifically the 'rho used in each check has to span the full sub trajectory being evaluated which requires appending on one of those two boundary momenta. I have verified empirically that the current PR has the right invariant distribution to high accuracy).

Yeah, I figured it was something like that. Some days I feel like I understand the detailed balance, some days I don’t. I thought it was sufficient that the check treated the subtrees symmetrically.

One has to be careful because the checks need to be symmetric with respect to all possible starting points. In other words, if one can generate a trajectory from one point without failing the no-u-turn criterion then one has to be able to generate trajectories from all other starting points without failing the no-u-turn criterion.

Not including the end point momentum in rho for the extra checks might actually end up satisfying the requisite symmetry, although that would still leave the problem of not actually checking the proper no-u-turn criterion because the summed momenta doesn’t span the entire trajectory.

In any case there doesn’t seem to be any performance loss for the more proper check so there isn’t any reason to do it right.

1 Like

This is what keeps tripping me up.

In my experiments I wasn’t trying to do a proper u-turn check but only to detect the wraparounds. In that case compute_criterion(p_sharp_bck_fwd, p_sharp_fwd_bck, rho) is enough to guarantee that the usual no-u-turn check won’t miss the u-turn.

I agree the overhead is negligible either way.

I don’t think anyone can intuit this, which is why I’m careful to thoroughly check the resulting sampling behavior.

@nhuurre I am not sure if it helps but I think the criterion is not to check the angles of p_1 vs sum(p_1,..., p_n) or p_n vs sum(p_1,..., p_n) but to check the angles of p_1 vs sum(p_2,..., p_n) and pn vs sum(p1,..., p_{n-1}). My reason is q_n - q_1 ~ sum(p_2,..., p_n) * step_size, not q_n - q_1 ~ sum(p_1,..., p_n) * step_size. I use this criterion and do not observe your topic’s issue. Regardless, I hope that the discussions here will lead to a better criterion for Stan. :)

The sum is just a first order approximation to the integral. Omitting one momentum amounts to a difference that is proportional to the step size. So it’s still a first order approximation. I don’t have intuition which one is more accurate or if there even is a consistent difference. However, q_n-q_1 should be the same whichever direction you’re going so I think if you’re going to do that you should instead use the average of forward and backward sums which is

q_n - q_1 \approx \frac{1}{2}p_1 + \sum_{k=2}^{n-1}p_k + \frac{1}{2}p_n

Again, no idea if that \frac{1}{2}\left(p_1+p_2\right) difference improves the approximation. Maybe looking at how it works when the trajectory is just a single step will clarify this?

So we start at q_1 with momentum p_1 . The first momentum half-step takes us to

p^\prime = p_1 + \frac{1}{2}f_1

where f_1 is the force at q_1 and I have chosen units such that the stepsize is one. Next position full-step moves us to

q_2 = q_1 + p^\prime = q_1 + p_1 + \frac{1}{2}f_1

and finally the second momentum half-step completes the transition

p_2 = p^\prime + \frac{1}{2}f_2 = p_1 + \frac{f_1+f_2}{2}

Rearranging these gives

\begin{eqnarray*} p_2 - p_1 & = & \frac{f_1+f_2}{2}\\ q_2 - q_1 & = & p_1 + \frac{f_1}{2} = p_2 - \frac{f_2}{2}\\ & = & \frac{p_1+p_2}{2} + \frac{f_1 - f_2}{4}\\ \end{eqnarray*}

Actually, I now see that this can be summed over a trajectory of arbitrary length.

\begin{eqnarray*} q_{n}-q_{1} & = & \left(q_{n}-q_{n-1}\right)+\left(q_{n-1}-q_{n-2}\right)+\dots+\left(q_{2}-q_{1}\right) \\ & = & \frac{1}{2}p_{1}+\sum_{k=2}^{n-1}p_{k}+\frac{1}{2}p_{n}+\frac{f_{1}-f_{n}}{4} \end{eqnarray*}

Huh, that’s surprisingly simple. Can it really be this easy?

2 Likes