NUTS misses U-turns, runs in circles until max_treedepth

It’s one of two ways to build the set of trajectories to sample from outlined in: https://arxiv.org/pdf/1701.02434.pdf and https://arxiv.org/pdf/1601.00225.pdf

Takes a little to digest cause there’s lots of parts, but you should go through it cause how you build the tree and what U-turn checks you need are super connected.

Well don’t dig too far back in my post history haha.

Nice plots. I like them. Did you actually get a 2d example to do this? Or is this picking a couple dimensions out of a much higher dimensional problem?

I guess the rhos that Betancourt is talking about are the sums of momentums. But change in position position is the integral of momentum, and just summing the momentums with a fixed stepsize along a leapfrog trajectory is going to give you a good approximation to that change in position up to a scaling factor, and we only care about directions.

So all the red arrows point in the same direction as a vector between the unconnected endpoints. Given that and the fact that I don’t know the Riemannian stuff, I’ve happily stuck with the old U-turn criteria. The differences are in A.4.2 here: https://arxiv.org/pdf/1701.02434.pdf

My buddy @arya wrote a NUTS implementation in R for experimenting with this sorta algorithmic stuff. I’ve been working on it some more. Haven’t gotten it finished yet (it seems to work – just badly organized at this point), but it takes in Stan models/data, and the NUTS implementation as written is about 200~ lines of non-recursive R that makes it easy to get debug info out :D. Gotta review a couple pull reqs first, but I’ll try to get this up within a month (hardly soon, but sigh).

Rewriting one for yourself might be worth it too since you’re getting in deep on this. I think I was down on rewriting it when Arya originally did it cause it looked so complicated, but Betancourt did a good job writing stuff up so it isn’t so bad in the end.

Yes yes

It’s 2d. Theoretically higher dimensions look the same, just rotated randomly. I didn’t see an easy way to pick the directions of interest in higher dimensions so giving a 2d model a bad stepsize (0.43) was the fastest option. Dimensionality matters only if you let the model choose its own stepsize.

Yes.

I thought so too (cf. “momentum polygons” upthread) but apparently that approximation is quite rough.

That means rho is going in the same direction as endpoint momenta; NUTS thinks the trajectory is straight. Since the final point is behind the initial one a distance vector would go in the opposite direction. But these pictures aren’t all that important as they come from an abnormally large stepsize. The smaller stepsize picture looks as expected.

Really cool!

Too deep, I’d say. Yesterday I spent way too much time staring at the NUTS implementation in Chi Feng’s MCMC Gallery with stepsize 0.43 just see if the problem would show up. (It did.) I was planning on taking a break because I’ve been taking this discussion too personally and @betanalpha already understands the bug better than I do.

1 Like

Nice! This is the picture I had in my head, and one of the first exercises I was going to do in the testing. You have to be careful when drawing even cartoon trajectories because symplectic integrator trajectories lie on closed surfaces – they can’t spiral in or out which significantly constrain their behaviors.

At the same time being on a closed surface (which can be showed to be a perturbation of the exact energy level set) doesn’t mean that the discrete trajectories will close, unless you can set the step size to something that evenly divides the period on that modified level set.

In other words this is pretty much the expected behavior, at least for the one-dimensional Gaussian target.

It all depends on the step size. The reason the two aren’t exactly the same is that the symplectic leapfrog integrator that we use is a second-order method and the change in positions isn’t just a scaled version of the final momenta after each update.

Nice!

I just want to emphasize that this has been a great, constructive conversation, so thanks to all involved. Please feel free to continue – I’ll follow along but won’t be able to really put together rigorous tests until I get back home from traveling/teaching/galavanting in a few weeks.

Lol, 0.425 was the stepsize where I was able to get the radon model to go crazy on me. Wonder if that’s a coincidence or not. That’s cool that it’s working in a 2D model.

Ditto. The internet is intimidating.

Those animations are great

That’s curious. The effect is really sensitive to stepsize and what constitutes a bad one doesn’t depend on dimension but it definitely depends on the width of the distribution. I wouldn’t expect a real model give a unit normal posterior even if you data is scaled to approximate that… Ah, it depends on the mass matrix too, and if your adaptation is perfect then that cancels out the variance matrix! Bad stepsizes should be universal for nearly-normal models after all!

Interested amateur alert! :)

As I was reading this thread, I wondered if what @nhuurre indicated is true: The step size is the culprit, and it can happen to any model. Also, if the problem is that while 1 tree yields a good step, but the second tree yields a U-turn relative to the first, won’t we eventually encounter a case where it happens over 3 or 4 trees instead of 2 and have what is in essence the same problem (perhaps with a step size of 0.215 in this example to get a 4-tree return to the same location).

Might another solution be to have a random variability on the step size, for instance a uniform distribution around ±10% of the nominal step size (with user control over the 10% value)? That seems like it would move away from the u-turn relatively quickly just based on random variability in the step not allowing loops to form.

If that is a reasonable solution, it would have the benefit that the paired tree tests would not be necessary.

A bit late for the alert. Half the posts have been written by an interested amateur. :)

So far I’ve been looking at IID normal models. These are very symmetrical and I think that’s an important factor here. A more irregular model might have trajectories that don’t loop like that. I don’t know.
But then again the normal distribution is often the default choice and most good models are approximately normal if you have enough data.

Yes, it can happen with longer trajectories too. But note that the u-turn condition is checked at every power-of-two steps. If it’s not half-circle to full-circle transition then an intermediate check will catch the u-turn.

Stan has a stepsize jitter option (which is off by default). It won’t help much because the stepsize still stays constant during a single trajectory. Unfortunately the leapfrog integrator cannot handle within-trajectory variation in stepsize.

1 Like

I think (sorry for hasty typing, but I’ll be away for several days, and wanted to add this before leaving):

The current memory requirements are due to recursive implementation. In non-recursive approach in any moment we need to store (unless I’m mistaken)

  • z_0 (previous draw) so that we can compute U-turn criterion
  • left and right end points to be able to continue simulation left or right
  • the point we would accept if we would decide to stop

The new point in dynamic simulation can be considered as addition to the tree, but after it has added we can collapse the information to the above. At any time we have information which point would be accepted given the simulation so far and the total energy so far, and we can make the probabilistic decision whether the new addition is the new accepted point given the old and new parts. I guess Ben can implement this soonish with his non-recursive R code.

Given that we don’t need to worry about memory requirements (as by 1), we can check the U-turn criterion more often and infinite loops would be much more unlikely. Assuming I’m mistaken in memory requirements, we could instead of adding 100% more, add e.g. 50% more and still have better memory behavior than if adding constant number of steps.

My hypothesis is that if we want to maximize ESS/n_leapforg for E[X] and E[x^2] we should stop earlier than at U-turn. I now realize that part of my previous experiments were ruined by infinite loops, but the last figure in @nhuurre’s first post supports this hypothesis. The figure shows ESS (Neff) for x and lp, but in the case of normal distribution lp is proportional to x^2. Increasing step size makes n_leapfrog to decerase until step size 1, plateaus and then later decreases again. By eye, it seems the highest ESS/n_leapfrog for x^2 is around stepsize 1.0 while the highest ESS/n_leapfrog for x is around stepsize 1.4. Similarly longer simulation without u-turn increase n_leapfrog and increase ESS for x, but decrease ESS/n_leapfrog for x^2.

Nope. That’s all you need to build a tree to a given length, but not enough to make all of the necessary checks to ensure the the sampling of an entire trajectory is sufficiently reversible to admit the target invariant distribution. The beauty of the recursive implementation is that is makes all of the necessary checks with as little memory as possible.

Yeah, so this showed up in a hierarchical model here: Request for Volunteers to Test Adaptation Tweak

I think that model has 300+ parameters?

@avehtari has been talking about making the uturn criteria less stringent for a while. I think he’s seen suspicious behavior in other models. We were specifically looking at the uturn criteria here since it seemed like something interesting was happening. The clearest plot I think was to plot, for every uturn check done as part of the binary tree-building, the minimum of the two uturn criteria (there are two checks done between every two pairs of points) along with the number of leapfrog steps in that path.

The way things are plotted here going below 0.0 means uturn (the scores are dot products of normalized vectors – so take the inverse cosine to get the angle between them):

scores_vs_distance

These are the uturn criteria checked in one trajectory building. Soon after hitting a doubling to a length 128 trajectory, we found a uturn and jumped out. There are actually two checks here that count as uturns – normally there’d just be one before you break the loop but the way I wrote the debug code you might get two. That’s just an artifact or whatever.

Anyway you see that with 8 leapfrog steps we’re neeeeearly U-turning. In fact, when we keep building and building the tree we eventually do get an 8 leapfrog U-turn. But for 16 leapfrog steps and 32 we’re quite a bit worse off according to the Uturn criteria at least.

So @avehtari was talking about making the uturn criteria something smaller. So maybe not a full 90 degree turn. What we’ve talked about in this thread is adding extra checks of intermediate length (as low as just +1 to our previous checks). I think we could also do something based on the collection of uturn criteria here as well. Like, if we do a tree doubling and our uturn criteria for the combined tree is worse than the criteria for both of the subtrees, bail. And maybe mix in a threshold parameter too.

Code for making that plot coming eventually.

@nhuurre, this is the NUTS implementation that wraps around Stan models so you can experiment with them in R. You think this might be useful to you? If so give it a whirl and see if it works okay. I’m tagging @arya (he started this code) so he can see how it goes.

It’s broken up into two packages. First is this thing, hamwrapper, that wraps up Stan gradients. Install that with:

devtools::install_github("bbbales2/hamwrapper")

This could be done with just the gradients exposed by Rstan, but @arya started this working on a project that needed higher order autodiff which is why the extra code.

The next package is RNUTS. I think it’s better to just check this one out as an RStudio project. The git url is https://github.com/bbbales2/RNUTS.git. So make a project with that, and then control + shift + B to build and install it. The Hadley R book is a quick skim to see how to work with packages: http://r-pkgs.had.co.nz/. This is my first attempt at doing R packages so it’s probly pretty bad. I thought this would be a good way to organize stuff though.

There’s a file in the root of the package (I don’t think files are supposed to be here but whatever), rnuts_example.R, that shows how to use this along with a little bit of what I think might be helpful plotting code.

The main code is in R/nuts.R. Navigate down to “oneSampleNuts” and you can look through how it works. Anyway tear it up and try whatever.

There’s one unit test that makes sure that the algorithm is roughly sampling a single normal random variable correctly and with some tiny bit of efficiency. There are probably bugs here. I didn’t exhaustively test this.

Here are a couple plots produced from the example file:

scores

Here’s another. Check the color scale here. It’s kinda weird, but I wanted to be able to differentiate between something be close to uturning but not uturning and something that just barely uturned (so there’s kinda of a discontinuity in the color gradient near zero, and < 0.0 is the uturn condition).

checks

Hopefully that’s useful

2 Likes

I don’t know how useful it is, but the visualization of angles in the binary tree form is super cool.

1 Like

I have a nice test harness in place and have results for a reasonably optimal transition that adds in the red checks below,

The results are consistent with expectations, but sadly are not a complete improvement and so will require making some nontrivial choices.
extra_checks_report.pdf (307.7 KB)

For the IID normal model we see the “resonant” behavior where the trajectory misses the no-u-turn criterion turning on and gets stuck processing around the level set for a long time until it gets lucky enough to terminate. The variant code adds the aforementioned checks which indeed prevent this problem and significantly stabilize the performance.

Unfortunately these additional checks have adverse affects on linearly correlated models. For these models the no-u-turn criterion is already too aggressive, terminating at high frequency wiggles long before they can stretch out across the full level set. As the dimension increases there is more room and less likelihood of an early termination, which is why the effective sample size surprisingly increases with dimension initially. Adding the additional checks, however, only makes the premature termination worse and uniformly decreases performance. The performance drops to between 75% and 50% of nominal values.

Similarly the heavy tailed models (for smaller nu) require long trajectories to explore the tails, and the extra checks cut off these trajectories before they have adequately explored. Again the performance decreases is pretty stable around 50%.

I have not yet convinced myself which way we want to proceed. Making this change brings more stability and improved performance for the IID normal at the expensive of the performance for non-normal models. Leaving the code as is leaves us with instability for the normal iid model (as a function of dimension) while maintaining the better performance for non-normal models.

2 Likes

Cool!

Can you add corresponding ESS results for x^2 and horizontal line at the number of post-warmup iterations? (Or provide the test harness, and I can run that if you are busy).

My point is that maybe we should not focus too much on trying to get antithetic behavior for x and maybe 1.5-2.5 x boost in ESS for x, if that means much lower ESS/second for x^2. Your test harness would be great for demonstrating the size of that effect, too.

Good point – the number of samples here is 10,000 for each run so that is some wicked antithetical behavior. I’ll have to do a little bit of work to add the squares and separate them out into their own data for separate plots but I should be able to get something by tomorrow.

extra_checks_report.pdf (488.7 KB)

This looks pretty conclusive. In addition to looking at the effective sample sizes for x^{2}'s in addition to x's I turned off metric adaptation (for each of these models the optimal scales are the default ones) to stabilize the runs (like the step size the metric adaptation can be a bit noisy and obscure the differences between the variants).

The additional checks are uniformly better than or equal to the nominal results for the x^{2}'s in effective samples per gradient calculation, even for the correlated and heavy tailed targets, and in the few instances where the nominal is better for the x's the difference is small.

I’ll clean up the code and work through updating all of the tests before submitting a PR in the next few days. Then we can update the step size adaptation criterion to the new acceptance statistic and we’ll have a shiny new sampler variant for 2.21.

5 Likes

@nhuurre the 2.21 release is October 18th if you want to try experimenting with anything else before then. Not that there’s any rush.

And the pull req. is here: https://github.com/stan-dev/stan/pull/2800 if you want to make comments.

I’m just excited to see this moving forward.
The only comment I have is that the PR code is more complicated than what I experimented with because it tracks p_fwd_bck and p_bck_fwd in addition to p_sharp_fwd_bck and p_sharp_bck_fwd. I don’t know how much that matters but it does make it look more like a proper u-turn check.

The p_fwd_back and p_bck_fwd are needed to make the checks reversible to maintain the invariance of the target distribution. Specifically the 'rho used in each check has to span the full sub trajectory being evaluated which requires appending on one of those two boundary momenta. I have verified empirically that the current PR has the right invariant distribution to high accuracy).

Yeah, I figured it was something like that. Some days I feel like I understand the detailed balance, some days I don’t. I thought it was sufficient that the check treated the subtrees symmetrically.