Acceptance ratio in NUTS

Hi there,
I have a question related to Feature/2789 improved stepsize adapt target by betanalpha · Pull Request #2836 · stan-dev/stan · GitHub, where the computation of the step size adaptation target was modified. The PR was later reverted, but I am not able to find any discussion about the reversal. Is there any analysis showing this is not a good change?


It was reverted simply because the devs couldn't agree on whether it had been tested adequately, and then the release deadline arrived. No concrete problems were known at the time.
I'm not aware of any discussion since. My understanding is that people have been disinclined to touch the subject because the original discussion got heated towards the end and was closed by the moderators.

It’s a good question, though, and in fact I’ve been meaning to re-open the issue for a while now. The stepsize adaptation target can surely be improved but I haven’t gotten around to testing the possibilities.

So far, I have found one model with surprising behaviour.

parameters {
  vector[5] x;
}
model {
  x ~ std_normal();
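  // -10 penalty to the log density whenever x[1] <= 0, i.e. a discontinuity at x[1] = 0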
  target += (x[1] > 0) ? 0 : -10;
}

The model is clearly pathological because of the discontinuity, and indeed the step size crashes under the current adaptation. The Boltzmann-weighted adaptation target, however, seems to ignore the discontinuity and maintains a large step size. (But maybe that's as intended, since the model does sample efficiently?)
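In case anyone wants to poke at this themselves, something like the following should show the step size the released adaptation settles on. This is just a sketch using CmdStanPy, with the model above saved as discontinuity.stan (the file name is arbitrary):

from cmdstanpy import CmdStanModel

# Compile and sample the model above (the path is just for illustration).
model = CmdStanModel(stan_file='discontinuity.stan')
fit = model.sample(chains=4, seed=1)

# Adapted step size for each chain; with the current adaptation these come out
# tiny for this model, which is the "crash" described above.
print(fit.step_size)
print(fit.summary())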


This is not the full story, but it's not worth going through it here. Unfortunately developer governance has been kicked down the road by the last few governing bodies, and there is unlikely to be any resolution until that is addressed.

No problems were ever suggested, let alone demonstrated. Indeed, from the beginning the change was motivated by the rich theory of Hamiltonian Monte Carlo, and it was then empirically verified in a variety of examples.

The weighted adaptation and the current adaptation both assume that the symplectic integrator is stable, which precludes divergences and discontinuities. In the original paper we even showed how the adaptation target breaks down in the presence of unstable/inaccurate integration.

In other words, neither adaptation routine guarantees any behavior in the presence of a discontinuity. Because of that, "surprise" isn't the right word here, as it presumes some expected behavior!

That said, it's relatively straightforward to understand why the two adaptation targets behave differently for this particular model.

Both adaptation targets construct an "adaptation statistic" by averaging the hypothetical Metropolis acceptance probability at every point in each numerical trajectory. For any trajectory that crosses the boundary at x[1] = 0 this requires averaging over points on both sides of the discontinuity. If the trajectory starts to the left of the discontinuity then the hypothetical acceptance probabilities of all the points on the same side will be smallish while those on the right side of the discontinuity will be much larger. On the other hand, if the trajectory starts to the right of the discontinuity then points on the same side will have largish acceptance probabilities and points on the other side will have much smaller acceptance probabilities.

The current adaptation averages the acceptance probabilities at each point uniformly, which always drags down the resulting acceptance statistic unless a numerical Hamiltonian trajectory never crosses the discontinuity at all. The better adaptation, however, gives the points with smaller energy errors a higher weight, which suppresses the contribution of the lower acceptance probabilities. In other words, it doesn't care if half of the trajectory has points with low acceptance probabilities so long as the other half has high acceptance probabilities, because it's going to end up using those latter points anyway!
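As a toy numerical illustration of that difference (not Stan's actual code: the trajectory energies are made up, and the acceptance probability min(1, exp(-ΔE)) and the normalized Boltzmann weights exp(-ΔE) are just my reading of the description above):

import numpy as np

# Toy "trajectory": energy error H_i - H_0 of each leapfrog state relative to
# the initial point. Points on the initial point's side of the discontinuity
# have small errors; points on the other side pick up the ~10-unit jump.
delta_E = np.array([0.02, 0.05, 0.01, 10.03, 10.01, 10.04])

# Hypothetical Metropolis acceptance probability at each point.
accept = np.minimum(1.0, np.exp(-delta_E))

# Current adaptation statistic: plain average over the trajectory.
uniform_stat = accept.mean()

# Reverted PR's statistic, as described above: average weighted by the
# Boltzmann/multinomial weights exp(-(H_i - H_0)), so the low-weight points
# across the discontinuity barely contribute.
w = np.exp(-delta_E)
weighted_stat = np.sum(w * accept) / np.sum(w)

print(uniform_stat)   # ~0.5: half the points drag the average down
print(weighted_stat)  # ~1.0: dominated by the well-behaved points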

For this problem, where the discontinuity depends on only one coordinate, the multinomial sampling might actually compensate for the discontinuity enough to result in decent finite-iteration estimation, even though the numerical Hamiltonian trajectories themselves aren't exploring as much as they would if they actually accounted for the jump.

The step size adaptation is probably crashing because the discontinuity leads to a kind of "gap" in the adaptation statistic. Because of the symmetry of the target density each trajectory will have a similar number of points on each side of the discontinuity, and the adaptation statistic will be close to 0.5 no matter the step size. The dual averaging will keep trying to reduce the step size to increase the adaptation statistic, with no luck, ending up in that fatal spiral. The better adaptation statistic, however, essentially filters out the bad points so that the adaptation is just trying to tune for one side of the discontinuity, which is why it avoids the problem. But again this "gap" arises only because of the discontinuity, which invalidates the expected smooth relationship between adaptation statistic and cost on which the adaptation is based.
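Here is a stripped-down sketch of that spiral: dual averaging with the usual constants from the Hoffman and Gelman NUTS paper, and an adaptation statistic pinned at 0.5 as the symmetry argument suggests for this model. It isn't Stan's implementation, but it shows how a statistic stuck below the 0.8 target drives the step size toward zero:

import numpy as np

# Dual-averaging step size adaptation in the spirit of Hoffman & Gelman (2014),
# using the usual default constants; a sketch, not Stan's exact code.
delta = 0.8                      # target value for the adaptation statistic
gamma, t0, kappa = 0.05, 10.0, 0.75
eps = 1.0                        # initial step size
mu = np.log(10 * eps)

H_bar = 0.0
log_eps_bar = 0.0
for t in range(1, 201):
    # Pretend the adaptation statistic is stuck near 0.5 regardless of the
    # step size, as happens for this model.
    accept_stat = 0.5
    eta = 1.0 / (t + t0)
    H_bar = (1.0 - eta) * H_bar + eta * (delta - accept_stat)
    log_eps = mu - np.sqrt(t) / gamma * H_bar
    log_eps_bar = t ** (-kappa) * log_eps + (1.0 - t ** (-kappa)) * log_eps_bar
    eps = np.exp(log_eps)

# The step size collapses toward zero without ever raising the statistic.
print(eps, np.exp(log_eps_bar))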

Yeah I think it’s worth revisiting this.

We do have a voting procedure now, which certainly isn’t perfect but is more than we used to have. I don’t think the voting system was in place when this conflict first happened. It’s possible that we’d still run into unresolvable issues if this gets revisited but perhaps it’s worth trying at some point.


If we’re going to go down this path we should probably do this more formally to make sure that the details of the history and technical issues are clear. In particular we did have policies and a voting system in place set up by the Technical Working Group director at the time, and those policies were followed in the PR. The problems arose when those policies were challenged and we just weren’t set up to deal with the situation.

The current voting procedure, with its somewhat vague guidelines, also has vulnerabilities in technical matters like this, which I've commented on in some of the other governance threads. It would be great to hammer these governance issues out, as voting on subtle mathematical/technical questions is only going to become more significant in the future.