The requisite reading for what’s going on is https://arxiv.org/abs/1411.6669. There we showed that in the original Hybrid Monte Carlo scheme, where the numerical Hamiltonian trajectory is used as a Metropolis proposal, the optimal step size is the one that yields an average Metropolis acceptance probability of 0.8.
Neither NUTS nor the current implementation of Hamiltonian Monte Carlo used in Stan, however, is a Hybrid Monte Carlo scheme. Instead they sample from an entire trajectory whose length is grown dynamically around the initial point, so the above analysis isn't immediately applicable.
In order to use the above step size optimization criterion one has to come up with a proxy Metropolis acceptance probability (incidentally, this is why it's called "accept_stat" and not "accept_prob" in the diagnostics). The current version of Stan pretends that each point in a given numerical trajectory could have been a Metropolis proposal and averages the corresponding Metropolis acceptance probabilities uniformly.
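As a minimal sketch of that uniform averaging (in Python rather than Stan's actual C++; the function name and interface here are illustrative, not Stan's):

```python
import numpy as np

def accept_stat_uniform(H0, H_traj):
    """Current proxy acceptance statistic (sketch): treat every state in
    the numerical trajectory as if it were a Metropolis proposal and
    average the resulting acceptance probabilities uniformly.

    H0     -- Hamiltonian at the initial state
    H_traj -- Hamiltonian at each state of the numerical trajectory
    """
    # Metropolis acceptance probability for each fake proposal:
    # min(1, exp(-(H_i - H_0)))
    accept_probs = np.minimum(1.0, np.exp(H0 - np.asarray(H_traj)))
    return float(accept_probs.mean())
```

For an exact integrator the Hamiltonian is conserved, every fake acceptance probability is 1, and the statistic is 1; numerical error in the trajectory pulls it below 1.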
We do not, however, sample states within a numerical trajectory uniformly in the algorithm. Instead we sample them based on their weights, e^{-H(q, p)}. By weighting the fake Metropolis acceptance probabilities uniformly we overemphasize those states that are unlikely to be chosen, and end up with a conservative proxy acceptance statistic.
This new branch considers each state as a proxy proposal but weights the corresponding acceptance probabilities by the state weight, e^{-H(q, p)}. In this way those states that are likely to be chosen have a greater influence on the proxy acceptance statistic, and any state with vanishing weight is not considered at all. This will always lead to a higher proxy acceptance stat for a given step size, which will cause the adaptation to push toward higher step sizes and, for a fixed integration time, cheaper numerical trajectories.
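The reweighted statistic can be sketched the same way (again in Python, with an illustrative interface, not the actual branch's code):

```python
import numpy as np

def accept_stat_weighted(H0, H_traj):
    """Proposed proxy acceptance statistic (sketch): weight each state's
    fake Metropolis acceptance probability by its sampling weight
    exp(-H_i), normalized over the trajectory, so that states which are
    actually likely to be selected dominate the statistic.

    H0     -- Hamiltonian at the initial state
    H_traj -- Hamiltonian at each state of the numerical trajectory
    """
    H = np.asarray(H_traj)
    accept_probs = np.minimum(1.0, np.exp(H0 - H))
    # Shift by the minimum before exponentiating to keep the
    # weights numerically stable, then normalize.
    w = np.exp(-(H - H.min()))
    w /= w.sum()
    return float(np.dot(w, accept_probs))
```

Because the same states that have small acceptance probabilities also have small weights, this weighted average is never below the uniform one, consistent with the adaptation pushing toward larger step sizes.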
The effect of this change should be most evident in higher dimensions, where the change in Hamiltonian values along a numerical trajectory increases. At the same time the optimality analysis is most accurate in higher dimensions (in the paper we utilize an expansion in step sizes, but the coefficients of that expansion implicitly depend on dimension). This parallels the 0.234 analysis of Roberts et al., which is a purely asymptotic result and isn't optimal for lower-dimensional targets.
That said, the real issue here is that forcing the step size to larger values makes the generalized No-U-Turn criterion less accurate. Combined with the multiplicative trajectory expansion this can result in a heavy tail of longer numerical trajectories, especially in lower dimensions. Once again this is confirmed in the empirical results.
Ultimately the theory, which is better posed in higher dimensions, strongly motivates the reweighted acceptance statistic, and this is backed up by empirical results. The cost is lower performance for low-dimensional models, but these models run so quickly that the added cost isn't particularly relevant.
Tomorrow I’ll update all of the tests and submit an issue/PR. Thanks again for the help with the empirical tests @jroon and @bbbales2!