The dual averaging behavior as it interacts with the term buffer (among other things) has been discussed in some detail in the thread Issue with dual averaging. There's a lot of good stuff in that thread; the term buffer issue is discussed specifically beginning at post #35.
Additionally, Stan targets a conservative proxy for the acceptance statistic that should lead to negative bias in step sizes and positive bias in actual acceptance rates:
#### Summary:
The step size adaptation targets a suboptimal "acceptance statistic".
#### Description:
About two and a half years ago issue #2789 pointed out a problem with the calculation of the adaptation target `accept_stat`. The issue was closed after a pull request with a proposed fix was completed, but unfortunately the pull request was rejected and the problem is still present in Stan 2.29. This also [came up on the forum last summer](https://discourse.mc-stan.org/t/acceptance-ratio-in-nuts/22752).
Background reading: https://arxiv.org/abs/1411.6669 finds the optimal stepsize for static HMC. In short: the ratio of accepted proposals to gradient evaluations is maximized when the average Metropolis acceptance probability is 0.8.
The acceptance probability is related to the energy error in the trajectory
```
w = exp(-H(q,p))
accept_prob = min(1,w_2/w_1)
```
where `w_1` is the Boltzmann weight of the initial point and `w_2` of the final point. The sampler selects the candidate sample with probability `accept_prob` and adaptation adjusts the step size up or down depending on whether `accept_prob` is above or below 0.8.
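In code, the same calculation in terms of the Hamiltonians of the initial and final points would be roughly (a minimal sketch, not Stan's implementation):
```
import numpy as np

def metropolis_accept_prob(H_initial, H_final):
    # w = exp(-H), so min(1, w_2/w_1) = min(1, exp(H_initial - H_final)).
    return min(1.0, np.exp(H_initial - H_final))

# A proposal that raises the energy by 0.5 is accepted with
# probability exp(-0.5) ~= 0.61; one that lowers it is always accepted.
print(metropolis_accept_prob(10.0, 10.5))
print(metropolis_accept_prob(10.0, 9.5))
```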
Before we try to apply this to Stan, I'll mention one thing that bugs me about this use of Metropolis acceptance probability as the adaptation target for static HMC. And that is: a significant number (about 1/3) of transitions have `w_2 > w_1` and an acceptance probability saturating to 1. But the magnitude of the energy error should be informative even in this case; it would be more efficient to not ignore it.
To remedy this recall that the transition kernel is reversible: if q_1 -> q_2 is possible then so is q_2 -> q_1 and the rates at which these will happen are proportional to the Boltzmann weights w_1 and w_2. Taking the weighted average of acceptance probabilities in _both_ directions gives a symmetric acceptance statistic
```
accept_stat = 2*min(w_1, w_2)/(w_1 + w_2)
```
which has the same mean as the saturating acceptance probability but smaller variance.
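Here's a quick numerical check of that claim (my own sketch): weighting the forward and backward Metropolis probabilities by the rates at which the two directions occur reproduces the symmetric statistic exactly, so the means agree.
```
import numpy as np

def metropolis_prob(w_from, w_to):
    # Saturating Metropolis acceptance probability for one direction.
    return min(1.0, w_to / w_from)

def symmetric_stat(w1, w2):
    # Symmetric acceptance statistic: 2*min(w_1, w_2)/(w_1 + w_2).
    return 2.0 * min(w1, w2) / (w1 + w2)

rng = np.random.default_rng(0)
for _ in range(5):
    w1, w2 = rng.exponential(size=2)
    # In equilibrium the pair is traversed q_1 -> q_2 at a rate ~ w_1 and
    # q_2 -> q_1 at a rate ~ w_2, so the weighted average acceptance is:
    weighted = (w1 * metropolis_prob(w1, w2) + w2 * metropolis_prob(w2, w1)) / (w1 + w2)
    assert np.isclose(weighted, symmetric_stat(w1, w2))
```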
Now, of course, Stan's sampler is not static HMC. Instead of considering only the last point for acceptance or rejection, Stan chooses a Boltzmann-weighted multinomial sample from the entire trajectory, with a bias towards the last doubling. The adaptation strategy is similar, but now the acceptance statistic cannot simply be a Metropolis acceptance probability.
The original [Hoffman & Gelman NUTS paper](https://arxiv.org/abs/1111.4246) proposed using the average of `accept_prob` for all points in the last half of the trajectory as the proxy accept statistic, and that's how Stan calculated it until #352 changed it to be the average over all points in the entire trajectory (even the unconditionally rejected ones, should the trajectory terminate early).
I'm not sure which is better. On one hand, short jumps that don't reach the last doubling should not be ignored; on the other hand, they contribute much less to the effective sample size, and I don't think you should count them as full accepted samples when trying to maximize the number of accepted samples per unit of computational effort.
Either way, I'm going to point out that--just like with static HMC--you can reduce the variance without changing the mean by replacing the Metropolis probability with the symmetric acceptance statistic.
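To make the two variants concrete, here's a rough sketch (helper names are mine) of the trajectory-averaged proxy, with and without the symmetrization; `H0` is the energy of the initial point and `Hs` the energies of the trajectory states.
```
import numpy as np

def avg_metropolis_stat(H0, Hs):
    # Average of the saturating Metropolis probabilities over the
    # trajectory, roughly the current (post-#352) proxy described above.
    return float(np.mean(np.minimum(1.0, np.exp(H0 - np.asarray(Hs)))))

def avg_symmetric_stat(H0, Hs):
    # Same average, but with the symmetric acceptance statistic:
    # same expectation, smaller variance.
    w0 = np.exp(-H0)
    ws = np.exp(-np.asarray(Hs))
    return float(np.mean(2.0 * np.minimum(w0, ws) / (w0 + ws)))
```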
Next up, issue #2789. The proposal there is that instead of a simple arithmetic average, accept_stat should be calculated as a _weighted_ average. This makes some amount of sense but there are two issues that concern me.
1. the proposed weights are Boltzmann weights _without_ considering the last-doubling bias so they do not really correspond to selection probabilities
2. the problem of counting short jumps as full acceptances may be even worse if very short distances tend to have smaller energy errors (and hence, higher weights)
Also, it is not obvious whether the statistic to average should be the (saturating) Metropolis acceptance probability or the symmetrized acceptance statistic--unlike with the previous accept_stat proxies these give different results with Boltzmann weighting.
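For comparison, my reading of a #2789-style weighted average would look something like the following (a sketch, not the actual pull request code); note that the choice of per-point statistic now changes the result.
```
import numpy as np

def weighted_avg_stat(H0, Hs, symmetric=False):
    # Boltzmann-weighted average over the trajectory, without the
    # last-doubling bias (the first concern above).
    w0 = np.exp(-H0)
    ws = np.exp(-np.asarray(Hs))
    if symmetric:
        per_point = 2.0 * np.minimum(w0, ws) / (w0 + ws)
    else:
        per_point = np.minimum(1.0, ws / w0)
    return float(np.sum(ws * per_point) / np.sum(ws))
```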
Finally, let's consider an alternative idea: what if you use the probability of selecting from the last doubling as the accept_stat proxy?
The sampler implements the last-doubling bias in a very Metropolis-like fashion: if the total weight of the second subtree is greater than that of the first, select the sample from the second subtree; otherwise select it from there with probability W_2/W_1 (where the `W`s are Boltzmann weights summed over each subtree). You could use this probability as the adaptation proxy, with the rationale that the last doubling is what gives you most of the effective samples.
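A sketch of that proxy (function name is mine), given the energies of the points in the first and second subtrees:
```
import numpy as np

def last_doubling_select_prob(H_first, H_second):
    # Probability that the biased progressive sampling described above
    # picks the draw from the second subtree (the last doubling).
    W1 = np.sum(np.exp(-np.asarray(H_first)))
    W2 = np.sum(np.exp(-np.asarray(H_second)))
    return min(1.0, W2 / W1)
```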
There are two problems with this idea:
1. there's no theoretical guidance for selecting the optimal target value--the aforementioned static HMC analysis only applies to pointwise energy errors, but this new accept_stat depends on trajectory-averaged Boltzmann weights
2. empirically, it's totally unsuitable as an adaptation target because it's wildly non-monotonic

I think the non-monotonicity in this plot mostly results from the symmetry of the 50-dimensional normal distribution. In a normal distribution the sampler moves in circles around the mean; the energy error at first increases until the simulated particle is 90 degrees from its starting point, then decreases again, returning to almost zero once the particle has travelled 180 degrees around. If the step size is a power-of-two fraction of a full cycle, then the two halves of the final trajectory are always symmetrical and have equal weight regardless of the pointwise energy errors within them.
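To illustrate (my own toy example, not from the issue): for a one-dimensional standard normal the leapfrog trajectory is a near-circular orbit, and the pointwise energy error grows until about 90 degrees of phase and shrinks back to nearly zero at about 180 degrees.
```
import numpy as np

def leapfrog_energy_errors(q, p, eps, n_steps):
    # Leapfrog on H = (q^2 + p^2)/2, i.e. a standard normal target.
    H0 = 0.5 * (q**2 + p**2)
    errors = []
    for _ in range(n_steps):
        p -= 0.5 * eps * q   # half momentum step (dU/dq = q)
        q += eps * p         # full position step
        p -= 0.5 * eps * q   # half momentum step
        errors.append(0.5 * (q**2 + p**2) - H0)
    return np.array(errors)

# Start at q=0, p=1; with eps=0.6 a half orbit takes about 5 steps.
errs = leapfrog_energy_errors(q=0.0, p=1.0, eps=0.6, n_steps=16)
print(np.round(errs, 3))  # error rises, dips back near zero, and repeats
```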
So, what now? The last-doubling acceptance rate is an attractive proxy for effective sample size per computational cost (in fact, in the above example the effective sample size also goes up and down in sync with fluctuations in the acceptance rate), if only it could somehow be made monotonic. One possible monotonic proxy can be created by taking the symmetric last-doubling acceptance statistic
```
accept_stat = 2*min(W_1, W_2)/(W_1 + W_2)
```
and applying Jensen's inequality to derive a lower bound for it
```
lower_bound_stat = 2*sum_i(min(w_1, w_i))/(W_1 + W_2)
```
where `w_1` now denotes the average weight in the first subtree and `i` ranges over the points of the second subtree.
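Putting the two last-doubling statistics side by side (a sketch with my own names, assuming both subtrees have the same number of points, as in a completed doubling):
```
import numpy as np

def last_doubling_stats(H_first, H_second):
    w_first = np.exp(-np.asarray(H_first))
    w_second = np.exp(-np.asarray(H_second))
    W1, W2 = w_first.sum(), w_second.sum()
    # Symmetric last-doubling acceptance statistic.
    accept_stat = 2.0 * min(W1, W2) / (W1 + W2)
    # Lower bound: with equal subtree sizes, W1 = sum_i w1_bar, and
    # min(sum_i w1_bar, sum_i w_i) >= sum_i min(w1_bar, w_i).
    w1_bar = w_first.mean()
    lower_bound_stat = 2.0 * np.sum(np.minimum(w1_bar, w_second)) / (W1 + W2)
    return accept_stat, lower_bound_stat
```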
#### Current Version:
v2.29.0