Confused about accept_stat__ and delta

I am tweaking a model and realized that I am confused about something fundamental. Reading the original Hoffman & Gelman (2014) NUTS paper, I was under the impression that what Stan calls accept_stat__ is the \alpha / n_\alpha of the paper, which the dual averaging adapts toward \delta during the initial (warmup) phase, and that this \delta is in turn Stan's delta (adapt_delta).

Yet for pretty much everything I run, the average accept_stat__ is usually much higher than delta = 0.8 (the default). What am I missing?

E.g. for the simplest of models (using CmdStan 2.20.0):

Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (0.0085) seconds, 0.0085 seconds total
Sampling took (0.016) seconds, 0.016 seconds total

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s    R_hat
lp__            -7.2  2.8e-02  6.3e-01   -8.4  -7.0  -6.7    509    32235  1.0e+00
accept_stat__   0.94  2.7e-03  9.8e-02   0.73  0.98   1.0   1268    80282  1.0e+00
stepsize__      0.88  3.8e-15  3.8e-15   0.88  0.88  0.88    1.0       63  1.0e+00
treedepth__      1.4  1.6e-02  5.1e-01    1.0   1.0   2.0    975    61708  1.0e+00
n_leapfrog__     2.9  5.8e-02  1.6e+00    1.0   3.0   7.0    785    49698  1.0e+00
divergent__     0.00  0.0e+00  0.0e+00   0.00  0.00  0.00    500    31654     -nan
energy__         7.7  4.2e-02  9.2e-01    6.8   7.4   9.6    490    31008  1.0e+00
theta           0.24  4.9e-03  1.1e-01  0.087  0.23  0.44    502    31763  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at 
convergence, R_hat=1).

For the above example, the Bernoulli model is too easy to fit, so it’s not a good test case.

@betanalpha has been discussing this issue and is working on tweaks -

Thanks for linking that issue, but I am asking about something much more mechanical.

Namely, if the dual averaging is supposed to adapt accept_stat__ to values around \delta (which is 0.8 by default), why am I seeing consistently higher accept_stat__ values in most models?

This is not about Bernoulli, I am seeing this more generally (it was just the simplest example).

My understanding is that this question is orthogonal to whether the current accept_stat__ is the right statistic. It is only about something that in practice ends up being higher than I expected.

Yes, adapt_delta is the target. Yes, we often overshoot it. I agree it’s confusing.

There are a couple of things going on. In simple models, it’s hard to tune to a low accept stat (where “low” can mean < 0.9 or something); you get good acceptance until a critical step size where things just blow up. Keeping right on that knife’s edge is tricky, and we tend to err on the side of higher acceptance to avoid divergences.
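To see that knife’s edge in a toy setting, here is a small self-contained sketch (plain Python, nothing to do with Stan’s actual code): static HMC on a one-dimensional standard normal, averaging the Metropolis acceptance probability over many draws for a few step sizes. The number of leapfrog steps is randomized per draw only to average over trajectory-length resonances. For this target with a unit metric the leapfrog integrator becomes unstable at a step size of 2, so acceptance stays high for small step sizes, degrades as the step size grows, and collapses once the step size crosses that limit and trajectories diverge.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_accept(eps, n_draws=2000, max_steps=16):
    """Average Metropolis acceptance for static HMC on a 1-d standard normal."""
    accepts = []
    for _ in range(n_draws):
        q = rng.normal()              # start from the target N(0, 1)
        p = rng.normal()              # fresh momentum ~ N(0, 1)
        h0 = 0.5 * (q * q + p * p)    # initial Hamiltonian
        for _ in range(rng.integers(1, max_steps + 1)):
            p -= 0.5 * eps * q        # leapfrog; potential U(q) = q^2 / 2
            q += eps * p
            p -= 0.5 * eps * q
        h1 = 0.5 * (q * q + p * p)
        accepts.append(min(1.0, float(np.exp(h0 - h1))))
    return float(np.mean(accepts))

for eps in [0.1, 0.5, 1.0, 1.5, 1.8, 1.95, 2.1]:
    print(f"step size {eps:4.2f}: mean acceptance {avg_accept(eps):.3f}")
```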

In more complex models, especially with varying curvature, it’s hard to tune on average over the posterior—the problem’s as hard as sampling from the posterior to begin with.


How is this implemented? I.e., if adapt_delta = 0.8 is the target, which part of the code or the algorithm keeps it around 0.9 for the cases you mention?

(sorry if these are obvious questions, I can’t figure them out by looking at the code)

implemented here:

One has to keep in mind that the objective of the dual averaging here is an expectation value, and the utility of the tuning is only as good as the precision of the expectation value estimates.

The objective expectation value is that of a particular acceptance statistic which captures hypothetical acceptance probabilities (the details aren’t important, but like many of the sampler’s details, what is currently used in Stan differs from what is in the NUTS paper). What is important is that the variance of this acceptance statistic can be quite large, which induces noise into the dual averaging even under ideal conditions.
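To make the noisy-objective point concrete, here is a minimal sketch of the dual-averaging recursion as written in Hoffman & Gelman (2014) (their default constants γ = 0.05, t0 = 10, κ = 0.75 and the log(10·ε0) shrinkage point), driven by a purely synthetic stand-in for the acceptance statistic. It is not Stan’s adaptation code; it only shows that the update sees one noisy draw per iteration, so the adapted step size, and the acceptance it ends up achieving, scatters around the target rather than hitting it exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_accept(eps):
    """Synthetic per-iteration acceptance statistic: decreasing in eps on
    average, but with large iteration-to-iteration variance."""
    return min(1.0, float(np.exp(-eps * eps * abs(rng.normal()))))

def dual_average(delta=0.8, eps0=1.0, n_iter=75,
                 gamma=0.05, t0=10.0, kappa=0.75):
    """Dual-averaging recursion from the NUTS paper with its default constants.
    n_iter is kept small, roughly comparable to Stan's default term_buffer,
    where the noise matters most."""
    mu = np.log(10.0 * eps0)          # shrinkage point tied to the initial step size
    h_bar = 0.0
    log_eps_bar = 0.0
    eps = eps0
    for t in range(1, n_iter + 1):
        alpha = noisy_accept(eps)                     # one noisy observation per iteration
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (delta - alpha)
        log_eps = mu - np.sqrt(t) / gamma * h_bar     # adapt on the log scale
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
        eps = np.exp(log_eps)
    return float(np.exp(log_eps_bar))

# Rerun the whole adaptation several times: both the tuned step size and the
# acceptance it actually achieves vary from run to run around the target of 0.8.
for _ in range(5):
    eps = dual_average()
    realized = np.mean([noisy_accept(eps) for _ in range(5000)])
    print(f"adapted step size {eps:.3f} -> realized mean acceptance {realized:.3f}")
```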

There are additional considerations that bias the dual averaging to overestimate the optimal step size. Firstly, the dual averaging itself mixes the empirical updates with a particular initial value; if the last metric update is good then this initial value will underestimate the optimal step size and pull the objective to values larger than the target. Secondly, the dual averaging actually tunes the log step size, and the asymmetry of the log transformation biases symmetric noise to larger values.

One can run a longer terminal adaptation window by setting num_warmup=1000+X and term_buffer=50+X to give the final step size tuning more time to achieve a better value, but it’s unlikely to have a huge influence in practice. The dual averaging is most accurate for higher-dimensional models, so the opportunity for improving the adaptation is limited to simpler models, which already fit really fast.
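For reference, with CmdStan (as in the output above) and, say, X = 200, that call might look something like the following; argument names follow CmdStan’s sample/adapt configuration, the bernoulli example is just a placeholder, and everything else is left at its defaults:

```
./bernoulli sample num_warmup=1200 \
            adapt delta=0.8 term_buffer=250 \
            data file=bernoulli.data.R \
            output file=output.csv
```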


Thanks for elaborating on the algorithm.

I hadn’t realized there were factors beyond the difficulty of tuning expectations from noisy estimates and the fact that most step sizes are OK for simple models.

It’s just a consequence of Jensen’s inequality: the log step size is approximately unbiased, but the step size is not. The variance in the adaptation of the log step size is skewed by the exponential map, resulting in the estimated step size being skewed to larger values.
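A quick numerical check of that statement (plain Python, arbitrary numbers): add symmetric, zero-mean noise to a “true” log step size and compare the averages on the two scales.

```python
import numpy as np

rng = np.random.default_rng(2)

log_eps_true = np.log(0.5)                  # pretend 0.5 is the optimal step size
noise = rng.normal(0.0, 0.3, size=100_000)  # symmetric, zero-mean adaptation noise
log_eps_hat = log_eps_true + noise          # unbiased on the log scale

print(np.mean(log_eps_hat))                 # ~ log(0.5) = -0.693: no bias in log space
print(np.mean(np.exp(log_eps_hat)))         # ~ 0.5 * exp(0.3**2 / 2) ~ 0.52 > 0.5: biased upward
```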


Can we just adapt the stepsize instead of log stepsize? That way we have logs of expectations (easy) rather than expectations of logs (hard). We can do the calculations stably the same way we calculate the log of predictive quantities (using log sum exp).

The original algorithm adapted the step size directly, but it would run into issues early on where large updates pushed the step size below zero, or close enough to zero that early warmup iterations would saturate the maximum tree depth. Adapting the log step size proved to be the most robust solution at the time. Anyone is welcome to experiment with techniques for adapting the step size directly, with different regularizations to avoid such issues.
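As a toy caricature of that failure mode (made-up numbers, not Stan’s adaptation code): apply a few large early-warmup corrections additively to the raw step size versus to its logarithm.

```python
import numpy as np

eps_raw = 1.0               # adapt the step size directly
log_eps = np.log(1.0)       # adapt its logarithm instead
learning_rate = 0.9

# pretend the first few warmup iterations all report near-zero acceptance,
# as often happens when the chain starts far out in the tails
for accept in [0.0, 0.0, 0.05, 0.0]:
    correction = learning_rate * (0.8 - accept)   # push toward the target delta = 0.8
    eps_raw -= correction                         # raw-scale update can overshoot zero
    log_eps -= correction                         # log-scale update cannot

print("raw-scale step size:", eps_raw)            # negative, i.e. meaningless
print("log-scale step size:", np.exp(log_eps))    # small but still positive
```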