I am tweaking a model and realized that I am confused about something fundamental. Reading the original Hoffman & Gelman (2014) NUTS paper, I was under the impression that what Stan calls `accept_stat__` is \alpha / n_\alpha in the paper, which is adapted toward \delta during the dual-averaging warmup phase, and that \delta in turn corresponds to Stan's `delta` setting.

Yet for pretty much everything I run, the average `accept_stat__` is usually much higher than `delta = 0.8` (the default). What am I missing?

E.g. for the simplest of models (using 2.20.0):

```
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (0.0085) seconds, 0.0085 seconds total
Sampling took (0.016) seconds, 0.016 seconds total

                 Mean     MCSE   StdDev     5%    50%    95%  N_Eff  N_Eff/s    R_hat
lp__             -7.2  2.8e-02  6.3e-01   -8.4   -7.0   -6.7    509    32235  1.0e+00
accept_stat__    0.94  2.7e-03  9.8e-02   0.73   0.98    1.0   1268    80282  1.0e+00
stepsize__       0.88  3.8e-15  3.8e-15   0.88   0.88   0.88    1.0       63  1.0e+00
treedepth__       1.4  1.6e-02  5.1e-01    1.0    1.0    2.0    975    61708  1.0e+00
n_leapfrog__      2.9  5.8e-02  1.6e+00    1.0    3.0    7.0    785    49698  1.0e+00
divergent__      0.00  0.0e+00  0.0e+00   0.00   0.00   0.00    500    31654     -nan
energy__          7.7  4.2e-02  9.2e-01    6.8    7.4    9.6    490    31008  1.0e+00
theta            0.24  4.9e-03  1.1e-01  0.087   0.23   0.44    502    31763  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).
```
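For concreteness, my understanding of the adaptation I'm describing is the dual-averaging step-size update from the paper (Algorithm 5). This is a toy sketch, not Stan's actual implementation: the `accept_prob` function is a made-up stand-in for the averaged \alpha / n_\alpha statistic that a real sampler would produce, and I'm using the paper's default tuning constants (\gamma = 0.05, t_0 = 10, \kappa = 0.75, \mu = \log(10 \epsilon_0)):

```python
import math

def dual_averaging(accept_prob, delta=0.8, eps0=1.0, iters=1000,
                   gamma=0.05, t0=10.0, kappa=0.75):
    """Dual-averaging step-size adaptation (Hoffman & Gelman 2014, Alg. 5).

    accept_prob: hypothetical map from step size to an acceptance
    statistic in [0, 1] (stands in for alpha / n_alpha).
    Returns the iterate-averaged step size after `iters` warmup steps.
    """
    mu = math.log(10.0 * eps0)   # shrinkage target for log step size
    log_eps = math.log(eps0)     # current log step size
    log_eps_bar = 0.0            # log of the averaged step size
    h_bar = 0.0                  # running average of (delta - accept)
    for t in range(1, iters + 1):
        a = accept_prob(math.exp(log_eps))
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (delta - a)
        log_eps = mu - math.sqrt(t) / gamma * h_bar
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return math.exp(log_eps_bar)

# Toy acceptance curve: acceptance decays as the step size grows,
# so adaptation should settle where accept_prob(eps) is near delta.
eps = dual_averaging(lambda e: math.exp(-e), delta=0.8)
```

With a monotone toy curve like this the averaged step size settles where the acceptance statistic is close to `delta`, which is exactly why I expected the post-warmup average `accept_stat__` to sit near 0.8 rather than well above it.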