I am tweaking a model and realized that I am confused about something fundamental. Reading the original Hoffman & Gelman (2014) NUTS paper, I was under the impression that what Stan calls `accept_stat__` is \alpha / n_\alpha in the paper, which is adapted toward \delta during the dual-averaging warmup phase, and that \delta in turn corresponds to Stan's `delta` setting.

Yet for pretty much everything I run, the average `accept_stat__` is usually much higher than `delta = 0.8` (the default). What am I missing?

E.g. for the simplest of models (using 2.20.0):

```
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took (0.0085) seconds, 0.0085 seconds total
Sampling took (0.016) seconds, 0.016 seconds total

                 Mean     MCSE   StdDev     5%    50%    95%  N_Eff  N_Eff/s    R_hat
lp__             -7.2  2.8e-02  6.3e-01   -8.4   -7.0   -6.7    509    32235  1.0e+00
accept_stat__    0.94  2.7e-03  9.8e-02   0.73   0.98    1.0   1268    80282  1.0e+00
stepsize__       0.88  3.8e-15  3.8e-15   0.88   0.88   0.88    1.0       63  1.0e+00
treedepth__       1.4  1.6e-02  5.1e-01    1.0    1.0    2.0    975    61708  1.0e+00
n_leapfrog__      2.9  5.8e-02  1.6e+00    1.0    3.0    7.0    785    49698  1.0e+00
divergent__      0.00  0.0e+00  0.0e+00   0.00   0.00   0.00    500    31654     -nan
energy__          7.7  4.2e-02  9.2e-01    6.8    7.4    9.6    490    31008  1.0e+00
theta            0.24  4.9e-03  1.1e-01  0.087   0.23   0.44    502    31763  1.0e+00

Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).
```
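For concreteness, my understanding of the adaptation I'm describing is the dual-averaging step-size update from the paper (Algorithm 5). This is a toy sketch, not Stan's actual implementation: the `accept_prob` function is a made-up stand-in for the averaged \alpha / n_\alpha statistic that a real sampler would produce, and I'm using the paper's default tuning constants (\gamma = 0.05, t_0 = 10, \kappa = 0.75, \mu = \log(10 \epsilon_0)):

```python
import math

def dual_averaging(accept_prob, delta=0.8, eps0=1.0, iters=1000,
                   gamma=0.05, t0=10.0, kappa=0.75):
    """Dual-averaging step-size adaptation (Hoffman & Gelman 2014, Alg. 5).

    accept_prob: hypothetical map from step size to an acceptance
    statistic in [0, 1] (stands in for alpha / n_alpha).
    Returns the iterate-averaged step size after `iters` warmup steps.
    """
    mu = math.log(10.0 * eps0)   # shrinkage target for log step size
    log_eps = math.log(eps0)     # current log step size
    log_eps_bar = 0.0            # log of the averaged step size
    h_bar = 0.0                  # running average of (delta - accept)
    for t in range(1, iters + 1):
        a = accept_prob(math.exp(log_eps))
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (delta - a)
        log_eps = mu - math.sqrt(t) / gamma * h_bar
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return math.exp(log_eps_bar)

# Toy acceptance curve: acceptance decays as the step size grows,
# so adaptation should settle where accept_prob(eps) is near delta.
eps = dual_averaging(lambda e: math.exp(-e), delta=0.8)
```

With a monotone toy curve like this the averaged step size settles where the acceptance statistic is close to `delta`, which is exactly why I expected the post-warmup average `accept_stat__` to sit near 0.8 rather than well above it.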