Possibility of using the dual averaging technique for the whole sample (not only during warmup)


Dear experts,

Is it possible to run the dual averaging algorithm in HMC and NUTS for the whole sampling run, not just during warmup?
(For example, if we want to hold the acceptance rate at a certain value.)

Would that cause any problem with ergodicity, or disturb the Markov chain’s stationary distribution?

Any help would be greatly appreciated.


What are you actually trying to do?


I’m trying to keep the HMC acceptance rate around a certain value (0.65) for the whole run, which matters for controlling the number of model calls in my problem.

Does that disturb the Markov chain’s stationary distribution by any chance?


Yes, because Stan can no longer calculate the correct acceptance statistic, so the forward and reverse simulation in the integrator no longer represents a reversible path (that may not be exactly the right term). You can set adapt_delta (I think that’s the one), the target acceptance rate used during warmup, to a variety of values, and that should get you close to what you want.
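For reference, here is a minimal sketch of the warmup-only convention: dual averaging (along the lines of the scheme in the Hoffman and Gelman NUTS paper) tunes the step size toward a target acceptance rate during warmup, and the step size is then frozen so the retained draws come from one fixed kernel. This is plain HMC on a standard normal, not Stan's implementation. The function names, the 10-dimensional target, the jittered number of leapfrog steps, and the constants gamma = 0.05, t0 = 10, kappa = 0.75 are all assumptions made for illustration.

```python
import math
import random

def leapfrog(q, p, eps, n_steps):
    """Leapfrog integrator for a standard normal target, where grad U(q) = q."""
    p = [pi - 0.5 * eps * qi for pi, qi in zip(p, q)]  # initial half step
    for step in range(n_steps):
        q = [qi + eps * pi for qi, pi in zip(q, p)]
        scale = eps if step < n_steps - 1 else 0.5 * eps  # final half step
        p = [pi - scale * qi for pi, qi in zip(p, q)]
    return q, p

def energy(q, p):
    return 0.5 * sum(x * x for x in q) + 0.5 * sum(x * x for x in p)

def hmc_step(q, eps, rng):
    """One HMC transition; returns the new state and the acceptance probability."""
    p0 = [rng.gauss(0.0, 1.0) for _ in q]
    n_steps = rng.randint(5, 15)  # assumption: jittered path length
    q1, p1 = leapfrog(q, p0, eps, n_steps)
    dh = energy(q1, p1) - energy(q, p0)
    # Treat a numerically exploded trajectory as a certain rejection.
    alpha = math.exp(min(0.0, -dh)) if math.isfinite(dh) else 0.0
    return (q1 if rng.random() < alpha else q), alpha

def run(n_warmup=1000, n_draws=2000, delta=0.65, dim=10, seed=1):
    rng = random.Random(seed)
    q = [0.0] * dim
    eps = 1.0                            # assumed initial step size
    mu = math.log(10.0 * eps)            # shrinkage point for log(eps)
    gamma, t0, kappa = 0.05, 10.0, 0.75  # assumed tuning constants
    h_bar, log_eps_bar = 0.0, 0.0

    # Warmup: adapt eps so the mean acceptance probability approaches delta.
    for m in range(1, n_warmup + 1):
        q, alpha = hmc_step(q, eps, rng)
        h_bar += (delta - alpha - h_bar) / (m + t0)
        log_eps = mu - math.sqrt(m) / gamma * h_bar
        w = m ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
        eps = math.exp(log_eps)

    # Sampling: freeze the averaged step size; no further adaptation.
    eps = math.exp(log_eps_bar)
    draws, alphas = [], []
    for _ in range(n_draws):
        q, alpha = hmc_step(q, eps, rng)
        draws.append(q)
        alphas.append(alpha)
    return eps, sum(alphas) / n_draws, draws

eps, accept_rate, draws = run()
print(f"frozen step size: {eps:.3f}, mean acceptance after warmup: {accept_rate:.3f}")
```

The acceptance rate during sampling will not sit exactly at the target, but it should land in the neighbourhood, which is the same effect you get from choosing adapt_delta rather than adapting forever.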


Thank you very much for your helpful point, Sakredja.

I have another question, unrelated to my first one, but I’m very curious to know the answer.
We have a 1-D likelihood function (with mean zero) and a multivariate standard normal prior.
If we use a sequence of sigma values for the likelihood (basically shrinking the likelihood’s sigma from 1 to 0.3 by exponential decay), does that also disturb the Markov chain’s stationary distribution?

Does it change the mean of the posterior distribution?

We are only doing this to capture sample points from the region we are interested in.

Any help would be greatly appreciated.
Thank you!


In general, basically “yes”, but that brings back the original question: what are you actually trying to do? Please just start a new thread; it’ll get lost down here.


The dimensionality of \theta in prior p(\theta) and likelihood p(y | \theta) should match.
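As a toy illustration of why a shrinking sigma matters, assume a scalar \theta with prior N(0, 1) and a single observation y with likelihood N(y | \theta, \sigma^2). This conjugate model, the value y = 1, and the five-step schedule are assumptions for the sketch, not the actual setup in the question. Under these assumptions the posterior is N(y / (1 + \sigma^2), \sigma^2 / (1 + \sigma^2)), so every value of sigma defines a different target.

```python
def posterior(y, sigma):
    """Posterior (mean, variance) for prior theta ~ N(0, 1), y ~ N(theta, sigma^2)."""
    var = sigma ** 2 / (1.0 + sigma ** 2)
    mean = y / (1.0 + sigma ** 2)
    return mean, var

def sigma_schedule(start, end, n):
    """Exponential-decay (geometric) schedule from start to end in n steps."""
    return [start * (end / start) ** (k / (n - 1)) for k in range(n)]

y = 1.0  # hypothetical observed value
sigmas = sigma_schedule(1.0, 0.3, 5)
targets = [posterior(y, s) for s in sigmas]
for s, (m, v) in zip(sigmas, targets):
    print(f"sigma={s:.3f}  posterior mean={m:.3f}  posterior var={v:.3f}")
```

In this toy case both the posterior mean and variance move as sigma shrinks, so a chain that changes sigma while it runs is not sampling from any single one of these distributions.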

The answer to whether an adaptation scheme preserves the stationary distribution is almost always “no” unless it is done very carefully (which almost never matches intuitions about what a good method would look like). Specifically, you need to prove that any MCMC algorithm you devise preserves the correct stationary distribution (usually the posterior but for Stan, always the log density defined by the Stan program). Fair warning: it’s not easy, which is why NUTS was such a breakthrough.

The easiest way to do that is through detailed balance. Guessing isn’t a good strategy in this business. The usual approach is to start with Metropolis and learn why that satisfies detailed balance, then go on to Metropolis-Hastings and Gibbs. Then basic HMC is just an instance of Metropolis-Hastings. You can then look at some of the adaptive Metropolis algorithms, which go in the direction you’re asking about. For NUTS, the Hoffman and Gelman paper does a good job explaining all the steps required for maintaining detailed balance.
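To make the detailed balance condition concrete, here is a small sketch on a four-state discrete target, where the condition \pi_i P_{ij} = \pi_j P_{ji} can be checked exactly from the transition matrix. The target weights, the nearest-neighbour proposal, and the function names are all made up for illustration. It compares plain Metropolis with a symmetric proposal, the same acceptance rule applied naively to a lopsided proposal, and the Metropolis-Hastings correction for that lopsided proposal.

```python
def nearest_neighbour_proposal(n, p_right):
    """Propose i+1 with probability p_right, else i-1; stay put at the edges."""
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i + 1 < n:
            Q[i][i + 1] = p_right
        else:
            Q[i][i] += p_right
        if i - 1 >= 0:
            Q[i][i - 1] = 1.0 - p_right
        else:
            Q[i][i] += 1.0 - p_right
    return Q

def transition_matrix(pi, Q, hastings=False):
    """Transition matrix of the accept/reject chain built from proposal Q."""
    n = len(pi)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or Q[i][j] == 0.0:
                continue
            ratio = pi[j] / pi[i]
            if hastings:
                ratio *= Q[j][i] / Q[i][j]  # Hastings correction
            P[i][j] = Q[i][j] * min(1.0, ratio)
        P[i][i] = 1.0 - sum(P[i])  # rejected or edge proposals stay at i
    return P

def balance_violation(pi, P):
    """Largest |pi_i P_ij - pi_j P_ji| over all pairs of states."""
    n = len(pi)
    return max(abs(pi[i] * P[i][j] - pi[j] * P[j][i])
               for i in range(n) for j in range(n))

def stationarity_violation(pi, P):
    """Largest |(pi P)_j - pi_j| over all states."""
    n = len(pi)
    return max(abs(sum(pi[i] * P[i][j] for i in range(n)) - pi[j])
               for j in range(n))

pi = [0.1, 0.2, 0.3, 0.4]  # hypothetical target distribution
P_metropolis = transition_matrix(pi, nearest_neighbour_proposal(4, 0.5))
P_naive = transition_matrix(pi, nearest_neighbour_proposal(4, 0.7))
P_mh = transition_matrix(pi, nearest_neighbour_proposal(4, 0.7), hastings=True)

for name, P in [("Metropolis", P_metropolis), ("naive", P_naive), ("MH", P_mh)]:
    print(name, balance_violation(pi, P), stationarity_violation(pi, P))
```

The naive variant looks like a harmless tweak to the proposal, yet it no longer leaves the target invariant, which is the kind of failure an unproven adaptation scheme can introduce without any visible error.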


Thank you so much for the helpful hint about the detailed balance.