Trade-off between adapt_delta (+ divergences) and autocorrelation?

I am observing a phenomenon using the NUTS sampler that involves a trade-off between the following:

adapt_delta, mean accept stat, divergences
autocorrelation, ESS

In words, oftentimes I have chains that converge to (seemingly) stationary time series with low autocorrelation, and thus high effective sample size. These chains have 10-20% divergences, which I understand from Robust Statistical Workflow with RStan can be false positives. Following along with that guide, I increase adapt_delta and get fewer divergences, but all of a sudden my ESS falls from about 50% to <10%.

Even without increasing adapt_delta, some chains naturally result in higher mean accept stats and are heavily autocorrelated (<10% ESS) but have few divergences.

I’m curious how to rationalize this tradeoff and what next steps I should take in sampler optimization. I can post my model but it’s kind of annoyingly complicated and I don’t want to distract from the core of the problem.

I tend to think of it as trying to get the highest ESS (per unit time, per iteration, etc.) subject to the constraint that you do not get any divergent transitions. It is not as if having X divergent transitions is worth Y additional effective samples, because the presence of divergent transitions means the sampler tried and failed to get into part of the parameter space. You don’t know how much posterior mass is down that tunnel it couldn’t get into because its step size was too big, so you don’t know how much bias you have in your draws.


10-20% divergences is so much that it’s very unlikely they are all false positives, and it’s very likely that the posterior estimates are biased.

Which ESS computation do you use?



I use n_eff, which is found in summary(fit). n_eff < 10% typically coincides with Bulk and Tail ESS warnings in my experience, although I don’t know which of the two it represents.

On that topic, however, I have an additional question. I do not understand the difference between the “avg. (partial) autocorrelation” computed by Stan and the sample (partial) autocorrelation computed by other packages. Using coda’s effectiveSize and base R’s acf function, I get totally different values than Stan (and noticeably “better” ones, i.e. higher ESS and lower autocorrelation). I assume Stan’s is more suited to Hamiltonian Monte Carlo, but what’s the high-level difference?

Unfortunately we’re in transition: `summary(fit)` is showing the old estimate and `monitor(fit)` is showing the new estimates, that is, Bulk-ESS and Tail-ESS.

No practical difference in the autocorrelation function computations, but there are differences in how these are transformed to effective sample size. See the Effective sample size section in Rank-normalization, folding, and localization: An improved R̂ for assessing convergence of MCMC.

coda is doing multi-chain ESS wrong: it is not doing splitting and, if I remember correctly, is not using Geyer’s truncation rule. I don’t know how you would use acf for multiple chains.
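To make the difference concrete, here is a simplified sketch in Python/NumPy of split-chain ESS with Geyer’s initial-positive-sequence truncation. This is not Stan’s exact implementation (which also combines within- and between-chain variances via the R̂ components), and `split_ess` is an illustrative name, but it shows the two ingredients coda lacks: splitting and the truncation rule.

```python
import numpy as np

def autocorr(x):
    """Sample autocorrelation function of a single chain."""
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n
    return acov / acov[0]

def split_ess(chains):
    """Simplified split-chain ESS with Geyer's truncation rule.

    Each chain is split in half (so a trend in the chain inflates the
    variance estimate), the per-chain ACFs are averaged, and the lag sum
    is truncated where the paired sums rho[2t] + rho[2t+1] first go
    non-positive (Geyer's initial positive sequence).
    """
    halves = []
    for c in np.atleast_2d(np.asarray(chains, dtype=float)):
        half = len(c) // 2
        halves.append(c[:half])
        halves.append(c[half:2 * half])
    halves = np.asarray(halves)
    m, n = halves.shape
    rho = np.mean([autocorr(c) for c in halves], axis=0)
    # tau = 1 + 2 * sum_t rho[t], accumulated in Geyer's pairs
    tau = -1.0
    t = 0
    while 2 * t + 1 < n:
        pair = rho[2 * t] + rho[2 * t + 1]
        if pair <= 0:
            break
        tau += 2.0 * pair
        t += 1
    return m * n / max(tau, 1e-8)
```

For near-independent chains this returns roughly the total number of draws, while a strongly autocorrelated chain (e.g. AR(1) with coefficient 0.95) comes out far lower; coda’s per-chain summation without splitting or this truncation is why its numbers look “better”.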

Bulk-ESS and Tail-ESS are also different from ESS for the mean. See the above-mentioned paper for an explanation of why we don’t recommend ESS for the mean as a default.
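As a rough illustration of the rank-normalization step behind Bulk-ESS (a sketch of just the transform from that paper, not the full algorithm, which also uses folding for Tail-ESS):

```python
import numpy as np
from statistics import NormalDist

def rank_normalize(chains):
    """Replace draws by normal scores of their pooled ranks.

    Uses the fractional offsets from the rank-normalization paper:
    z = Phi^{-1}((rank - 3/8) / (S + 1/4)), with ranks pooled over all
    chains. Bulk-ESS is then the usual split-chain ESS computed on
    these z values, which makes the diagnostic invariant to monotone
    transformations and robust to heavy tails / infinite variance.
    """
    chains = np.asarray(chains, dtype=float)
    flat = chains.ravel()
    ranks = flat.argsort().argsort() + 1  # pooled ranks, 1..S
    nd = NormalDist()
    z = np.array([nd.inv_cdf((r - 0.375) / (flat.size + 0.25))
                  for r in ranks])
    return z.reshape(chains.shape)
```

Because only ranks enter the computation, draws from a Cauchy-like posterior get the same Bulk-ESS as any monotone transform of them, whereas ESS for the mean is not even well defined when the variance is infinite.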

The algorithm used in Stan is more suited for any MCMC than the previous version. See that paper for more.

It seems your posterior has a region that is difficult for the sampler, and that’s why you get these autocorrelations. The reason there is a “trade-off” between divergences and ESS per sample is that when a divergence is detected, the sampler stops trying to cope with this difficult region. So whenever the sampler detects a divergence, it just skips the problematic part of the posterior, the part which leads to the low ESS per sample. You get higher ESS, but the returned sample will be wrong because part of the posterior was left out.

Usually this means that you need to (try to) reparameterize the model or change the priors in order to make the shape of the posterior less difficult for the sampler.
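As one common example (a hypothetical hierarchical model, not the poster’s actual model): if a centered parameterization like `theta ~ normal(mu, tau)` produces a funnel the sampler cannot traverse when `tau` is small, the non-centered version samples a standard normal and scales it:

```stan
data {
  int<lower=1> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;                     // instead of vector[J] theta
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;  // implies theta ~ normal(mu, tau)
}
model {
  theta_raw ~ std_normal();                // replaces theta ~ normal(mu, tau)
  y ~ normal(theta, sigma);
}
```

This removes the direct dependence between tau and the prior geometry of theta, which is often exactly the region the sampler was diverging in.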