A primary motivation of case study, although my attempt at new, less-confusing terminology is still immature.
In terms of carbon per desired accuracy I’m not so sure.
But what stresses computation enough that it breaks? Often probability density functions that aren’t sufficiently uniformly concentrated.
In an ideal world we would accurately quantify the posterior distribution with some numerical algorithm and then examine it for its inferential utility, identifying degeneracies and indications of poor experimental design (although double ideally that would have been done already with simulations, but that’s another point altogether). In the real world we can’t always accurately quantify the posterior distribution which means that we can’t faithfully examine it for these issues. The miracle, as it were, is that for some algorithms like Hamiltonian Monte Carlo the failures of accurate quantification themselves can provide evidence for the inferential issues directly.
The subtlety here is that “failure” is not necessarily binary; given issues with the default sampler configuration max_treedepth can be expanded and adapt_delta can be increases to give the Hamiltonian trajectories more power to explore at the cost of more computational resources. By pushing the sampler one can try to return to the ideal circumstance of fitting accurately and then investigating the posterior, but that will in general be wasteful if the initial failure is already pointing at the problem!
One way to interpolate between these two extremes is to consider the amount of work that Hamiltonian Monte Carlo has to do as an additional diagnostic. Divergences indicate that the adaptation procedure hasn’t been able to converge to a sampler configuration that can fully resolve the posterior, but even if the adaptation procedure was successful it might end up with a sampler configuration that requires very long, expensive trajectories to explore the details of the posterior distribution. I also discuss this, and provide examples, in my case study.
Hi, I have really enjoyed this discussion. I am curious as to what others think of your comment on ESS. I am going to read the paper now and perhaps I will be able to comment on this comment with an answer to my own question.
If you are just concerned about the mean of a parameter, then you are good with an ESS of about 100; if more precision is needed, then 200-300, but really not more.
For example, with an ESS of 100 an estimation of a mean of a parameter is good while if you want more precision you are seeking 200-300 ESS. Should the default warmup be set and the model being addressed have 4 chains that is 20,000x4 post warmup samples. To me, if the number of effective samples represents only .12 % of the total samples (100/80,000) it seems incredibly low. Perhaps there is confusion on my part in your comment be use you suggested that four chains with 1000 warmup and 1000 monitored samples is usually sufficient. In that case, your suggestion for estimation of mean only requires that the number of effective samples be equal to 2.5% of the total number of monitored iterations.
Exactly. And that’s a shortcoming of HMC as a sampling technique.
But the problem can just be that HMC can’at fit the model. The model can still be reasonable and provide reasonable inferences. The “folk” in @andrewgelman’s use of “folk theorem” literally translates as “not a”.
Doesn’t that make a lot of assumptions about the models? The conclusion in the abstract is only
An immediate outcome of this result is that the suggested target average acceptance probability of 0.651 can be relaxed to 0.6≲a≲0.9 with larger values more robust in practice.
I don’t see how it explains 0.8 as the default acceptance target. I’d have thought 0.9 from the “larger values more robust in practice” part just from the abstract.
And how does the fact that specifying 0.8 acceptance target rarely leads to a 0.8 acceptance? Do we want to tweak so our empirical acceptance is 0.8 rather than just setting adapt_delta = 0.8? I’m genuinely confused by all this theory being used to justify practice.
It’s the party line from almost everyone that works on MCMC. David McKay called for “a dozen”, so we’re a little extreme asking for 100. The only reason we do that is so that our ESS estimators are more stable.
The key to understanding this is the MCMC central limit theorem. The MCMC standard error in estimating the mean of parameter theta is sd(theta) / sqrt(ESS(theta)). So error goes down as O(1 / sqrt(ESS)).
At ESS = 16, the standard error in estimating the mean is sd / 4.
At ESS = 100, it’s sd / 10.
At ESS = 10,000, it’s sd / 100.
You need to multiply ESS by 100 to reduce standard error by a factor of 10.
Given that uncertainty in a parameter theta is dominated by the posterior sd, not error in calculating posterior mean, there’s not much point to going beyond ESS = 100.
We may reduce the defaults. They’re way too large for most models. We really need around ESS > 50 per chain in order to estimate it properly. So I think we need a minimum of ESS = 200 across four chains. So that’s an autocorrelation time of 1000/50 = 20, which isn’t so bad.
What we really want to do is supply an ESS target and run until we hit that. We’re still working on that interface.
Edit: Now I know how to introduce myself with an interesting conversation pitch if I ever run into you :D
This is really interesting and informative. And the conversation certainly yields to the conversation of the initial post referencing a model with 40k iterations. Given the numbers you provide, I imagine that seasoned stan users utilizing complex models with many parameters hypothesize as to the number of iterations needed to hit these benchmarks. Rather than estimating a model with a very large number of iterations that will certainly achieve these desired number of effective samples, but also run all day.
Is it fair to accredit the smaller number needed of effective sample sizes to the NUTS sampling or HMCMC, or perhaps both?
Hopefully there’ll be another in-person StanCon eventually.
Both. HMC is what gets you from the slow diffusive processes of Gibbs or Metropolis to something that follows the flow with the gradient to make directed movement. But, HMC is very hard to tune for integration quantization, integration time, and metric (equivalently, step size, number of steps, and mass matrix).
NUTS is what makes it practical. First, we adapt the metric and the step size during warmup. This is very hard to tune in HMC, but if you can do that, most of the improvement comes from HMC. Stan lets you do that, so you could evaluate.
The second improvement is in number of steps. This also makes a big difference and is why NUTS tends to work better than even a well-tuned version of HMC.
If you want to learn more, check out @betanalpha’s overview:
A shortcoming presuming that the pathologies that induce divergences don’t induce bias in other Markov transitions as well, which so far has proven to be not true.
Perhaps read the paper? If only the main theorem (Theorem 9), figure captions (especially Figure 2), and Section 4 which discusses the practical consequences in detail.
It does for sufficiently long chains.
When there is large variance in the adaption the fact that we adapt log_stepsize (to avoid jumps to negative values early on in adaption when gradients in the tails can be large and the dual averaging allow for big updates) induces a small bias towards larger values via the Jensen inequality.
The final adaptation window can be expanded, making the step size adaptation more accurate, but only at the cost of reducing the final inverse metric adaptation window. The current settings were aimed at robustness.
Moreover some of this is irrelevant given that the proxy acceptance probability used to guide the dual averaging isn’t quite appropriate for multinomial sampling. A pull request to update it to something more theoretically appropriate, backed up by substantial corroborating empirical evidence, has been open since the summer.
It’s also a shortcoming of other MCMC techniques. My point is that it’s not the math or the model, but the model fitting.
I couldn’t get through the math. Figures 1 and 3 seemed to be about Gaussian examples. Figure 2 looked more general, but I didn’t understand the qualification in the text. It seems to say that the results don’t hold for NUTS.
My practical question is how to deal with systems where there’s a hierarchical model with some parameters that are correlated and not well identified. For instance, consider an IRT 2PL model’s discrimination and difficulty parameter for an item. In practice, what happens is that I will find divergences when I have the acceptance rate below 0.95 or something like that and no divergences with the acceptance rate above that. Is that consistent with 0.8 still being optimal? Or is this all qualified by “no divergences in the simulations”? And if so, is that no expected divergences at some level? Presumably with enough iterations, I might see another divergence, as might many other models for which there are no divergences in 4000 iterations. So it seems like I have to just do something like SBC and PPC no matter what I do.
Also, I had a more general question on this topic. Given that an acceptance rate of nearly 100% corresponds to actually simulating the Hamiltonian system accurately, it seems like what we’re doing is introducing a divergence into the Hamiltonian in order to speed up sampling. Won’t there be as much trouble with that being biased by curvature as it is with bigger divergences? Or is there some magic at 65% acceptance that changes things? I suspect that at an 80% acceptance rate on 1 iteration that our simulations would diverge more in subsequent orbits as they already seem to be beyond the stability limit.
It’s an inherent problem to statistical modeling. In practice your model is only as good as your ability to accurately quantify the corresponding posterior distribution, otherwise your inferences are not faithful to the model but instead some bastardization of the model and the computation. Sometimes a given measurement and elicited prior model just isn’t enough to support accurate computation given the tools available.
The problem is you can’t pick and choose the math. For example why are you still assuming that there is still a universal optimal acceptance probability for all target distributions? In random walk Metropolis that is an asymptotic phenomenon for iid normal target densities – for HMC it’s a much more universal approximation due to said math.
As stated in the theorems 0.8 is optimal only if the symplectic integrator is everywhere stable and there are no divergences.
Only if there are neighborhoods with tiny posterior mass but strong target density curvature. In that case the sampler will explore those neighborhoods only rarely and you might not see divergences until the chains have been run for some time. If there are no such neighborhoods – if the curvature is relatively uniform – then there will never be divergences no matter how long you run.
The optimistic view of this situation is that because the divergences are rare the neighborhoods of high curvature have sufficiently small posterior mass that any bias in their exploration will be negligible. From this perspective running and not seeing divergences could be interpreted as “to the resolution of my sampling there are no biases”.
The problem is that this isn’t universally true. Those regions of high curvature but small mass could hide neighborhoods of significant mass behind them, and the inability to explore those regions indicates the inability to transition from the region that’s currently being explored to the unexplored region. Think an hourglass geometry. This kind of geometry may not be common but it’s not impossible.
Ultimately the lack of divergences is a necessary but not sufficient diagnostic, just like every other diagnostic we have in Markov chain Monte Carlo. If you see divergences then there are problems, but not seeing them doesn’t guarantee that everything is perfect. About all we can say is that the longer your chains, and larger the effective sample size, the more sensitive the divergence diagnostic will be and wackier pathologies will have to induce significant bias without causing divergences.
Simulated-Based Calibration tests computation of posteriors induced only by data that is generated from one of its model configurations. That’s not sufficient to guarantee that a posterior induced from data external one of the model configurations will also be accurate.
Posterior retrodictive checks are not directly diagnostic of computational problems. If the posterior quantification is inaccurate then posterior retrodictive checks will be untrustworthy due to interactions between misfit and computational bias.
There are no sufficient conditions that guarantee accurate computation. All we can do is check as many necessary conditions as possible.
I’ll answer here but this conversation has gone way off topic and so I recommend starting a new topic if you want to continue.
The average acceptance probability is related to the bounded local errors of the integration when the symplectic integrator is stable. In this context we can figure out how to trade off error with computation – finding numerical integrators that are accurate enough to guide large transitions but not so accurate that they’re too expensive.
An unstable symplectic integrator behaves completely differently, which incidentally is the only reason we can actually identify divergences empirically. The error in unstable symplectic integrators is no longer bounded which prevents the construction of any explicit relationship between error and exploration. The paper shows this behavior in the normal case where the relationship between average acceptance probability and cost deviates outside of the theoretical bounds as soon as the integrator becomes unstable.
By “posterior retrodictive checks,” are you referring to what we usually call posterior predictive checks (as here: http://www.stat.columbia.edu/~gelman/research/published/A6n41.pdf and the earlier references there), or are you referring to something different, maybe using cross validation? With hierarchical modeling, the choice of predictive distribution depends on which replication you are interested in (new groups or new observations within existing groups), as we discuss in that paper. This also comes up with posterior predictive checking in brms and rstanarm.
OK, that’s another way of saying what I said, that it’s not the math or the model per se, but the model fitting.
At least for me, there’s only so much time in the day.
Yup. I’m more familiar with the usual MCMC example of plumbing a pond that has a deep narrow hole at some point.
Of course not, but posterior predictive checks will match fit to data.
I understand that in theory, I just wasn’t sure where the bound was. With an 80% acceptance rate target on a single orbit, we’re still taking big enough steps that we’re “diverging” from the Hamitonian by a factor of 0.8 because we have H(qN, pN) / H(q0, p0) = 0.8 rather than 1.0. It’s giving us a very bad solution of the ODE in terms of position; but it moves a long way and does a decent-ish job at preserving the Hamiltonian. I was thinking that slippage in the Hamiltonian might be bad enough to cause bigger divergences if we extend integration time. What’s remarkable is that it’s stable in the sense that the error remains about that same rate even if we run multiple orbits. If I recall, Leimkuhler and Reich work out the exact bound in terms of condition of the curvature (just 1 for a standard normal) for stability of the leapfrog.