Hello,
I have a potentially simple, general question about the relationship between adapt_delta and chain length. When a model yields divergent transitions, I often increase adapt_delta (sometimes as high as 0.999) in an effort to reduce ‘false positives’. My question is whether I should also increase the length of the chains to account for the fact that the sampler is taking smaller steps. Essentially, I’m wondering whether reducing the step size actually eliminates the divergences or simply prevents the sampler from reaching the portion of parameter space where the divergences occurred previously, and, if so, whether I should automatically increase chain length when I increase adapt_delta. Are there other diagnostics I can use to convince myself that I’ve actually explored the space completely?
Apologies if this is obvious and thanks in advance for any thoughts you offer.
Increasing adapt_delta generally requires longer trajectories, which manifests as maxing out the tree depth. If you max out the tree depth regularly, you can increase max_treedepth and thereby take more time steps in each trajectory. If you do that, you won’t, in theory, need to increase the number of iterations. One warning, though: until you get into equilibrium, the initial stages of exploration will generally be highly oscillatory. A short tree depth tends to bleed kinetic energy out of the system and gets you into the typical set faster. So it makes sense to do a short run with a short tree depth until the lp__ value stabilizes, and then take those final positions as your initializations for a longer run with a longer tree depth.
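Roughly, that two-stage workflow might look like the sketch below, assuming cmdstanpy; the model file, data file, and the specific numbers are placeholders, not recommendations.

```python
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")  # placeholder model

# Stage 1: a short run with a small tree depth to settle into the typical set.
warm = model.sample(
    data="data.json",      # placeholder data file
    chains=4,
    iter_warmup=200,
    iter_sampling=50,
    max_treedepth=5,
)

# Use a late draw as initial values for the longer run.
# stan_variables() concatenates chains, so [-1] is just the last stored draw;
# passing a list of one dict per chain to inits would follow the per-chain
# "final positions" advice more literally.
inits = {name: vals[-1].tolist() for name, vals in warm.stan_variables().items()}

# Stage 2: the real run with a longer tree depth and higher adapt_delta.
fit = model.sample(
    data="data.json",
    chains=4,
    inits=inits,
    iter_warmup=1000,
    iter_sampling=1000,
    max_treedepth=12,
    adapt_delta=0.95,
)
```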
In practice, the way to set the number of iterations is to make sure that multiple chains converge to the same thing and that the effective sample size for your parameters is big enough. Once in equilibrium, increasing the tree depth gives more effective samples per iteration.
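For the “chains converge and ESS is big enough” check, something like this works with a cmdstanpy fit object like the one above (the summary column names vary a bit across CmdStan versions):

```python
# R-hat near 1 and a comfortable effective sample size are the practical
# stopping criteria; diagnose() also reports divergences and treedepth issues.
summary = fit.summary()              # pandas DataFrame, one row per quantity
print(summary[["N_Eff", "R_hat"]])   # column names depend on CmdStan version
print(fit.diagnose())                # CmdStan's built-in diagnostic report
```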
Those are technically true positives. The Hamiltonian simulation has in fact diverged.
I think you mean maximum tree depth. You should increase it if you’re hitting the max tree depth. But if you increase adapt_delta, that should improve mixing, so you should technically need shorter chains (in terms of number of iterations) to achieve the same effective sample size.
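A quick way to see whether you’re actually hitting the max tree depth is to pull the per-iteration sampler quantities; this sketch assumes a cmdstanpy CmdStanMCMC object named fit (the accessor name has changed across cmdstanpy versions):

```python
import numpy as np

sampler = fit.method_variables()   # per-iteration sampler quantities, per chain
treedepth = sampler["treedepth__"]
divergent = sampler["divergent__"]

max_td = 12                        # whatever max_treedepth was set to for the run
print("iterations at max treedepth:", int(np.sum(treedepth >= max_td)))
print("divergent transitions:      ", int(np.sum(divergent)))
```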
Simulated-data coverage tests, like the Cook-Gelman-Rubin approach.
You can also do what @dlakelan suggests and make sure each chain converges to the same adaptation parameters. If not, you probably want to run warmup longer.
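To check that, you can compare the adapted step size and metric across chains; another sketch assuming a cmdstanpy fit object:

```python
# One adapted step size per chain and one (diagonal) metric per chain after
# warmup; if these differ wildly between chains, run warmup longer.
print("step size per chain:", fit.step_size)
print("metric per chain:\n", fit.metric)
```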
Thanks Bob - I am still trying to develop my intuition for how the various options (e.g., adapt_delta, max_treedepth) affect the sampler and how changes in those values alleviate some of the warnings thrown. Your answer and @dlakelan’s are both helping with that.
They are true positives in the way Bob says, but I also sometimes say false positives in the sense that, unlike other divergences, they don’t indicate the need for reparameterization when they go away easily with an increase in the resolution of the sampler. So either true or false positive seems appropriate. Maybe we shouldn’t say either just to avoid confusion.
I was under the impression that a false positive divergence was when the leapfrogging tripped over the hard-coded threshold in the code but would have U-turned if the threshold had been some larger finite number.
I just meant that when changing adapt_delta is sufficient, the divergences are false positives from the perspective of identifying a region too pathological for Stan. If they couldn’t be fixed by adjusting tuning parameters, then from this perspective they’d be true positives. But that’s just one perspective.
I would say they are true positives for the settings in which they crop up. This doesn’t necessarily mean the model’s intractable or can’t be fit with other settings.
Did Andrew approve that terminology? I’d have thought he’d have squawked at “false positive” the same way he squawks at “random effects” and at “sample” being used for “draw” (he keeps telling me a sample is a set of draws, because I’m the only other person who seems to be as picky as he is about terminology). We try to be very pragmatic, though, realizing this is human language, not computer code. So go ahead and use “random effects” if it makes you happy. Just be clear about what you’re defining!
I assume so since he did make a bunch of edits to the paper but didn’t change that bit. But to avoid unnecessary confusion I’m certainly open to changing the terminology we use in that part of the paper.
This is how I was using the terminology as well. @Bob_Carpenter’s point makes sense to me (these transitions are actually divergent). I was using “false positives” as a (lazy) shorthand for the situation @jonah describes here.
So what can we call “divergences that don’t end up signaling a deeper problem that adapt_delta can’t fix”? The true/false positive terminology is convenient but clearly too ambiguous; something less verbose than what I put in quotes above would be nice.
I like “preventable”, although maybe it should be “preventable without reparameterization”. But even that longer version is quite a bit less verbose, so that’s nice.
This doesn’t seem to do it justice. When I began using Stan and complained of these persistent divergences, I had colleagues suggest that I should just fit the model in JAGS because that did not result in the same errors. My sense is that there is a part of the user community that would view all divergences as preventable if they just used a different algorithm/interface (I’m being mildly sarcastic here). I think this reflects some lack of understanding of what Stan is actually telling you when this happens.
Perhaps “persistent divergences”? That would let one distinguish divergences that persist despite increases in adapt_delta (and indicate a pathological posterior) from those that are alleviated by increasing adapt_delta.
This is a good point and I think your experience with your colleagues is an unfortunately common situation. Of course, I don’t expect people to know what Stan’s warnings mean without first learning about them (and I know it’s not easy!), but it would be nice if they didn’t draw baseless conclusions about the implications of the warnings before learning about them. I’m not holding my breath though ;)
Sorry for missing this conversation. A divergence has a very formal definition – it’s the result of an unstable symplectic integrator whose shadow Hamiltonian level sets have become noncompact, hence the numerical trajectories fly off to infinity.
Now, in practice it could take some time for the trajectory to fly off to infinity, so instead we have to define some threshold at which we terminate early and call the trajectory divergent. Really this can only be done with the Hamiltonian, so we set a rather large threshold, Delta H > 1000, as “okay, it’s probably safe to assume that the trajectory is on its way towards infinity”.
Unfortunately, in some high-dimensional models the variation in the Hamiltonian can reach that threshold even if the trajectory is stable! Hence the divergent flag is very much a false positive with respect to the decision “is the trajectory actually unstable and diverging?”.