We usually say a trajectory in HMC is diverging as soon as the energy error goes above 1000.
For a while now, I’ve commonly used much lower cutoff values to diagnose sampling issues. With a lower limit we typically get many more divergences, which makes it easier to figure out in which region of the posterior they are caused. The divergent draws are also typically closer to the source of the problem (even more so if we look at the position where the trajectory first crossed the energy error limit, which is available when sampling with nutpie).
This made me wonder where the default of 1000 came from. Is that just “pretty much any big number”, or are there more specific reasons for choosing it? Could a smaller value actually be a better default? The acceptance rate is zero for all practical purposes long before this cutoff, and once the energy error is that big, I wouldn’t expect it to come back down to something useful, even judged against much smaller thresholds.
I guess many users would hate that though, so many more divergences…
Yes. I think it’s also that once a trajectory starts diverging, it will usually diverge pretty far and not come back. Have you tried a lower threshold and seen a lot more divergences? If so, could you share the example you used?
It’s unfortunate in Stan that we only save the initial point from which a chain diverged. It’d be much better if we reported the last state before a divergence.
Indeed. Given it’s on the log scale, at an energy error of 20 the acceptance probability is already less than one in a billion.
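For reference, the Metropolis acceptance probability for an energy error ΔE is min(1, exp(−ΔE)), so a quick check (illustrative only, not sampler code) shows how fast it decays:

```python
import math

# Acceptance probability min(1, exp(-dE)) for a few energy errors:
for dE in [5, 10, 20, 50]:
    print(dE, min(1.0, math.exp(-dE)))

# At dE = 20 this is roughly 2e-9, i.e. below one in a billion,
# long before the default divergence cutoff of 1000.
```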
In case it’s relevant, note that Stan saves both the initial point from which a divergent trajectory started (the sample from the previous iteration, provided that we are saving un-thinned samples), and also an additional point sampled from the divergent trajectory (the sample from the divergent iteration), which may or may not be the same as the initial point.
Unfortunately I never got around to writing the more detailed explanation of the point selection procedure that I had planned to do at the end of that old thread of mine which @jsocolar linked. If someone needs this info I could still try to find the time to do so, but I imagine it would be quite time consuming (else I would probably have already done it back then).
Thanks for the links!
I really like the idea of storing points in the trajectory where the energy error increased a lot.
This wouldn’t be too hard to add in nutpie as an optional output…
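As a rough sketch of what such an optional output could look like (all names here are hypothetical, not nutpie’s actual internals): while integrating, keep the position at the step with the largest single-step increase in energy error.

```python
import numpy as np

def track_worst_step(positions, energies):
    """Given positions and total energies along a trajectory, return the
    position right after the single largest energy increase, plus that
    increase. Hypothetical sketch, not nutpie's implementation."""
    increases = np.diff(energies)      # step-by-step change in energy error
    worst = int(np.argmax(increases))  # step with the largest increase
    return positions[worst + 1], increases[worst]

# Toy trajectory: the energy blows up at the last step.
pos = np.array([0.0, 0.1, 0.2, 5.0, 50.0])
ene = np.array([1.0, 1.1, 1.3, 9.0, 400.0])
track_worst_step(pos, ene)  # -> (50.0, 391.0)
```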
Have you tried a lower threshold and seen a lot more divergences? If so, could you share the example you used?
I never did any systematic experimentation, but over the last year or so I did this with a few models that had a low number of divergences, and the number of divergences always increased a lot.
In a simple funnel:
```python
import pymc as pm
import pytensor.tensor as pt
import nutpie

with pm.Model() as model:
    log_sd = pm.Normal("log_sd")
    pm.Normal("x", sigma=pt.exp(log_sd))

compiled = nutpie.compile_pymc_model(model)
tr1 = nutpie.sample(compiled, progress_bar=False)
tr2 = nutpie.sample(compiled, max_energy_error=20, progress_bar=False)

int(tr1.sample_stats.diverging.sum())
# 177
int(tr2.sample_stats.diverging.sum())
# 986
```
Maybe it just takes too long to reach an energy error of 1000, and sampling stops earlier because either the diverging or the non-diverging end of the trajectory makes a u-turn?
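The u-turn condition referred to here is, roughly, that an endpoint’s momentum starts pointing back along the span of the trajectory. A simplified sketch (real samplers also apply this check within subtrees):

```python
import numpy as np

def is_uturn(q_minus, q_plus, p_minus, p_plus):
    """Simplified NUTS u-turn check: stop when either endpoint's momentum
    has a negative projection onto the trajectory span. Sketch only."""
    span = q_plus - q_minus
    return np.dot(span, p_minus) < 0 or np.dot(span, p_plus) < 0

# Endpoints still moving apart: no u-turn.
is_uturn(np.zeros(2), np.array([1.0, 0.0]),
         np.array([1.0, 0.0]), np.array([1.0, 0.0]))   # -> False

# The forward endpoint's momentum has turned back: u-turn.
is_uturn(np.zeros(2), np.array([1.0, 0.0]),
         np.array([1.0, 0.0]), np.array([-1.0, 0.0]))  # -> True
```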
This must be hitting max tree depth before it could spin out too far. It would probably be a good idea to lower this threshold for more informative warnings. The only downside is that some users really don’t like to see any warnings and are not going to like being reminded that the density is diverging by a factor of exp(20).
One thing I learned is that the energy error increases a lot more after the sampler diverged than before. So just returning the point with the highest energy error, or the highest error increase, will probably tend to return crazy points that are not super useful. If I remember correctly, what I did back then was roughly: return a point sampled from the trajectory, weighted by the error increase and the local density at that point. The idea being that a healthy sampler would always stay in the high density region, so anything that happens in low density regions cannot be the cause of why the sampler’s health declined.
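A rough sketch of that kind of selection rule (my reconstruction of the idea described above, not any sampler’s actual code): weight each trajectory point by its energy-error increase times its local density, then sample one index from those weights.

```python
import numpy as np

def pick_informative_point(energies, logps, rng):
    """Sample a trajectory index weighted by the local energy-error
    increase times the local density exp(logp). Sketch of the idea
    described above, not an actual sampler implementation."""
    increases = np.maximum(np.diff(energies), 0.0)  # only count increases
    # Weighting by density downweights points deep in the low-density
    # tail, where the error explodes only after the divergence started.
    log_w = np.log(increases + 1e-300) + logps[1:]
    log_w -= log_w.max()                            # numerical stability
    w = np.exp(log_w)
    return 1 + rng.choice(len(w), p=w / w.sum())

rng = np.random.default_rng(0)
# Toy trajectory: error and density both deteriorate along the way.
energies = np.array([1.0, 1.5, 3.0, 30.0, 1000.0])
logps = np.array([-1.0, -1.5, -3.0, -30.0, -1000.0])
pick_informative_point(energies, logps, rng)
```

With these toy numbers the huge final error increase is almost entirely cancelled by its tiny density, so the selected point tends to sit near where the trouble began rather than at the end of the blow-up.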