We usually say a trajectory in HMC is diverging as soon as the energy error goes above 1000.
For a while now, I’ve commonly used much lower cutoff values to diagnose sampling issues. With a lower limit we typically get many more divergences, which makes it easier to figure out in which region of the posterior they are caused. The divergent draws are also typically closer to the source of the problem (even more so if we look at the position where the trajectory first crossed the energy error limit, which is available when sampling with nutpie).
This made me wonder where the default of 1000 came from. Is that just “pretty much any big number”, or are there more specific reasons for choosing it? Could a smaller value actually be a better default? The acceptance rate is zero for all practical purposes long before this cutoff, and once the energy error is that big, I wouldn’t expect it to come back down to something useful, even judged against much smaller thresholds.
I guess many users would hate that though, so many more divergences…
Yes. I think it’s also that once a trajectory starts diverging, it will usually diverge pretty far and not come back. Have you tried a lower threshold and seen a lot more divergences? If so, could you share the example you used?
It’s unfortunate in Stan that we only save the initial point from which a chain diverged. It’d be much better if we reported the last state before a divergence.
Indeed. Given it’s on the log scale, at an energy error of 20 the acceptance probability is already less than one in a billion.
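For reference, the Metropolis acceptance probability for an energy error ΔE is min(1, exp(−ΔE)), so a quick check (illustrative only, not sampler code) shows how fast it decays:

```python
import math

# Acceptance probability min(1, exp(-dE)) for a few energy errors:
for dE in [5, 10, 20, 50]:
    print(dE, min(1.0, math.exp(-dE)))

# At dE = 20 this is roughly 2e-9, i.e. below one in a billion,
# long before the default divergence cutoff of 1000.
```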
In case it’s relevant, note that Stan saves both the initial point from which a divergent trajectory started (the sample from the previous iteration, provided that we are saving un-thinned samples), and also an additional point sampled from the divergent trajectory (the sample from the divergent iteration), which may or may not be the same as the initial point.
Unfortunately I never got around to writing the more detailed explanation of the point selection procedure that I had planned to do at the end of that old thread of mine which @jsocolar linked. If someone needs this info I could still try to find the time to do so, but I imagine it would be quite time consuming (else I would probably have already done it back then).
Thanks for the links!
I really like the idea of storing points in the trajectory where the energy error increased a lot.
This wouldn’t be too hard to add in nutpie as an optional output…
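As a rough sketch of what such an optional output could look like (all names here are hypothetical, not nutpie’s actual internals): while integrating, keep the position at the step with the largest single-step increase in energy error.

```python
import numpy as np

def track_worst_step(positions, energies):
    """Given positions and total energies along a trajectory, return the
    position right after the single largest energy increase, plus that
    increase. Hypothetical sketch, not nutpie's implementation."""
    increases = np.diff(energies)      # step-by-step change in energy error
    worst = int(np.argmax(increases))  # step with the largest increase
    return positions[worst + 1], increases[worst]

# Toy trajectory: the energy blows up at the last step.
pos = np.array([0.0, 0.1, 0.2, 5.0, 50.0])
ene = np.array([1.0, 1.1, 1.3, 9.0, 400.0])
track_worst_step(pos, ene)  # -> (50.0, 391.0)
```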
Have you tried a lower threshold and seen a lot more divergences? If so, could you share the example you used?
I never did any systematic experimentation, but over the last year or so I did this with a few models that had a low number of divergences, and the number of divergences always increased a lot.
In a simple funnel:
```python
import pymc as pm
import pytensor.tensor as pt
import nutpie

with pm.Model() as model:
    log_sd = pm.Normal("log_sd")
    pm.Normal("x", sigma=pt.exp(log_sd))

compiled = nutpie.compile_pymc_model(model)
tr1 = nutpie.sample(compiled, progress_bar=False)
tr2 = nutpie.sample(compiled, max_energy_error=20, progress_bar=False)

int(tr1.sample_stats.diverging.sum())
# 177
int(tr2.sample_stats.diverging.sum())
# 986
```
Maybe it just takes too long to reach an energy error of 1000, and sampling stops earlier because either the diverging or the non-diverging end of the trajectory makes a u-turn?
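The u-turn condition referred to here is, roughly, that an endpoint’s momentum starts pointing back along the span of the trajectory. A simplified sketch (real samplers also apply this check within subtrees):

```python
import numpy as np

def is_uturn(q_minus, q_plus, p_minus, p_plus):
    """Simplified NUTS u-turn check: stop when either endpoint's momentum
    has a negative projection onto the trajectory span. Sketch only."""
    span = q_plus - q_minus
    return np.dot(span, p_minus) < 0 or np.dot(span, p_plus) < 0

# Endpoints still moving apart: no u-turn.
is_uturn(np.zeros(2), np.array([1.0, 0.0]),
         np.array([1.0, 0.0]), np.array([1.0, 0.0]))   # -> False

# The forward endpoint's momentum has turned back: u-turn.
is_uturn(np.zeros(2), np.array([1.0, 0.0]),
         np.array([1.0, 0.0]), np.array([-1.0, 0.0]))  # -> True
```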
This must be hitting max tree depth before it could spin out too far. It would probably be a good idea to lower this threshold for more informative warnings. The only downside is that some users really don’t like to see any warnings and are not going to like being reminded that the density is diverging by a factor of exp(20).
One thing I learned is that the energy error increases a lot more after the sampler diverged than before. So just returning the point with the highest energy error, or the highest error increase, will probably tend to return crazy points that are not super useful. If I remember correctly, what I did back then was roughly: return a point sampled from the trajectory, weighted by the error increase and the local density at that point. The idea being that a healthy sampler would always stay in the high density region, so anything that happens in low density regions cannot be the cause of why the sampler’s health declined.
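A rough sketch of that kind of selection rule (my reconstruction of the idea described above, not any sampler’s actual code): weight each trajectory point by its energy-error increase times its local density, then sample one index from those weights.

```python
import numpy as np

def pick_informative_point(energies, logps, rng):
    """Sample a trajectory index weighted by the local energy-error
    increase times the local density exp(logp). Sketch of the idea
    described above, not an actual sampler implementation."""
    increases = np.maximum(np.diff(energies), 0.0)  # only count increases
    # Weighting by density downweights points deep in the low-density
    # tail, where the error explodes only after the divergence started.
    log_w = np.log(increases + 1e-300) + logps[1:]
    log_w -= log_w.max()                            # numerical stability
    w = np.exp(log_w)
    return 1 + rng.choice(len(w), p=w / w.sum())

rng = np.random.default_rng(0)
# Toy trajectory: error and density both deteriorate along the way.
energies = np.array([1.0, 1.5, 3.0, 30.0, 1000.0])
logps = np.array([-1.0, -1.5, -3.0, -30.0, -1000.0])
pick_informative_point(energies, logps, rng)
```

With these toy numbers the huge final error increase is almost entirely cancelled by its tiny density, so the selected point tends to sit near where the trouble began rather than at the end of the blow-up.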