In the Reference Manual, it states:
Poor behavior in the tails is the kind of pathology that can be uncovered by running only a few warmup iterations. By looking at the acceptance probabilities and step sizes of the first few iterations provides an idea of how bad the problem is and whether it must be addressed with modeling efforts such as tighter priors or reparameterizations.
Do we have any intuitions on useful thresholds for detecting things going awry with either metric? (I’m working on some during-warmup/sampling diagnostic ideas.)
In some order,
- Max treedepth exceeded
- Low Neff
- Bad Rhat (multiple chains makes this work really well imo)
- Slow chains/high treedepths
- Some chains fast and some chains slow
I thought this worked pretty well: https://arxiv.org/abs/1905.11916 – it’s a heuristic that, given two metrics, guesses which will do better. Sorta different from detecting that things went awry – more a guess at what might break in the future.
Cool. It’s something we want to improve.
The workflow paper: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf probably has more of the inspiration – some combination of wanting to run faster + fail faster
There’s a channel on the slack where we’re talking about benchmarking this stuff: Mc-stan community slack (basically uses of https://github.com/MansMeg/posteriordb)
Oh, my bad; I used the word “metric”, which I know has a technical meaning, but I actually simply meant the values in the accept_stat__ and stepsize__ columns. I’m nowhere near knowledgeable enough to try to tackle assessing the actual HMC metric proper (nor do I really understand what it even is!). I’m just looking to add some monitoring of the csv content during warmup (I already have divergence and treedepth watching, and presumably rhat/ess stuff shouldn’t be computed until sampling begins).
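For what it’s worth, here is a minimal Python sketch of that kind of csv monitoring. It assumes the chain was run with save_warmup=1 so that warmup draws appear in the output file, and that the file is a standard Stan CSV ('#' comment lines plus a header row containing accept_stat__ and stepsize__). The function names and the choice of summary are made up for illustration; no threshold is implied.

```python
import csv

def read_sampler_columns(text, columns=("accept_stat__", "stepsize__")):
    """Extract numeric columns from the text of a (possibly still growing) Stan CSV."""
    # Drop a trailing line that has not been fully written yet.
    text = text[: text.rfind("\n") + 1]
    # Stan CSV files interleave '#' comment lines with the data rows.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    out = {c: [] for c in columns}
    if not lines:
        return out
    reader = csv.reader(lines)
    header = next(reader)
    idx = {c: header.index(c) for c in columns}
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed rows
        try:
            vals = {c: float(row[i]) for c, i in idx.items()}
        except ValueError:
            continue
        for c, v in vals.items():
            out[c].append(v)
    return out

def summarize_first_iterations(cols, n_first=20):
    """Hypothetical summary: mean acceptance and smallest step size early in warmup."""
    acc = cols["accept_stat__"][:n_first]
    step = cols["stepsize__"][:n_first]
    if not acc or not step:
        return None
    return {"mean_accept": sum(acc) / len(acc), "min_stepsize": min(step)}
```

Calling read_sampler_columns(open(path).read()) every few seconds while the chain runs would give a running view of both columns; what counts as "too low" is exactly the open question here.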
Hmm, have you seen this: Issue with dual averaging?
Sometimes when I’m running models I would like to see the lp__ in each chain, for instance. Then I could vaguely see if everything ended up in the same place.
This still goes back to the workflow paper, failing fast and all. If the chains aren’t getting to around the same lp within 100-150 draws, it’s time to kill the chains and see what is going on.
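A rough Python sketch of that check, assuming you already have the lp__ draws of each chain as a list of floats. The function names, the window size, and especially the tolerance `tol` are hypothetical; lp__ lives on a model-dependent scale, so the default below is a placeholder and not a recommendation.

```python
from statistics import median

def lp_spread(chains_lp, window=50):
    """Range of the per-chain median lp__ over each chain's last `window` draws."""
    meds = [median(lp[-window:]) for lp in chains_lp if lp]
    if len(meds) < 2:
        return 0.0
    return max(meds) - min(meds)

def chains_disagree(chains_lp, min_draws=100, window=50, tol=10.0):
    """Return True if, after `min_draws` draws, the chains sit at clearly different lp__.

    `tol` is an arbitrary placeholder and needs to be chosen per model.
    """
    if any(len(lp) < min_draws for lp in chains_lp):
        return False  # too early to judge
    return lp_spread(chains_lp, window) > tol
```

Medians over a trailing window are used so that a single extreme draw doesn’t trigger the flag; a scale-free variant (for example, comparing the spread with the within-chain variation) might be a better fit in practice.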
There are a few things here:
- Monitoring chains as they run
- Async cmdstan interface so it is possible to monitor stuff as it runs
- Doing analysis on killed chains – that might mean chains of different lengths at various stages of adaptation
- Restarting chains from where you killed them
I don’t mean to weigh you down with giant projects but if you’re feeling keen to write experimental R packages then yeah, we’re curious how much all these things are worth. I’ve talked to @jtimonen about this some.