Divergent transitions - a primer

Status: Draft

This is an experiment in using Discourse topics for documentation, as discussed at Discourse - issue/question triage and need for a FAQ. The content has yet to receive feedback and tweaks from the broader community. Also, this is a wiki post, so everyone except very new users of the forum can edit this topic; feel free to improve it. The goal of this topic is to give a brief overview of the main points and links to other resources, not a complete treatment of the topic.

Divergent transitions are a signal that there is some sort of degeneracy; along with high Rhat/low n_eff and “max treedepth exceeded” they are the basic tools for diagnosing problems with the model. Divergences almost always signal a problem and even a small number of divergences cannot be safely ignored.

What is a divergent transition?

For some intuition, imagine walking down a steep mountain. If you take too big a step you will fall, but if you take very tiny steps you might be able to make your way down the mountain, albeit very slowly. The mountain here is our posterior distribution. A divergent transition signals that Stan was unable to find a step size that is big enough to actually explore the posterior while still being small enough to not fall. The cause is usually an “uneven” or “degenerate” geometry of the posterior.
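To make the step-size intuition concrete, here is a minimal Python sketch (not Stan's actual implementation) of the leapfrog integrator used in Hamiltonian Monte Carlo, applied to a one-dimensional normal target with standard deviation sigma. The function name and all numeric values are illustrative assumptions. For this particular target the integrator is stable only when the step size is below 2 * sigma; above that, the energy error grows without bound, which is the kind of behaviour that gets reported as a divergence.

```python
def max_energy_error(sigma, step_size, n_steps, q0, p0=0.0):
    """Largest |H - H0| seen along one leapfrog trajectory.

    Target: normal(0, sigma), so the potential is U(q) = q^2 / (2 sigma^2).
    Kinetic energy uses a unit mass: K(p) = p^2 / 2.
    """
    def grad_potential(q):
        return q / sigma**2

    def hamiltonian(q, p):
        return 0.5 * q * q / sigma**2 + 0.5 * p * p

    q, p = q0, p0
    h0 = hamiltonian(q, p)
    worst = 0.0
    for _ in range(n_steps):
        p -= 0.5 * step_size * grad_potential(q)  # half step for momentum
        q += step_size * p                        # full step for position
        p -= 0.5 * step_size * grad_potential(q)  # half step for momentum
        worst = max(worst, abs(hamiltonian(q, p) - h0))
    return worst


# Narrow target (sigma = 0.1), so the stability limit is step_size < 0.2.
small_step = max_energy_error(sigma=0.1, step_size=0.05, n_steps=50, q0=0.1)
large_step = max_energy_error(sigma=0.1, step_size=0.25, n_steps=10, q0=0.1)
print(f"step 0.05: max energy error = {small_step:.4f}")
print(f"step 0.25: max energy error = {large_step:.3e}")
```

A posterior that is narrow in one region and wide in another puts the sampler in exactly this bind: a step size small enough for the narrow region is very inefficient in the wide region, and a step size tuned to the wide region blows up in the narrow one.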

Further reading

  • Identity Crisis - a rigorous treatment of the causes of divergences, their diagnosis and treatment.

Strategies to diagnose and resolve divergences

  1. Check your code. Divergences are almost as likely a result of a programming error as they are a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match?

  2. Create a simulated dataset with known true values of all parameters. It is useful for so many things (including checking for coding errors). If the errors disappear on simulated data, your model may be a bad fit for the actual observed data.

  3. Reduce your model. Find the smallest / least complex model and a (preferably simulated) dataset that still shows problems. Only add more complexity after you resolve all the issues with the small model. If your model has multiple components (e.g. a linear predictor for parameters in an ODE model), build and test small models where each of the components is separate (e.g. a separate linear model and a separate ODE model with constant parameters).

  4. Visualisations: use mcmc_parcoord from the bayesplot package, Shinystan and pairs from rstan. Further reading:

  5. Make sure your model is identifiable - non-identifiability (i.e. parameters are not well informed by data, so large changes in parameters can result in almost the same posterior density) and/or multimodality (i.e. multiple local maxima of the posterior distribution) cause problems. Further reading:

  6. Check your priors. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign.

  7. Avoid overly wide prior distributions, unless really large values of the parameters are plausible. Especially when working on the logarithmic scale (e.g. logistic/Poisson regression), even seemingly narrow priors like normal(0, 1) can actually be quite wide (this still treats an odds ratio/multiplicative effect of exp(2), or roughly 7.4, as a priori plausible).

  8. If you have additional knowledge that would let you defensibly constrain your priors, use it. Identity Crisis has some discussion of when this can help. However, be careful not to use tighter priors than you can actually justify from background knowledge.

  9. Reparametrize your model to make your parameters independent (uncorrelated), constrained by the data and close to N(0,1) (i.e. change the parameters that are actually sampled and compute your parameters of interest in the transformed parameters block). Further reading:

  10. Move parameters to the data block and set them to their true values (from simulated data). Then return them one by one to parameters block. Which parameter introduces the problems?

  11. Introduce tight priors centered at true parameter values. How tight do the priors need to be to let the model fit? Useful for identifying multimodality.

  12. Run Stan with the test_grad option - can detect some numerical instabilities in your model.

  13. Play a bit more with adapt_delta, stepsize and max_treedepth; see here for an example. Note that increasing adapt_delta in particular has become the go-to first thing people try. While there are cases where it is necessary to increase adapt_delta for an otherwise well-behaving model, increasing it without the more rigorous exploration options above can hide pathologies that may impair accurate sampling. Furthermore, increasing adapt_delta will slow down sampling. You are more likely to achieve both better sampling performance and a more robust model (not to mention a better understanding of it) by pursuing the above options and leaving adjustment of adapt_delta as a last resort. Increasing adapt_delta beyond 0.99 and max_treedepth beyond 12 is seldom useful. Also note that for the purpose of diagnosis, it is actually better to have more divergences, so reverting to default settings for diagnosis is recommended.
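As a rough illustration of the reparametrization advice in point 9, the following Python sketch (a toy simulation, not a Stan model) draws from a funnel-shaped hierarchical prior in which theta has scale tau. In the centered form the sampler would have to explore theta directly, and the spread of theta changes enormously with tau. In the non-centered form it would explore z with theta = tau * z, and the spread of z does not depend on tau at all. The function name, the prior on log(tau) and the cut-offs are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def spread_ratio_across_tau(n_draws=20000, seed=1):
    """Compare how the scale of theta and of z changes between small and large tau.

    Returns (centered_ratio, non_centered_ratio), where each ratio is
    sd(parameter | large tau) / sd(parameter | small tau).
    """
    rng = random.Random(seed)
    theta_small, theta_large, z_small, z_large = [], [], [], []
    for _ in range(n_draws):
        log_tau = rng.gauss(0.0, 1.5)      # assumed prior on the log scale
        z = rng.gauss(0.0, 1.0)            # non-centered parameter
        theta = math.exp(log_tau) * z      # parameter of interest
        if log_tau < -1.5:
            theta_small.append(theta)
            z_small.append(z)
        elif log_tau > 1.5:
            theta_large.append(theta)
            z_large.append(z)
    centered = statistics.pstdev(theta_large) / statistics.pstdev(theta_small)
    non_centered = statistics.pstdev(z_large) / statistics.pstdev(z_small)
    return centered, non_centered


centered_ratio, non_centered_ratio = spread_ratio_across_tau()
print(f"centered (theta):   scale changes by a factor of {centered_ratio:.1f}")
print(f"non-centered (z):   scale changes by a factor of {non_centered_ratio:.2f}")
```

In a Stan program the same idea would correspond to declaring z in the parameters block and computing theta in transformed parameters. Whether the non-centered form actually helps depends on how strongly the data inform theta, so treat this as one option to try rather than a universal fix.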
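The remark in point 7 about priors on the logarithmic scale can be checked with a few lines of arithmetic. This Python sketch converts a normal(0, sd) prior on a log-scale coefficient into the central interval of multiplicative effects (odds ratios or rate ratios) it implies. The helper name and the standard deviations tried are illustrative assumptions.

```python
import math
from statistics import NormalDist


def multiplicative_interval(prior_sd, mass=0.95):
    """Central interval of exp(beta) when beta ~ normal(0, prior_sd)."""
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)  # standard normal quantile
    return math.exp(-z * prior_sd), math.exp(z * prior_sd)


for sd in (0.5, 1.0, 10.0):
    lower, upper = multiplicative_interval(sd)
    print(f"normal(0, {sd}): central 95% of effects between "
          f"{lower:.3g}x and {upper:.3g}x")
```

Running this for a few candidate standard deviations is a quick way to see whether a prior that looks modest on the log scale is wider than your background knowledge supports.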

If you fail to diagnose/resolve the problem yourself, or if you have trouble understanding or executing some of the strategies outlined above, you are welcome to ask here on Discourse. We’ll try to help!


Thanks–this is great!