This post is outdated, see Runtime warnings and convergence problems for an improved and expanded version
So you got:
`Warning: There were XXXX divergent transitions after warmup.`
What does it mean? What to do?
Divergent transitions are a signal that there is some sort of degeneracy; along with high `Rhat` / low `n_eff` and "maximum treedepth exceeded" warnings, they are the basic diagnostics for problems with a model. Divergences almost always signal a problem and even a small number of divergences cannot be safely ignored.
We should note that resolving modelling issues is generally hard and requires some understanding of probability theory and Hamiltonian Monte Carlo; see Understanding basics of Bayesian statistics and modelling for more general resources.
What is a divergent transition?
For some intuition, imagine walking down a steep mountain. If you take too big a step you will fall, so you need to take it slow. Now imagine walking on a wide hill: if your steps are too small, it will take forever to explore it. The mountain or hill here is our posterior distribution. A divergent transition signals that Stan was unable to find a step size that would be big enough to actually explore the posterior while still being small enough to not fall. The problem is usually a somewhat "uneven" or "degenerate" geometry of the posterior.
Further reading
- Identity Crisis - a rigorous treatment of the causes of divergences, their diagnosis and treatment.
- Taming divergences in Stan models - a less rigorous, but hopefully more accessible intuition on what divergent transitions are.
- Divergent transitions in Stan reference manual
- A Conceptual Introduction to Hamiltonian Monte Carlo
Hints to diagnose and resolve divergences
Diagnosing modelling problems is best thought of as part of a larger workflow of model building, testing and critique/evaluation. Building blocks of such a workflow are provided in Towards A Principled Bayesian Workflow by @betanalpha and in the Bayesian Workflow preprint by Gelman et al.
What follows is a list of brief hints that could help you diagnose the source of degeneracies in your model - or at least let you get faster help here on the forums. Where they exist, we link to additional resources for deeper understanding. The aim is to provide a bird's-eye view of approaches we've had success with in the past, point you to additional resources and give you keywords to search for :-) This is not, and cannot be, a definitive guide - each degenerate posterior is problematic in its own way and there is no single approach that always works.
If you fail to diagnose/resolve the problem yourself, or if you have trouble understanding or applying some of the hints, don't worry - you are welcome to ask here on Discourse and we'll try to help!
- Check your code. Divergences are almost as likely a result of a programming error as they are a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match? Do you have correct hard bounds (e.g. standard deviation parameters have `<lower=0>`, success probabilities have `<lower=0, upper=1>`; don't use hard bounds to express prior knowledge)? A sketch of bounds used purely as constraints follows below.
  - If there are other errors/warnings in the Stan output (e.g. `Location parameter is inf, but must be finite!`), investigate their source; they usually hint at coding errors. In `rstan`, those warnings might not get displayed unless you run with `chains = 1`.
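
  A minimal sketch (with made-up parameter names and priors): hard bounds express mathematical constraints, while the actual prior knowledge goes into the priors themselves. Here written as a Stan model string the way it could be passed to `rstan`:

  ```r
  library(rstan)

  # Hypothetical parameters block: bounds encode constraints only,
  # prior knowledge is carried by the priors.
  model_code <- "
  parameters {
    real mu;                       // unbounded location parameter
    real<lower=0> sigma;           // standard deviations must be positive
    real<lower=0, upper=1> theta;  // probabilities live in [0, 1]
  }
  model {
    mu ~ normal(0, 5);
    sigma ~ normal(0, 2);          // half-normal, thanks to <lower=0>
    theta ~ beta(2, 2);
  }
  "
  # mod <- stan_model(model_code = model_code)
  ```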
- Create a simulated dataset with known true values of all parameters. It is useful for so many things (including checking for coding errors). If the errors disappear on simulated data and the model recovers the "true" parameters from the simulation, your model may be a bad fit for the actual observed data (a minimal simulation sketch is shown below). Further reading:
  - Falling (In Love With Principled Modeling) has examples of using Stan to simulate data.
  - Failure to recover simulated group means in cross-classified LMM with monotonic predictor shows an R simulation for a complex model and how it helped diagnose a bug.
  - `brms` can simulate datasets using the `sample_prior = "only"` argument (see the docs for more details).
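
  A minimal sketch in plain R (names and values are made up): simulate data from a simple linear regression with known parameter values, so the fit can later be checked against them:

  ```r
  set.seed(1234)

  N <- 100
  true_alpha <- 1.5   # known "true" intercept
  true_beta  <- -0.7  # known "true" slope
  true_sigma <- 2.0   # known "true" residual sd

  x <- rnorm(N)
  y <- rnorm(N, mean = true_alpha + true_beta * x, sd = true_sigma)

  sim_data <- list(N = N, x = x, y = y)
  # Fit your model to sim_data and check that the posteriors comfortably
  # cover true_alpha, true_beta and true_sigma.
  ```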
- Reduce your model. Find the smallest / least complex model and a (preferably simulated) dataset that shows problems. Only add more complexity after you resolve all the issues with the small model. If your model has multiple components (e.g. a linear predictor for parameters in an ODE model), build and test small models where each of the components is separate (e.g. a separate linear model and a separate ODE model with constant parameters).
- Visualisations: in R you can use `mcmc_parcoord` from the `bayesplot` package, ShinyStan and `pairs` from `rstan` (a brief example follows below).
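
  A brief sketch, assuming `fit` is a multi-chain `stanfit` object and that parameters named `mu` and `tau` exist in your model (replace these with your own):

  ```r
  library(rstan)
  library(bayesplot)

  np <- nuts_params(fit)     # per-iteration sampler diagnostics, incl. divergences
  posterior <- as.array(fit)

  # Parallel coordinates plot: divergent iterations are highlighted; if they
  # cluster in one corner of parameter space, the parameters involved are suspects.
  mcmc_parcoord(posterior, np = np)

  # Bivariate scatter plots; divergences show up as differently coloured points.
  mcmc_pairs(posterior, np = np, pars = c("mu", "tau"))

  # rstan's own pairs method also marks divergent iterations.
  pairs(fit, pars = c("mu", "tau"))
  ```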
- Make sure your model is identifiable - non-identifiability (i.e. parameters are not well informed by the data, large changes in parameters can result in almost the same posterior density) and/or multimodality (i.e. multiple local maxima of the posterior distribution) cause problems (a toy non-identified model is sketched below). Further reading:
  - Case study - mixture models
  - Identifying non-identifiability - some informal intuition about the concept and examples of problematic models and how to spot them.
  - Underdetermined linear regression discusses problems arising when the data cannot inform all parameters.
  - Interpretation of cor term from multivariate animal models - #26 by martinmodrak has an example where a varying intercept at the individual level is not identified.
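
  A toy example of a non-identified model (illustrative only): the data inform only the sum `a + b`, so the individual parameters can drift freely along a ridge in the posterior:

  ```r
  library(rstan)

  model_code <- "
  data {
    int<lower=1> N;
    vector[N] y;
  }
  parameters {
    real a;
    real b;   // only a + b is constrained by the data -> a ridge in the posterior
  }
  model {
    // absent (or very wide) priors on a and b make the problem worse
    y ~ normal(a + b, 1);
  }
  "
  # Adding informative priors on a and b, or reparametrizing in terms of
  # their sum, removes the degeneracy.
  ```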
- Introduce more informative priors. However, this is dangerous terrain - a bad choice of priors can bias your inferences, so really think hard about whether your prior is justifiable.
  - When working on the logarithmic scale (e.g. logistic/Poisson regression) even seemingly narrow priors like `normal(0, 1)` can actually be quite wide (this makes an odds ratio/multiplicative effect of `exp(2)`, or roughly `7.4`, still a-priori plausible - is that consistent with your domain expertise?). See the quick check below.
  - A half-normal or half-Student-t distribution is usually preferable to half-Cauchy for sd parameters in varying intercept/effect models. Forum discussion, Gelman 2006
  - If you have additional knowledge that would let you defensibly constrain your priors, use it. Identity Crisis has some discussion of when this can help.
  - Simulating data from the prior (a.k.a. prior predictive check) is a good way to check if the priors and their interaction are roughly reasonable. Gabry et al. 2018 has an example.
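
  A quick plain-R check of what a `normal(0, 1)` prior on a log-odds (or log-rate) coefficient implies on the multiplicative scale:

  ```r
  set.seed(1)
  beta <- rnorm(1e5, mean = 0, sd = 1)   # draws from the normal(0, 1) prior
  quantile(exp(beta), probs = c(0.025, 0.5, 0.975))
  # The central 95% interval of implied odds ratios is roughly 0.14 to 7.1,
  # i.e. up to ~7-fold effects are still plausible a priori.
  ```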
- Compare your priors to the posterior distribution. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign, indicating that your priors might be substantially influencing the sampling. Here, setting wider priors can sometimes help.
- Reparametrize your model to make your parameters independent (uncorrelated) and informed by the data, and to have a posterior without sharp corners, cusps, or other irregularities. The main part of a reparametrization is to change the actual parameters and compute your parameters of interest in the `transformed parameters` block (see the non-centered sketch below). Further reading:
  - Case study - diagnosing a multilevel model discusses the non-centered parametrization, which is frequently useful.
  - The case study on hierarchical models by Mike Betancourt goes into more detail on the non-centered parametrization; Betancourt & Girolami 2015 addresses the same topic.
  - Identifying non-identifiability - a sigmoid model shows an example where the parameters are not well informed by the data, while Difficulties with logistic population growth model - #3 by martinmodrak shows a potential reparametrization.
  - Reparametrizing the Sigmoid Model of Gene Regulation shows problems and solutions in an ODE model.
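
  A minimal sketch of the non-centered parametrization for a simple hierarchical (eight-schools-style) model; the data structure and variable names are illustrative:

  ```r
  library(rstan)

  model_code <- "
  data {
    int<lower=1> J;                 // number of groups
    vector[J] y;                    // observed group means
    vector<lower=0>[J] sigma;       // known standard errors
  }
  parameters {
    real mu;
    real<lower=0> tau;
    vector[J] theta_raw;            // standardized group-level effects
  }
  transformed parameters {
    // the quantity of interest is computed here, not sampled directly
    vector[J] theta = mu + tau * theta_raw;
  }
  model {
    mu ~ normal(0, 5);
    tau ~ normal(0, 5);             // half-normal due to <lower=0>
    theta_raw ~ std_normal();       // implies theta ~ normal(mu, tau)
    y ~ normal(theta, sigma);
  }
  "
  ```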
- Move parameters to the `data` block and set them to their true values (from simulated data). Then return them one by one to the `parameters` block. Which parameter introduces the problems? (A sketch of this trick, continuing the illustrative hierarchical model above, is shown below.)
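
  For instance, in the illustrative hierarchical model above, `tau` could be moved to the `data` block and fixed to the value used in the simulation:

  ```r
  model_code_fixed_tau <- "
  data {
    int<lower=1> J;
    vector[J] y;
    vector<lower=0>[J] sigma;
    real<lower=0> tau;              // now supplied as data (the true simulated value)
  }
  parameters {
    real mu;
    vector[J] theta_raw;
  }
  transformed parameters {
    vector[J] theta = mu + tau * theta_raw;
  }
  model {
    mu ~ normal(0, 5);
    theta_raw ~ std_normal();
    y ~ normal(theta, sigma);
  }
  "
  # If the divergences disappear with tau fixed, tau (or its interaction
  # with theta) is a prime suspect.
  ```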
- Using a simulated dataset, introduce tight priors centered at the true parameter values (known from the simulation). How tight do the priors need to be to let the model fit? This is useful for identifying multimodality.
- Run the `optimizing` mode (penalized maximum likelihood) instead of sampling (NUTS) to check whether the resulting values look at least roughly reasonable; if not, try to find out why (see the sketch below).
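
  A brief sketch; `my_model.stan` and `sim_data` are placeholders for your own model file and data:

  ```r
  library(rstan)

  mod <- stan_model("my_model.stan")           # or model_code = ...
  fit_opt <- optimizing(mod, data = sim_data)  # penalized maximum likelihood
  fit_opt$par                                  # are the point estimates roughly sensible?
  ```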
- Run Stan with the `test_grad` option - it can detect some numerical instabilities in your model.
- Play a bit more with `adapt_delta`, `stepsize` and `max_treedepth`; see here for an example (and the snippet below). Note that increasing `adapt_delta` in particular has become quite common as the go-to first thing people try, and while there are cases where it becomes necessary to increase `adapt_delta` for an otherwise well-behaving model, increasing it without first trying the more rigorous exploration options above can hide pathologies that may impair accurate sampling. Furthermore, increasing `adapt_delta` will certainly slow down sampling. You are more likely to achieve both better sampling performance and a more robust model (not to mention a better understanding of it) by pursuing the options above and leaving adjustment of `adapt_delta` as a last resort. Increasing `adapt_delta` beyond 0.99 and `max_treedepth` beyond 12 is seldom useful. Also note that for the purpose of diagnosis it is actually better to have more divergences, so reverting to default settings during diagnosis is recommended.
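
  In `rstan` (and similarly via the `control` argument of `brms::brm()`), these settings are passed as follows; `my_model.stan` and `sim_data` are placeholders, and this should be a last resort rather than a first fix:

  ```r
  library(rstan)

  fit <- stan(
    file = "my_model.stan",
    data = sim_data,
    control = list(adapt_delta = 0.95, max_treedepth = 12)
  )
  ```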