Divergent transitions - a primer

Status: Cautiously recommended

This is an experiment in using Discourse topics for documentation, as discussed at Discourse - issue/question triage and need for a FAQ. The content has received some feedback and tweaks from the broader community, but has not been evaluated extensively. Also, this is a wiki post, so everyone except very new users of the forum can edit this topic - feel free to improve it. The goal of this topic is to give a brief overview of the main points and links to other resources, not a complete treatment of the topic.


So you got:

Warning: There were XXXX divergent transitions after warmup.

What does it mean? What to do?

Divergent transitions are a signal that there is some sort of degeneracy; along with high Rhat/low n_eff and “max treedepth exceeded” they are the basic tools for diagnosing problems with a model. Divergences almost always signal a problem and even a small number of divergences cannot be safely ignored.

We should note that resolving modelling issues is generally hard and requires some understanding of probability theory and Hamiltonian Monte Carlo; see Understanding basics of Bayesian statistics and modelling for more general resources.

What is a divergent transition?

For some intuition, imagine walking down a steep mountain. If you take too big a step you will fall, so you need to take it slow. Now imagine walking on a wide hill - if your steps are too small, you will take forever to explore it. The mountain or hill here is our posterior distribution. A divergent transition signals that Stan was unable to find a step size that would be big enough to actually explore the posterior while still being small enough to not fall. The problem is usually with somehow “uneven” or “degenerate” geometry of the posterior.
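To make the step-size trade-off concrete, here is a minimal sketch in plain Python. It is not Stan's implementation; it only runs a leapfrog integrator (the kind of integrator HMC relies on) on a one-dimensional normal(0, sigma) target and reports how badly the total energy drifts. The function name, the starting point and the particular step sizes are assumptions chosen for illustration.

```python
def leapfrog_energy_error(sigma, step_size, n_steps=50, q0=1.0, p0=1.0):
    """Largest |H - H0| along a leapfrog trajectory on a normal(0, sigma) target.

    Potential energy: U(q) = q^2 / (2 sigma^2); kinetic energy: p^2 / 2.
    """
    def grad_u(q):
        return q / sigma**2

    def energy(q, p):
        return q**2 / (2 * sigma**2) + p**2 / 2

    q, p = q0 * sigma, p0          # start one standard deviation from the mode
    h0 = energy(q, p)
    worst = 0.0
    for _ in range(n_steps):
        p -= 0.5 * step_size * grad_u(q)   # half step for momentum
        q += step_size * p                 # full step for position
        p -= 0.5 * step_size * grad_u(q)   # half step for momentum
        worst = max(worst, abs(energy(q, p) - h0))
    return worst


# Wide "hill" (sigma = 1) versus steep "mountain" (sigma = 0.1).
for sigma in (1.0, 0.1):
    for step_size in (0.5, 0.05):
        err = leapfrog_energy_error(sigma, step_size)
        print(f"sigma={sigma:<4} step_size={step_size:<5} max energy error={err:.3g}")
```

In this toy setting a step size of 0.5 keeps the energy error small on the wide target but blows up on the narrow one, while 0.05 is stable on both and simply crosses the wide target ten times more slowly. A posterior that mixes both scales, so that no single step size suits every region, is the kind of geometry the paragraph above describes.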

Further reading

  • Identity Crisis - a rigorous treatment on the causes of divergences, diagnosis and treatment.

Hints to diagnose and resolve divergences

What follows is a list of brief hints that could help you diagnose the source of degeneracies in your model - or at least let you get faster help here on the forums. Where they exist, we link to additional resources for deeper understanding. The aim is to provide a bird's-eye view of approaches we’ve had success with in the past, point you to additional resources and give you keywords to search for :-) This is not, and cannot be, a definitive guide - each degenerate posterior is problematic in its own way and there is no single approach that would always work.

If you fail to diagnose/resolve the problem yourself or if you have trouble understanding or applying some of the hints, don’t worry, you are welcome to ask here on Discourse, we’ll try to help!

  1. Check your code. Divergences are almost as likely to result from a programming error as from a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match? Do you have correct hard bounds (e.g. standard deviation parameters have <lower=0>, success probabilities have <lower=0, upper=1>)? Don’t use hard bounds to express prior knowledge.

    • If there are other errors/warnings in the Stan output (e.g. Location parameter is inf, but must be finite!), investigate their source; they usually hint at coding errors. In rstan, those warnings might not get displayed unless you run with chains = 1.
  2. Create a simulated dataset with known true values of all parameters. It is useful for so many things (including checking for coding errors). If the errors disappear on simulated data and the model recovers the “true” parameters from the simulation, your model may be a bad fit for the actual observed data. Further reading:

  3. Reduce your model. Find the smallest / least complex model and a (preferably simulated) dataset that shows problems. Only add more complexity after you resolve all the issues with the small model. If your model has multiple components (e.g. a linear predictor for parameters in an ODE model), build and test small models where each of the components is separate (e.g. a separate linear model and a separate ODE model with constant parameters).

  4. Visualisations: in R you can use mcmc_parcoord from the bayesplot package, Shinystan and pairs from rstan. Further reading:

  5. Make sure your model is identifiable - non-identifiability (i.e. parameters are not well informed by data, large changes in parameters can result in almost the same posterior density) and/or multimodality (i.e. multiple local maxima of the posterior distributions) cause problems. Further reading:

  6. Introduce more informative priors. However, this is a dangerous terrain - bad choice of priors can bias your inferences, so really think hard whether your prior is justifiable.

    • When working on the logarithmic scale (e.g. logistic/Poisson regression) even seemingly narrow priors like normal(0, 1) can actually be quite wide (this makes an odds ratio/multiplicative effect of exp(2), or roughly 7.4, still a-priori plausible - is that consistent with your domain expertise?).
    • Half-normal or half-Student-t distributions are usually preferable to half-Cauchy for sd parameters in varying intercept/effect models. Forum discussion, Gelman 2006
    • If you have additional knowledge that would let you defensibly constrain your priors, use it. Identity Crisis has some discussion of when this can help.
    • Simulating data from the prior (a.k.a prior predictive check) is a good way to check if the priors and their interaction are roughly reasonable. Gabry et al. 2018 has an example.
  7. Compare your priors to the posterior distribution. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign, indicating that your priors might be substantially influencing the sampling. Here, setting wider priors can sometimes help.

  8. Reparametrize your model to make your parameters independent (uncorrelated), informed by the data and to have a posterior without sharp corners, cusps, or other irregularities. The main part of reparametrization is to change the actual parameters and compute your parameters of interest in the transformed parameters block. Further reading:

  9. Move parameters to the data block and set them to their true values (from simulated data). Then return them one by one to parameters block. Which parameter introduces the problems?

  10. Using a simulated dataset, introduce tight priors centered at true parameter values (known from the simulation). How tight need the priors to be to let the model fit? Useful for identifying multimodality.

  11. Run the optimizing mode (penalized maximum likelihood) instead of sampling (NUTS) to check if the resulting values look at least roughly reasonable. If not, try to find out why.

  12. Run Stan with the test_grad option - can detect some numerical instabilities in your model.

  13. Play a bit more with adapt_delta, stepsize and max_treedepth; see here for an example. Note that increasing adapt_delta in particular has become the go-to first thing people try. While there are cases where it is necessary to increase adapt_delta for an otherwise well-behaved model, increasing it without the more rigorous exploration above can hide pathologies that may impair accurate sampling. Furthermore, increasing adapt_delta will certainly slow down sampling. You are more likely to achieve both better sampling performance and a more robust model (not to mention an understanding thereof) by pursuing the above options and leaving adjustment of adapt_delta as a last resort. Increasing adapt_delta beyond 0.99 and max_treedepth beyond 12 is seldom useful. Also note that for the purpose of diagnosis, it is actually better to have more divergences, so reverting to default settings for diagnosis is recommended.
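As a small illustration of the point in hint 6 about priors on the logarithmic scale, the following sketch (plain Python, standard library only) converts a normal(0, sd) prior on a log-scale coefficient into the range of multiplicative effects it treats as plausible. The helper name and the choice of a central 95% interval are assumptions made for this example, not a recommendation.

```python
import math
from statistics import NormalDist


def implied_ratio_interval(prior_sd, mass=0.95):
    """Central interval of exp(beta) when beta ~ normal(0, prior_sd)."""
    z = NormalDist().inv_cdf(0.5 + mass / 2)   # about 1.96 for mass = 0.95
    return math.exp(-z * prior_sd), math.exp(z * prior_sd)


for prior_sd in (0.25, 0.5, 1.0, 2.5):
    low, high = implied_ratio_interval(prior_sd)
    print(f"normal(0, {prior_sd}): multiplicative effect between {low:.2f} and {high:.2f}")
```

Reading the output against domain knowledge (is a sevenfold change in odds really plausible here?) is a quick first check; a full prior predictive simulation, as mentioned in hint 6, goes further because it also shows how several priors interact.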

Thanks–this is great!

Would really love @betanalpha’s feedback on this, as a lot of the suggestions were born out of paraphrasing/digging into his work, so I hope I didn’t lose any important details on the way.

I remembered @andre.pfeuffer and @tlyim had some useful points in older threads: N best tips & tricks (or the go-to checklist) for new Stan model builders?, Top tips for beginner Stan users

Going to add those to the list. Happy if you correct me if you think something’s fishy

EDIT: I actually don’t really understand the bit about the mass-matrix, would you care to add it (and preferably share a link to a forum thread/other resource that has more details)?

The challenge with trying to offer explicit recommendations is that any useful recommendation has to take into account the particular context of a given modeling problem or else it comes across as too generic and vague or ends up being wrong in that context. Take (9) – what exactly is a reparameterization and why does it help if it’s supposed to be the same model? What is supposed to be reparameterized? How is it supposed to be reparameterized? Time and time again I come across users frustrated by advice like this because they presume that it’s supposed to be sufficient and that the fact that they don’t know how to proceed is their fault.

In my opinion a checklist/flowchart is the wrong direction because it implies the false expectation that by following it you’ll be able to solve your problem, and if it doesn’t then users end up even more frustrated than before. And even if it doesn’t work they often don’t know why and hence how to generalize that resolution to other problems.

You’ll note that in my identifiability/degeneracy case study I don’t offer any particular recommendations but instead lay out a suite of methods to explore pathologies and build up understanding of what could be going wrong. It’s only in my modeling technique courses and case studies that I discuss particular pathologies inherent to the technique and common resolutions that work well in that context.

In your checklist for example, there’s little explanation for why any of the recommendations might be productive.

  1. Why would programming errors induce divergences? Explicit examples of common programming errors that lead to unintendedly pathological target density functions are often more useful (missing a prior implies a uniform prior, missing a bound will cause problems with component density evaluations, bad array indices will leave variables uninitialized, etc). See for example https://betanalpha.github.io/assets/case_studies/stan_intro.html#8_debugging_stan_programs.

  2. What “true” model configurations should be considered? Just one or many? Similarly how many simulated data sets per true model configuration? What happens if errors persist with the simulated data? What happens if the errors go away – what is “misfit” and how can it be addressed?

  3. Doesn’t really solve any problems. It’s useful, but for facilitating the investigation, not for investigating directly.

  4. What’s a parallel coordinate plot? What’s a good one and a bad one? Same for the pairs plots? If something looks weird what does one do with it?

  5. Not the technical definition of “identifiability”; I went with “degeneracy” in my case study to avoid terminology clashes and complaints from pedantic statisticians. What kinds of degeneracies can there be? How can each be moderated?

  6. Check priors for what? The checks being recommended sound like observing posterior sampling, not priors. I’m guessing you mean compare the posterior to the prior.

  7. Wide/large are undefined here. The first bullet does mention that unit scale isn’t always appropriate but how does someone figure out what is? Also the second bullet point is getting at a different issue – it’s not the scale parameter that’s problematic but the interaction of the scale parameter with other parts of the model. Understanding the interaction is what really matters here.

  8. Reparameterization is ill-defined because there are an infinite number of possible reparameterizations. Do you pick one from a box and try them one at a time? This is where it’s impossible to talk in generality and we need to focus the conversation on the context of particular modeling techniques.

  9. Doesn’t seem that different from 2 and 3.

  10. This works only when doing (2); it’s a good example how confusing recommendations can be without very carefully defining the appropriate context.

  11. [optimizing computes a penalized maximum likelihood, not a MAP] What is “reasonable”? How do you find out why something isn’t reasonable? What, if anything, does the penalized maximum likelihood tell you about the posterior?

  12. At what points should one run test_grad? When are numerical instabilities okay and when are they important?

  13. How does one play with the sampler configuration? What does each configuration do and why would it help? How do you respond to different behaviors when the configuration is changed?

If you expand these recommendations out to include context, examples, and the like then the entire thing strays away from being a check list but that’s exactly my point – discussion of motivation, context, and limitations is needed to ensure that users can employ the recommendations responsibly. Because few users have extensive statistical training – let alone training in Bayesian modeling, Markov chain Monte Carlo, and Hamiltonian Monte Carlo – they won’t know how to employ most of these recommendations appropriately and in cases without links they also won’t know how to follow up with additional reading.

Ultimately I recommend thinking in terms of a user who doesn’t know what any of these concepts are. Explain in a few paragraphs why the geometry of the posterior density can be problematic for computation and how modeling choices can influence that geometry. Then expand each bullet point with a few paragraphs to explain a type of pathology, why it’s problematic, how it might be addressed, and then reference further material. In other words, instead of a checklist consider an introduction to the challenging world of Bayesian computation.

The in-depth discussion helps set the expectation that these issues, and their resolutions, are not simple. Most advanced users would also benefit from reinforcing important ideas.

First, thanks for taking time to respond in depth.

I’ll start with what I 100% agree with:

I agree, the topic overpromised and failed to set expectations. I also agree that background knowledge is important - here’s what I wrote in the thread on FAQ, which I believe is mostly a rephrasing of your position.

I’ve tried to rewrite the post to reflect this better. In particular, I renamed “strategies” to “hints” (because that’s what they are) and tried to better set expectations.

That’s a great point. Do you believe the wiki is now, after some rewriting, less likely to make people assume the advice is “sufficient”?

There might also be a bit of a difference in values between my approach to pedagogy and yours - from your writing, I get the impression that you put great value on people getting a deep understanding of stats/modelling/domains even if this could mean fewer people engage with your writing. I put great value on letting a lot of people improve their stats/modelling, even if this means those improvements are small and incremental. I don’t think it is useful to try to resolve this difference here; I believe the approaches are complementary and there is enough common ground to often find solutions that satisfy both.

With that said, I should probably describe my goals with this topic in more detail. I really don’t want this to be a checklist. It also cannot be a definitive guide. I want it to be a map. Or a tower you climb to see where you may go - even though you might not be able to resolve the details of all the destinations. Many of the points are the recommendations I most often repeat in response to inquiries here (and I do believe they help, but I admit I do not keep track of the ratio of resolved issues). So it felt useful to have this somewhere discoverable.

Speaking of specific goals/usage scenarios there are actually multiple, so maybe the topic could be reorganized along those goals (although I currently don’t see exactly how):

  • Discoverability of resources - most (all?) of the resources I linked are frequently mentioned in answers to users’ questions and also frequently considered valuable by the question askers. Since this is now a first hit when searching for “divergent”, users should be more likely to find the resources themselves.
  • Better questions - We get many questions where beginner users write a multiline brms formula or 500 lines of Stan code, run, see divergences, incrementally move to adapt_delta = 0.999, max_treedepth = 20, still see divergences and post. If some of them discover this topic and instead try some of the “strategies”, e.g. find a minimal model that still has divergences and post it along with a generator of simulated data, I would consider this a major success. As I said, I don’t aim for people being completely able to resolve their issues, but hope this nudges them to make one more step (and hopefully gain some generalizable insight on the way).
  • Discoverability of tricks by advanced users - Many of the hints are things I would have wanted to know while I was learning to use Stan (e.g. the “narrow priors around true values” hint). I believe even some advanced users may not be aware of all of those “attacks” - and they might be able to use them right just from the short hints.
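To show what “a generator of simulated data” might look like in practice, here is a minimal sketch for a hypothetical linear regression with known true parameters. The model, the parameter values and the function names are all assumptions for illustration; the least-squares fit is only a quick sanity check of the generator, not a substitute for fitting the Stan model to the simulated data.

```python
import random


def simulate_linear_data(n, alpha, beta, sigma, seed=1):
    """Draw (x, y) from y = alpha + beta * x + normal(0, sigma) noise."""
    rng = random.Random(seed)          # fixed seed so others can reproduce the data
    x = [rng.uniform(-2, 2) for _ in range(n)]
    y = [alpha + beta * xi + rng.gauss(0, sigma) for xi in x]
    return x, y


def least_squares(x, y):
    """Closed-form intercept and slope, used here only to sanity-check the generator."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


true_alpha, true_beta, true_sigma = 1.0, 2.0, 0.5
x, y = simulate_linear_data(2000, true_alpha, true_beta, true_sigma)
est_alpha, est_beta = least_squares(x, y)
print(f"true alpha={true_alpha}, beta={true_beta}; recovered {est_alpha:.2f}, {est_beta:.2f}")
```

Posting a short generator like this together with the smallest model that still diverges lets others reproduce the problem exactly and compare the fitted values with the known truth.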

I agree that the ideal situation would be if for each hint we had a linked case study going into detail. We unfortunately don’t have that. I tried to address some of the particular points (many of which are good and relevant), but I still think this should be a map, not a definitive guide, and so I preferred brevity and kept hints for which we currently don’t have good “further reading”. I guess good case studies or at least forum topics for each of the points might already exist, so I hope we will be able to find them and link to them from here.

I also admit that I read some of the points you raised as a bit aggressive, especially when the linked resources provide exactly the details you ask for (e.g. the bayesplot vignette on diagnostics discusses how to use and interpret the mcmc_parcoord plot). I believe this was not your intention, but I have to say it didn’t really help. I also repeat that the goal is to let users discover resources/tricks and be able to do one more step in diagnosis, not to be a definitive guide (and I understand this was not obvious from the previous version of the wiki).

Some more discussion follows:

I agree “identifiability” is not the best term and I like your usage of “degeneracy” for this kind of issue, but I hesitated to use it as most of the content on the forums uses “identifiability”. If others are for adopting “degeneracy” as the preferred term, this would definitely be a place where it should be changed.

I tried to provide some minimal guidance: "make your parameters independent (uncorrelated), constrained by the data and close to N(0,1)". I agree it is neither exhaustive nor straightforwardly applicable, but do you think this is bad guidance?
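To illustrate what “independent and close to N(0,1)” can mean, here is a minimal sketch using a hypothetical hierarchical prior theta ~ normal(0, tau) with log(tau) ~ normal(0, 1.5). It is plain Python rather than Stan, and all names and values are assumptions for the example. It compares the spread of the original coordinate theta with that of the rescaled coordinate z = theta / tau, which is what a sampler would work with in a non-centered parameterization.

```python
import math
import random
import statistics


def funnel_draws(n, seed=1):
    """Draws of (log_tau, theta) with theta ~ normal(0, tau), log_tau ~ normal(0, 1.5)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        log_tau = rng.gauss(0, 1.5)
        theta = rng.gauss(0, math.exp(log_tau))
        draws.append((log_tau, theta))
    return draws


def spread_by_scale(draws, noncentered):
    """SD of the sampled coordinate, separately for log_tau < 0 and log_tau >= 0."""
    small_tau, large_tau = [], []
    for log_tau, theta in draws:
        value = theta / math.exp(log_tau) if noncentered else theta
        (small_tau if log_tau < 0 else large_tau).append(value)
    return statistics.pstdev(small_tau), statistics.pstdev(large_tau)


draws = funnel_draws(20000)
print("centered (theta):   ", spread_by_scale(draws, noncentered=False))
print("non-centered (z):   ", spread_by_scale(draws, noncentered=True))
```

In the centered coordinate the spread of theta changes strongly with tau, so no single step size fits both regions; in the rescaled coordinate the spread stays near 1 regardless of tau. In a Stan program the same idea would declare z as the parameter and compute theta = tau * z in the transformed parameters block. Whether this helps in a real model depends on how strongly the data inform theta, so treat it as one option to try rather than a universal fix.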

Also note that this is a wiki post and you are welcome to expand/edit it or add links or correct mistakes directly (it’s OK if you prefer discussing below, though).

Thanks for the input, I believe it made the wiki better.

Currently the guidance is very R-centric. Could @ahartikainen look over this and link to some resources in Python? (It is a wiki, so you should be able to freely edit this.) As I don’t use Python myself, I am a bit at a loss…
