Divergent transitions - a primer

The challenge with trying to offer explicit recommendations is that any useful recommendation has to take into account the particular context of a given modeling problem; otherwise it either comes across as too generic and vague or ends up being wrong in that context. Take (8), for example – what exactly is a reparameterization, and why does it help if it’s supposed to be the same model? What is supposed to be reparameterized? How is it supposed to be reparameterized? Time and time again I come across users frustrated by advice like this because they presume that it’s supposed to be sufficient and that the fact that they don’t know how to proceed is their fault.

In my opinion a checklist/flowchart is the wrong direction because it sets up the false expectation that by following it you’ll be able to solve your problem, and when it doesn’t work users end up even more frustrated than before. And even when it does work, they often don’t know why, and hence don’t know how to generalize that resolution to other problems.

You’ll note that in my identifiability/degeneracy case study I don’t offer any particular recommendations but instead lay out a suite of methods to explore pathologies and build up understanding of what could be going wrong. It’s only in my modeling technique courses and case studies that I discuss particular pathologies inherent to the technique and common resolutions that work well in that context.

In your checklist for example, there’s little explanation for why any of the recommendations might be productive.

  1. Why would programming errors induce divergences? Explicit examples of common programming errors that lead to unintentionally pathological target density functions are often more useful (missing a prior implies a uniform prior, missing a bound will cause problems with component density evaluations, bad array indices will leave variables uninitialized, etc.); a minimal sketch of two of these errors follows this list. See also https://betanalpha.github.io/assets/case_studies/stan_intro.html#8_debugging_stan_programs.

  2. What “true” model configurations should be considered? Just one or many? Similarly, how many simulated data sets per true model configuration? What happens if errors persist with the simulated data? What happens if the errors go away – what is “misfit” and how can it be addressed?

  3. Doesn’t really solve any problems. It’s useful, but for facilitating the investigation, not as an investigation in itself.

  4. What’s a parallel coordinate plot? What do good and bad ones look like? The same questions apply to the pairs plots. If something looks weird, what does one do with it?

  5. This isn’t the technical definition of “identifiability”; I went with “degeneracy” in my case study to avoid terminology clashes and complaints from pedantic statisticians. What kinds of degeneracies can there be? How can each be moderated?

  6. Check priors for what? The checks being recommended sound like they examine posterior samples, not the priors; I’m guessing you mean comparing the posterior to the prior.

  7. “Wide” and “large” are undefined here. The first bullet does mention that unit scale isn’t always appropriate, but how does someone figure out what is? The second bullet point is also getting at a different issue – it’s not the scale parameter itself that’s problematic but the interaction of the scale parameter with other parts of the model, and understanding that interaction is what really matters here.

  8. Reparameterization is ill-defined because there are an infinite number of possible reparameterizations. Do you pick one from a box and try them one at a time? This is where it’s impossible to talk in generality and the conversation needs to focus on the context of particular modeling techniques; the centered/non-centered sketch after this list shows one reparameterization that comes up constantly in practice.

  9. Doesn’t seem that different from 2 and 3.

  10. This works only when doing (2); it’s a good example of how confusing recommendations can be without very carefully defining the appropriate context.

  11. [optimizing computes a penalized maximum likelihood, not a MAP] What is “reasonable”? How do you find out why something isn’t reasonable? What, if anything, does the penalized maximum likelihood tell you about the posterior?

  12. At what points should one run test_grad? When are numerical instabilities okay and when are they important?

  13. How does one play with the sampler configuration? What does each configuration do and why would it help? How do you respond to different behaviors when the configuration is changed?
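
To make (1) concrete, here’s a minimal sketch (the model is invented purely for illustration) that packs two of the common programming errors mentioned above into a single Stan program: a scale parameter declared without its lower bound and a parameter left without a prior statement.

```stan
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real sigma;             // bug: missing <lower=0>, so the sampler can propose sigma <= 0
}
model {
  mu ~ normal(0, 1);
  // bug: no sampling statement for sigma, leaving it an implicit flat prior
  y ~ normal(mu, sigma);  // throws whenever sigma <= 0, surfacing as rejection warnings and divergences
}
```

Declaring `real<lower=0> sigma;` and giving it an explicit prior removes both problems; the point is that the symptoms alone don’t tell you which error, if any, is responsible, which is why explicit examples are so much more useful than “check for bugs”.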
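
Similarly, for (7) and (8), here is the one reparameterization that comes up constantly in practice: the centered versus non-centered parameterization of a simple hierarchical model. This is just the standard eight-schools-style example, not anything taken from the checklist, and `theta_raw` is only an illustrative name.

```stan
// Centered parameterization: theta is sampled directly, so its
// geometry depends strongly on the population scale tau (the "funnel").
data {
  int<lower=1> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta ~ normal(mu, tau);
  y ~ normal(theta, sigma);
}
```

```stan
// Non-centered parameterization: theta_raw is a priori independent of tau,
// and theta is reconstructed deterministically. Same data block as above.
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;
}
transformed parameters {
  vector[J] theta = mu + tau * theta_raw;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 5);
  theta_raw ~ std_normal();
  y ~ normal(theta, sigma);
}
```

When the data only weakly inform each theta the centered form tends to produce the funnel geometry that triggers divergences while the non-centered form samples cleanly; when the data are highly informative the opposite tends to hold. Which form is appropriate is exactly the kind of context-dependent judgment that a one-line “reparameterize your model” can’t convey.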

If you expand these recommendations out to include context, examples, and the like then the entire thing strays away from being a checklist, but that’s exactly my point – discussion of motivation, context, and limitations is needed to ensure that users can employ the recommendations responsibly. Because few users have extensive statistical training – let alone training in Bayesian modeling, Markov chain Monte Carlo, and Hamiltonian Monte Carlo – they won’t know how to employ most of these recommendations appropriately, and in cases without links they also won’t know how to follow up with additional reading.

Ultimately I recommend thinking in terms of a user who doesn’t know what any of these concepts are. Explain in a few paragraphs why the geometry of the posterior density can be problematic for computation and how modeling choices can influence that geometry. Then expand each bullet point with a few paragraphs explaining a type of pathology, why it’s problematic, and how it might be addressed, and then reference further material. In other words, instead of a checklist consider an introduction to the challenging world of Bayesian computation.

The in-depth discussion helps set the expectation that these issues, and their resolutions, are not simple. Even advanced users would benefit from having these important ideas reinforced.
