Best way to identify pathologies from divergent transitions (general step by step workflow)

Hello,

following a post on the old Google Groups mailing list,

I am interested in learning how to do what @betanalpha suggested:

At the most general level, divergences occur when the posterior
on the unconstrained scale features strong curvature. This
is straightforward to identify by running many iterations and
looking at the samples that diverged and those that didn't to
hunt down the source of the problem, which can then inform
what in your model is ultimately being problematic and what
you might want to adjust.

In particular, I am looking for a practical example of what program and what typical steps (tabs and choices in ShinyStan) to take in order to:

looking at the samples that diverged and those that didn't to
hunt down the source of the problem

For example, I'm always trying to understand whether "this sample" diverged because a particular parameter took a particular value, ideally by visualising all of my 100 parameters for that divergent iteration and somehow inferring which parameter is the problematic one.

A good resource is this:

https://betanalpha.github.io/assets/case_studies/rstan_workflow.html

However, I could not grasp the workflow that experienced Bayesian modellers follow to identify "the reason" for divergences; for example, which parameters, and which sampled values of them, lead to divergence/instability.

I suspect that what is behind this sentence

Quantitative diagnostics can be found in the "Diagnose" tab. Divergences will also be shown in the plots created in the "Explore" tab.

is really a basic step-by-step approach to diagnosis, e.g.:

1) use this tab to analyse this…
2) compare these parameters based on some logic and observe whether this happens…

Is there any detailed case study about these strategies?

Divergences happen when the simulated Hamiltonian dynamics diverge from the true path. In practice, I see these come up in models in two ways: in hierarchical models, as in Michael's case study, and when parameters take on extreme values (roughly e^{±15}).

You can identify these by looking at pairs plots in shinystan. If you see a funnel shape, it is likely the former; if the samples tend to accumulate near extreme values, then it is the latter.
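For concreteness, here is a minimal R sketch, assuming a stanfit object called fit and placeholder parameter names ("mu", "tau", "theta[1]") that you would replace with your own:

library(rstan)      # pairs() method for stanfit objects
library(bayesplot)  # mcmc_pairs() with divergence highlighting

# rstan's pairs() method marks divergent transitions as red points
pairs(fit, pars = c("mu", "tau", "theta[1]"))

# bayesplot alternative: pass the NUTS sampler info so divergences are drawn
np <- nuts_params(fit)
mcmc_pairs(as.array(fit), pars = c("mu", "tau", "theta[1]"), np = np)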

Another plot that people find helpful is the parallel coordinate plot. There is a thread on it on the forum: Concentration of divergences

If you have parameters that are related to each other, such as in hierarchical models, this is a good place to start. Another place is looking at the traceplots of potentially problematic parameters. Look for places where the parameter gets ‘stuck’ for several iterations.
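For example, with a hypothetical suspect parameter "tau" in a stanfit object called fit:

library(rstan)
library(bayesplot)

# built-in rstan traceplot for a potentially problematic parameter
traceplot(fit, pars = "tau", inc_warmup = FALSE)

# bayesplot version that also marks divergent iterations along the trace
mcmc_trace(as.array(fit), pars = "tau", np = nuts_params(fit))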

If you are getting the latter, you may be able to avoid it by having tighter priors. If you can’t change the priors, it can be helpful to keep the parameters in log space as much as possible. If you have positive parameters that you later need to take the log of in the model, you can sample on the unconstrained space to avoid doing and then undoing the transform. At this point, I do everything but simplex transformations manually for numeric stability.

If you go this route, make sure to include the Jacobian correction.


Also, the case study on mixture models may be helpful.

I don't think there are that many case studies on this; that's why I am trying to cover them in my blog (but from previous discussions I think you are aware of those posts):
General intro to divergences + some links to examples: http://www.martinmodrak.cz/2018/02/19/taming-divergences-in-stan-models/
So far, however, I have not delved too much into the individual strategies in detail (besides the non-identifiability post); that is left for the future.

Also, there is IMHO really no general strategy besides "think hard". To paraphrase: converging models are all alike; every divergent model diverges in its own way.


Also I recently worked with Jonah Gabry to improve the vignette on visual diagnostics (of divergences and other issues) in the bayesplot package - it is not yet merged (pull request here: https://github.com/stan-dev/bayesplot/pull/153), but I’ve made the WIP version available in case it is of interest to you: https://popelka.ms.mff.cuni.cz/~cerny/tmp/visual-mcmc-diagnostics-wip.html

This is a good starting point, but often the problem lies in a combination of parameters. That's why the bivariate and trivariate plots in ShinyStan, or using pairs and mcmc_parcoord, are helpful. In my experience, samples marked as divergent do not necessarily occur at the exact problematic area of parameter space. I would even say that divergent samples tend to be reported outside of the problematic region, somewhere before the geometry starts to be difficult (note this is my current guess, I am not an expert on the NUTS sampler). We're having some discussion about why this is the case and if/how to improve the diagnostics here: Getting the location + gradients of divergences (not of iteration starting points)


This is ideally the scenario I would like to find myself in: namely, produce a neat summary plot and use it to identify a pathology. I imagine that with 1000 parameters and more complex hierarchies it might not be that easy. @jonah any plan to add this to ShinyStan?

Could you give a small, concrete example?

This is a great example of a starting point, something that, together with all the case studies spread across the internet and Discourse, I think should exist in an organized fashion: a guide on how to detect and solve the "typical" pathologies, from the simplest up to the more complex, with a few case examples for each of them. This is really good to see. @Bob_Carpenter I imagine this as a useful chapter for your next book on the modelling workflow. For a still-newbie modeller like me, a complete guide on diagnosis would make the modelling process much more meaningful (unless I am missing some key online resource).

A fairly straightforward example:

data {
  real<lower=0> a;
  real<lower=0> b;
}
parameters {
  real<lower=0> x;
}
model {
  x ~ gamma(a, b);
}

Try:

data {
  real<lower=0> a;
  real<lower=0> b;
}
parameters {
  real x_log; // log(x), unconstrained
}
model {
  target += x_log; // Jacobian correction for x = exp(x_log)
  target += a * log(b) - lgamma(a) + (a - 1) * x_log - b * exp(x_log); // gamma(a, b) lpdf at x = exp(x_log)
}

If a and b are constants, the a * log(b) - lgamma(a) term can be omitted.

This is indeed common. I think one of the new plots in the bayesplot vignette PR illustrates this nicely. When you have a funnel geometry the plotted divergences will not be in the really bad area of the funnel (the chains never got there!), but rather where the geometry starts to become problematic, i.e., when the funnel begins to narrow. You only get to see the whole funnel after you reparameterize to indirectly sample from it.

Do you mean adding the parallel coordinates plot? If so then yeah, it could go in shinystan too. It already exists in bayesplot as mcmc_parcoord().
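For reference, a minimal usage sketch (assuming a stanfit object called fit); standardizing the parameters via the transformations argument helps when they live on very different scales:

library(bayesplot)

np <- nuts_params(fit)

# divergent iterations are drawn in a contrasting colour
mcmc_parcoord(as.array(fit), np = np)

# with parameters on very different scales it helps to standardize them first
mcmc_parcoord(as.array(fit), np = np,
              transformations = function(x) (x - mean(x)) / sd(x))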

I see. Yes, I forgot about bayesplot; I should explore it more rather than relying on ShinyStan so much.

I have this problem with non-hierarchical models with hundreds of parameters and it can be challenging to identify which parameters to try visualizing. I’ve always wondered if it would be possible to test posterior samples to give clues about which parameters could be problematic, and then visualize with that subset of parameters.

For each parameter, could you test whether the values at which the divergences occurred look like a uniform draw from the marginal posterior? Presumably there's some metric that would allow this (essentially asking whether the divergent samples are plausibly from the marginal). Then you have a metric for each parameter and can start by plotting the, say, 10 "worst".

Wouldn’t parameters where the divergences consistently occur at high or low values pop out from the others? It seems like that could be useful.
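One way to sketch that kind of per-parameter check in R, using a two-sample Kolmogorov-Smirnov test as a purely illustrative metric, assuming a stanfit object called fit with at least a handful of divergent iterations:

library(rstan)

# flag divergent post-warmup iterations, chain 1 first, then chain 2, ...
sp <- get_sampler_params(fit, inc_warmup = FALSE)
divergent <- unlist(lapply(sp, function(x) x[, "divergent__"])) == 1

# flatten draws to (iterations * chains) x parameters in the same chain order
arr <- as.array(fit)
draws <- matrix(arr, nrow = prod(dim(arr)[1:2]))
colnames(draws) <- dimnames(arr)[[3]]

# per-parameter test: do the divergent draws look like a random subsample
# of the marginal (i.e. of the non-divergent draws)?
p_vals <- apply(draws, 2, function(col)
  suppressWarnings(ks.test(col[divergent], col[!divergent])$p.value))

# the 10 parameters whose divergent draws differ most from the marginal
head(sort(p_vals), 10)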

I think a good place to start generally is to plot one or a handful of “local” parameters against a scale parameter that they all share a dependency on. For complicated models you may have many different hierarchical scale parameters and then you can make several plots, one for each scale parameter, and each including a few of the parameters depending on that scale parameter.
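For example (a sketch with hypothetical names: "theta[1]" as a local parameter and "tau" as the scale parameter it depends on, in a stanfit object called fit):

library(bayesplot)

# scatter of a local parameter against its shared scale, divergences highlighted;
# plotting the scale parameter on the log scale usually makes a funnel easier to see
mcmc_scatter(as.array(fit), pars = c("theta[1]", "tau"),
             transformations = list(tau = log),
             np = nuts_params(fit))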

For the 'too many parameters to plot' case, I wrote a function https://global.discourse-cdn.com/standard14/uploads/mc_stan/original/1X/b7864f42254fe1d3de1597ade2573009a6137526.R to try to identify problematic parameters. Limited testing suggested it was doing a reasonable job; you can find some further discussion in this post: Concentration of divergences

edit: If I’ve made any updates to the function, you’ll find it in the ctsem github repository…

I was thinking that a function like this would be really useful: one that prints the first N parameters whose divergent transitions are most concentrated around a particular value.

That way, if we had a schools-style data set with 10K parameters, the problematic tau would be plotted first. Does that make sense?

I would code it myself; I just have to find out how to extract the parameter values for the divergent samples (any suggestions?).

P.S. An option for normalizing all the plotted parameters would be useful too!
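One possible way to do the extraction, sketched with rstan's get_sampler_params() and assuming a stanfit object called fit; the sd-ratio "concentration" score at the end is just one crude, illustrative choice:

library(rstan)

# divergence flags per post-warmup iteration, chain 1 first, then chain 2, ...
sp <- get_sampler_params(fit, inc_warmup = FALSE)
divergent <- unlist(lapply(sp, function(x) x[, "divergent__"])) == 1

# flatten the draws to (iterations * chains) x parameters in the same chain order
arr <- as.array(fit)
draws <- matrix(arr, nrow = prod(dim(arr)[1:2]))
colnames(draws) <- dimnames(arr)[[3]]

# parameter values at the divergent iterations
div_draws <- draws[divergent, , drop = FALSE]

# crude "concentration" score: how much narrower the divergent draws are
# than the full marginal (smaller = more concentrated); plot the top N first
score <- apply(draws, 2, function(col) sd(col[divergent]) / sd(col))
head(sort(score), 10)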

A more semi-automated approach here would be the following:

  • Create a data frame where rows are MCMC iterations and columns are all of the parameters in the model
  • Augment this with a column that is 0 if that iteration was not divergent and 1 if it was
  • Train a decision tree model using the MCMC parameter values to predict the divergent iterations
  • Inspect the decision tree to identify variables and parts of the parameter space that are associated with divergences

You could do something similar using, e.g., a random forest or gradient-boosted tree model and SHAP values.
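A rough sketch of that recipe with rpart, assuming a stanfit object called fit; since divergent iterations are typically rare, the complexity parameter is lowered so the tree makes any splits at all:

library(rstan)
library(rpart)

# divergence indicator per post-warmup iteration, chain 1 first, then chain 2, ...
sp <- get_sampler_params(fit, inc_warmup = FALSE)
divergent <- unlist(lapply(sp, function(x) x[, "divergent__"]))

# data frame: rows = iterations, columns = parameters, plus the 0/1 label
arr <- as.array(fit)
draws <- as.data.frame(matrix(arr, nrow = prod(dim(arr)[1:2])))
names(draws) <- make.names(dimnames(arr)[[3]])   # "theta[1]" -> "theta.1."
draws$divergent <- factor(divergent)

# shallow classification tree predicting which iterations diverged
tree <- rpart(divergent ~ ., data = draws, method = "class",
              control = rpart.control(maxdepth = 3, cp = 0.001))
print(tree)   # splits name the parameters and regions associated with divergences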