I’m fitting categorical models with n = 2,318, two group-level effects per category contrast and 30 to 40 population-level effects per category contrast. The total number of estimated parameters is thus in the 100-120 range. I run 4 chains with 1,000 warmup iterations and 2,500 sampling iterations for a total of 10,000 post-warmup iterations.

I use N(0, 4) priors for the population-level parameters, and Exp(2) priors for the group-level SDs. These are fairly weak, as the intention is simply to rule out extreme values.
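For concreteness, these priors can be sampled directly to see what they imply. A NumPy sketch (my assumptions: N(0, 4) is read as sd = 4, in brms's normal(mean, sd) notation, and Exp(2) as an exponential with rate 2):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumption: N(0, 4) means sd = 4 (brms notation), Exp(2) means rate = 2.
beta = rng.normal(0.0, 4.0, size=100_000)            # population-level logits
sd_group = rng.exponential(1.0 / 2.0, size=100_000)  # group-level SDs (mean 0.5)

# ~95% of prior mass for a coefficient lies within +/- 8 (two sds),
# so logits beyond ~10 in absolute value are effectively ruled out.
print(np.mean(np.abs(beta) < 8))   # roughly 0.95
print(sd_group.mean())             # roughly 0.5
```

On the logit scale, ±8 already covers probabilities from essentially 0 to essentially 1, which is consistent with the stated goal of only excluding extreme values.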

Given that a single model takes over an hour to fit, having to refit a model due to divergent transitions is a big deal that I would rather avoid whenever possible. Thus, for example, I am inclined to ignore the following warning:

There were 2 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

Two divergent transitions in 10,000 draws, with every Rhat at 1.0, seem few enough to ignore, given that the divergence rate is only 0.02%.

Another, similar model had 22 divergent transitions for a total divergence rate of 0.22%. I am less sure about the safety of ignoring this. However, like I said above, refitting a model is no fun when it entails an hour-long wait.

In short: is there a guideline about what percentage of divergences out of the total number of non-warmup iterations is safe to ignore when all Rhats = 1?

you probably don’t like to hear this but… I would take every divergence seriously. I agree that there’s probably not a problem in this case, but having divergences is an indication that something is wrong. That it takes so long to sample could also be an indication.

I assume you’ve done prior predictive checks and that the priors are sane?

In both the 2-divergences case and the 22-divergences case? My own inexpert intuition was that the latter was much likelier to indicate a problem. You are correct that I don’t like to refit models, but if there’s a serious risk of biased estimates, I have no choice. The question remains: how few divergences out of 10,000 posterior draws are few enough to ignore? I’d love to get more than one opinion too, if possible.

As for prior information, the priors I use have now been added to the original post. They are sane in the sense that they rule out preposterously large absolute values of the population-level logits. I have tested them against an equivalent frequentist model that is simple enough to achieve convergence, verifying that both the Bayesian model with these priors and the frequentist model with no priors yield similar parameter estimates when there is no (quasi-)complete separation. This is the desired behavior because I don’t want to implement shrinkage except in cases where the MLE is infinite.

I don’t fully understand the concept of prior predictive checks yet, as this is the first time I’ve heard of it. My population-level priors are all centered around 0 to reflect the principle of indifference. Their purpose is simply to prevent the sampler from considering logits greater than 10 in absolute value. If I’ve understood the idea of prior predictive checks correctly (and I may not have), simulating new y from these priors would simply yield an even distribution of the 4 outcome categories.
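To make that intuition concrete, here is a minimal prior predictive sketch in NumPy (my assumptions: N(0, 4) means sd = 4, and I use a softmax over four free logits as a stand-in for the actual categorical model):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-in for the model: draw one logit per outcome category
# from the N(0, 4) prior, then convert to probabilities via softmax.
n_sims = 50_000
logits = rng.normal(0.0, 4.0, size=(n_sims, 4))
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)

# Averaged over prior draws, the categories are indeed even (~0.25 each,
# by symmetry)...
marginal = p.mean(axis=0)

# ...but within a single prior draw the implied outcome is often
# near-deterministic: in a large share of draws, one category gets
# more than 90% of the probability mass.
dominant = (p.max(axis=1) > 0.9).mean()
print(marginal, dominant)
```

So the marginal distribution of y is even, as you say, but each individual prior draw tends to put almost all mass on one category. That per-draw behavior, rather than the marginal average, is the kind of thing a prior predictive check is meant to surface.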

YMMV, but I take every divergence seriously. Even if I have one (1). In some cases you will see that it can even depend on the random seed (i.e., you run the model twice and first it looks ok, and the 2nd time you get 1 divergence). Even in those cases I’m a bit worried - divergences indicate that your posterior is a bit hard for MCMC to explore and thus indicate that you potentially have a biased posterior. That is something that should always worry you.

However, I do have colleagues who say “well that doesn’t matter since it was just x divergences, at least my effective sample size looks good for the parameters of interest.” It goes without saying I don’t subscribe to that view. :)

Concerning prior predictive checks: you basically set up your model with all its priors, then sample from the priors only. Then you plot the implied outcomes and check that they are sane and don’t look funky on the outcome scale. This way you catch the very strange behaviors you sometimes see when priors are combined.

Many times you will see that your priors are too broad, i.e., the sampler entertains parameter values that are ridiculous. For example, a recent study I did showed that my priors allowed for projects where there were billions of people involved - that is not sane. We can say for sure that a project team with more members than the earth’s population is not realistic. :)
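That failure mode is easy to reproduce in a toy example: with a log link, even a prior that looks harmless on the linear scale puts real mass on absurd outcomes. A hypothetical sketch (the N(0, 10) intercept prior and the team-size model are purely illustrative, not from the actual study):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical model: team size modeled with a log link, and a
# "weakly informative" N(0, 10) prior on the intercept.
intercept = rng.normal(0.0, 10.0, size=100_000)
team_size = np.exp(intercept)  # prior predictive mean team size

# Share of prior draws implying a team larger than the world
# population (~8e9) - on the order of 1%, which is not sane.
share = np.mean(team_size > 8e9)
print(share)
```

A percent or so of prior mass on impossible teams may sound small, but it means the sampler has to entertain those regions, and it is exactly the kind of pathology a prior predictive check makes visible before fitting.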