Meaning of divergences in prior predictive checks

Hi,
I am currently working on modelling low-density lipoprotein aggregation, using a nonlinear mixed-effects model. My priors are weakly-informative (or vague, still a bit hesitant about how to describe them, despite reading this). I am currently stuck on the prior predictive check (using sample_prior = "only" option in brms), and encounter divergences when I run my model.

Warning messages:
1: There were 227 divergent transitions after warmup. Increasing adapt_delta above 0.999999999999 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: Examine the pairs() plot to diagnose sampling problems

which I get when I run only 2 chains, with iterations = 3000 and warmup=1000.
Here’s the code I’m using:

fit_model  <- function(Data){
  
  fit<- brm(
    bf(LDL ~ (1 + exp ((gamma-Time) / delta))/(15*exp((gamma-Time) / delta) + alpha),
       alpha ~ 1 + (1|Pat/Cat),
       gamma ~ 1 + (1|Pat/Cat),
       delta ~ 1,
       nl = TRUE),
    prior = c(
      prior(normal(3000,1000), class="b",lb=0, nlpar = "alpha"),
      prior(normal(2.5,1), class="b",lb=0, nlpar = "gamma"),
      prior(normal(0.2,0.25), class="b", lb=0, ub=1, nlpar = "delta"),
      prior(student_t(3, 0, 59), class="shape"),
      prior(normal(0,0.5), class="sd", group="Pat:Cat", coef="Intercept", nlpar="gamma"),
      prior(normal(0,500), class="sd",group="Pat:Cat", coef="Intercept", nlpar="alpha"),
      prior(normal(0,0.5), class="sd", group="Pat", coef="Intercept", nlpar="gamma"),
      prior(normal(0,500), class="sd",group="Pat", coef="Intercept", nlpar="alpha")),
      
    data = Data, family = Gamma(link="inverse"),
    chains = 2,
    iter=3000,
    warmup = 1000,
    cores = getOption("mc.cores",1L),
    sample_prior = "only",
    thin = 1,
    control = list(adapt_delta = 0.999999999999, max_treedepth=15),
    verbose = TRUE
  )
  
  return(fit)
}
fit <- fit_model(under1500_bl101_sos100)

Now, I think the main thing I am struggling with in this situation is that I don’t actually understand what divergences mean in the context of prior predictive checks.
I tried to go back and read on divergences here and found this:

The primary cause of divergent transitions in Euclidean HMC (other than bugs in the code) is highly varying posterior curvature, for which small step sizes are too inefficient in some regions and diverge in other regions.

But, as far as I understand, when we’re doing prior predictive checks we are only sampling from the prior and don’t take the data into consideration, so what do divergences mean in this case, if above it seems they are defined in the context of the posterior? Is there something basic that I don’t understand?

I would really appreciate it if someone can give me some directions, even if it’s pointing out to some papers and/or blog posts. I spent a significant amount of time trying to understand it, but I seem to just be getting more confused. Thanks!

2 Likes

In this case, it would mean

The primary cause of divergent transitions in Euclidean HMC (other than bubs in the code) is highly varying prior curvature, for which small step sizes are too inefficient in some regions and diverge in other regions.

Divergent transitions can also be caused by code that is valid but that yield numerically inaccurate gradients. But in this case, I would imagine it is a combination of priors that are much too weak for the sort of nonlinearity being asserted in the model.

4 Likes

Hi, Ben,

Thanks for your reply, that’s pretty useful!
I tried to experiment with a bunch of other priors, which helped a little with the divergences (but not too much) and also changed the link function, which decreased their number from around 2000 to around 100, so I guess that was a good move in this case.
I also had a look at Jonah’s paper on visualization and changed my priors a little, based on the pairs() plot, which helped some more.

Interestingly enough, regardless of the changes in the priors and the link I made, every time I run the whole model and just sample from the posterior, I don’t run into any divergences. What’s more, my posterior predictive checks seem very reasonable and the predictions I get are reasonable enough. I even kept a small test set, and my model performed reasonably well on the unseen values as well. So I guess in this case the emphasis falls on the data and not the weak priors?

I also tried running a simpler model (same formula but less random effects) and still ran into the same issue - prior divergences, but no problems when sampling from the posterior.

I am really confused as to what to do next. I ran out of ideas of things to try out. And also am still not completely sure where the problem is.
Do you have any suggestions of things I could potentially look into?

Thanks again!

1 Like

I would say that your priors are putting positive probability on regions of the parameter space with high curvature and / or low numerical accuracy but conditional on the data, those regions have zero probability. So, your priors are probably too weak or otherwise not that consistent with the data, but the data saved you and the posterior draws are fine to make inferences with.

4 Likes

Could you please help me understand Why/where would a prior that contains only normal variables (and one student-t) have areas of high curvature or low numerical accuracy. Isn’t this the most basic use case?

Have a read about the centered and the non-centered parametrization for hierarchical models. Depending on your data at hand either parametrization can imply difficult high curvatures as long as the priors are not super strong on the between-group variability.

Thank you @wds15 for your fast reply. I know about the affect of centered parametrizations. However I don’t see how that takes affect in this case. It seems as though all priors are straight forward.

I am however unfamiliar with the “brm” function so it might be that the hierarchical structure is implied by it.

Hi Nadav,

We’ll be able to give better advice if you can post the brms/stan syntax that’s giving these divergences

Hi @andrjohns, It’s in the message that started this thread. Thanks.

Right, that model is an example of a hierarchical structure. Looking at just the formula:

    bf(LDL ~ (1 + exp ((gamma-Time) / delta))/(15*exp((gamma-Time) / delta) + alpha),
       alpha ~ 1 + (1|Pat/Cat),
       gamma ~ 1 + (1|Pat/Cat),
       delta ~ 1,
       nl = TRUE)

You can see that both alpha and gamma have random intercepts (grouped within Pat & Cat). These intercepts then also have prior distributions:

    prior = c(
      prior(normal(3000,1000), class="b",lb=0, nlpar = "alpha"),
      prior(normal(2.5,1), class="b",lb=0, nlpar = "gamma"),
      prior(normal(0.2,0.25), class="b", lb=0, ub=1, nlpar = "delta"),
      prior(student_t(3, 0, 59), class="shape"),
      prior(normal(0,0.5), class="sd", group="Pat:Cat", coef="Intercept", nlpar="gamma"),
      prior(normal(0,500), class="sd",group="Pat:Cat", coef="Intercept", nlpar="alpha"),
      prior(normal(0,0.5), class="sd", group="Pat", coef="Intercept", nlpar="gamma"),
      prior(normal(0,500), class="sd",group="Pat", coef="Intercept", nlpar="alpha")

This is a hierarchical model because alpha and gamma are both parameters with hyperparameters that also have prior distributions. Hierarchical models like this can induce this strong curvature that causes issues. For more background on that, see this section from the User’s Guide: https://mc-stan.org/docs/2_24/stan-users-guide/reparameterization-section.html

1 Like