Solving prior-data conflicts in zero-inflated beta binomial model

I have been posting a lot here recently (here and here and here) as I work through Aki Vehtari’s workflow analysing substance use data.

It is a longitudinal hierarchical model examining the effect of amphetamine use at the start of treatment on subsequent amphetamine use during the following year of treatment. outcome is the number of days of amphetamine use in the previous 28 days, measured at the start of treatment and then at least once (but sometimes more often) during treatment. set is the number of days on which it was possible to have used amphetamine at each measurement point (always 28). yearsFromStart is a numeric variable indicating when each measurement was taken, in years since the start of treatment, with all values between 0 and 1. ats_baseline is the number of days of amphetamine use in the 28 days prior to starting treatment.

I have followed a long, winding path through several models, starting with a Gaussian model and ending with a zero-inflated beta-binomial model with a quadratic polynomial term, comparing each new model to the previous one with LOO-CV and checking for prior-data conflicts using the powerscale_sensitivity() function from the priorsense package.

I have been trying to get my most recent model - which includes a random slope term - to work, and finally did, with sufficient effective sample size and Rhat = 1.00 for all parameters. I had to run four chains of 7000 iterations each (instead of the brms default of 2000 per chain) to get there. Model specification:

library(brms)

fit_zibetabinomial_quad_rs_ats <- brm(formula = bf(outcome | trials(set) ~ ats_baseline*poly(yearsFromStart,2) + (yearsFromStart | encID)),
                                      data = workingDF,
                                      family = zero_inflated_beta_binomial(),
                                      prior = c(prior(normal(0, 5),
                                                      class = Intercept),
                                                prior(normal(0, 5),
                                                      class = b),
                                                prior(cauchy(0, 5),
                                                      class = sd)), # add prior on corr matrix
                                      save_pars = save_pars(all=TRUE),
                                      seed = 1234,
                                      iter = 7e3,
                                      refresh = 0,
                                      cores = 4,
                                      control = list(adapt_delta = 0.99))

Here is the output:

  Links: mu = logit; phi = identity; zi = identity 
Formula: outcome | trials(set) ~ ats_baseline * poly(yearsFromStart, 2) + (yearsFromStart | encID) 
   Data: workingDF (Number of observations: 2301) 
  Draws: 4 chains, each with iter = 7000; warmup = 3500; thin = 1;
         total post-warmup draws = 14000

Multilevel Hyperparameters:
~encID (Number of levels: 687) 
                              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                     0.91      0.11     0.71     1.13 1.00     5782     9927
sd(yearsFromStart)                3.30      0.39     2.56     4.11 1.00     2891     6307
cor(Intercept,yearsFromStart)     0.91      0.08     0.71     1.00 1.00      759     2026

Regression Coefficients:
                                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                            -4.58      0.13    -4.84    -4.32 1.00     7221    10014
ats_baseline                          0.21      0.02     0.17     0.25 1.00     4642     7479
polyyearsFromStart21                -20.51      3.85   -28.06   -13.00 1.00    10602    10617
polyyearsFromStart22                 -8.54      2.64   -13.63    -3.21 1.00    15726    11672
ats_baseline:polyyearsFromStart21    -2.76      0.70    -4.16    -1.40 1.00     6527     9481
ats_baseline:polyyearsFromStart22     2.77      0.41     1.97     3.56 1.00    14359    10770

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     4.82      0.53     3.84     5.94 1.00     6821     9077
zi      0.07      0.02     0.02     0.12 1.00     8162     6562

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

The previous model, exactly the same except with no random slope term (i.e. with (1 | encID) instead of (yearsFromStart | encID)), sailed through: great Rhat and ESS, and no prior-data conflicts. With the addition of that random slope, however, there are suddenly prior-data conflicts for many of the parameters.

library(priorsense)
library(tinytable)  # provides tt() and format_tt()

powerscale_sensitivity(fit_zibetabinomial_quad_rs_ats,
                       variable = variables(as_draws(fit_zibetabinomial_quad_rs_ats))[1:11]) %>%
  tt() %>%
  format_tt(num_fmt = "decimal")

Up to now I have been using the approach that if there are no conflicts, the model is good to go to the next step of posterior predictive checks etc. But now I am stuck. I have tried multiple different versions of these basic priors, including adding a prior on the correlation matrix (which seems to be the problem) of the form prior(lkj(2), class = cor). But the conflicts remain. I wonder if I could get help: interpreting the meaning of the values in the 'prior' and 'likelihood' columns, using these to specify better priors, or in fact any other advice.
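For reference, the variant with the correlation-matrix prior looked something like this:

priors_lkj <- c(prior(normal(0, 5), class = Intercept),
                prior(normal(0, 5), class = b),
                prior(cauchy(0, 5), class = sd),
                prior(lkj(2), class = cor))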

I suspect this might have something to do with the use of orthogonal polynomials. I know I recommended them to you in the other thread (to deal with the correlation between the linear and quadratic terms), but those terms are now expressed on a different scale, and you can see they are associated with much larger coefficients.

Maybe it would help to set wider priors, specifically on the yearsFromStart terms … or perhaps another option would be to obtain the orthogonal polynomials separately, rescale them, and then enter separate linear and quadratic predictors into your model.
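To sketch the first option (the coef names are taken from your model summary above, and normal(0, 50) is just an arbitrary "wider" choice, not a recommendation):

priors_wide <- c(set_prior("normal(0, 5)",  class = "Intercept"),
                 set_prior("normal(0, 5)",  class = "b"),
                 set_prior("normal(0, 50)", class = "b", coef = "polyyearsFromStart21"),
                 set_prior("normal(0, 50)", class = "b", coef = "polyyearsFromStart22"),
                 set_prior("cauchy(0, 5)",  class = "sd"))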

Thank you for replying again @jverissimo. Can you explain what you mean by “these terms are now expressed in a different scale”?

Also when you say to use a wider prior on the “age terms” what variable do you mean, yearsFromStart?

Also could you explain how to go about rescaling the polynomials separately?

Yes, sorry, I meant yearsFromStart. You could just set wider priors for these effects. I'm not sure that is the issue, but it could help.

By rescaling the polynomials, I meant that you can create two new variables (columns in your data frame) from poly(yearsFromStart, 2), rescale them (e.g., multiply by 100), and then enter them into the model separately. (You might lose something with this approach, though; e.g., if you want to plot the effect of yearsFromStart, you now have two variables rather than one.)
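A minimal sketch of that idea (yfs_lin, yfs_quad, and the factor of 100 are just illustrative names and scaling):

# extract the two orthogonal polynomial columns, rescale, and add to the data
P <- poly(workingDF$yearsFromStart, 2)
workingDF$yfs_lin  <- 100 * P[, 1]
workingDF$yfs_quad <- 100 * P[, 2]

# then enter them as separate predictors (priors and other arguments as in your original call)
fit_rescaled <- brm(outcome | trials(set) ~ ats_baseline * (yfs_lin + yfs_quad) +
                      (yearsFromStart | encID),
                    data = workingDF,
                    family = zero_inflated_beta_binomial())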

The values in those columns are a sensitivity diagnostic measuring how much the posterior changes as the prior / likelihood is modified through power-scaling (aka tempering). The higher the value, the more sensitive the posterior is to the change. We suggest a threshold of 0.05 as indicating sensitivity, but as with many hard thresholds, it is best thought of as a guide only.

If you are having difficulty finding the source of the conflict, it may help to look at the diagnostic plots (powerscale_plot_dens() and others).
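For example, to focus on the group-level parameters (the variable names follow the brms draw naming; variables(fit_zibetabinomial_quad_rs_ats) shows the exact strings):

powerscale_plot_dens(fit_zibetabinomial_quad_rs_ats,
                     variable = c("sd_encID__yearsFromStart",
                                  "cor_encID__Intercept__yearsFromStart"))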

Also, if you think it is an issue with the prior on the correlation between the by-group intercepts and slopes, you could perhaps confirm this by fitting a model that forces independence, using || instead of |, and seeing whether the conflicts disappear.
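Something like this sketch, which refits the same model with the correlation removed (update() will recompile the model, since the parameter structure changes):

fit_uncorr <- update(fit_zibetabinomial_quad_rs_ats,
                     formula. = outcome | trials(set) ~
                       ats_baseline * poly(yearsFromStart, 2) +
                       (yearsFromStart || encID))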

Adding to Noa's comment: sometimes what is reported as conflict is due to strong posterior dependencies and weak identifiability of the marginals. That is, the posterior marginals may be sensitive to changes in the prior while the predictions are not. I've seen this happen with zero-inflated models, but I don't have it in any case study yet, as we wanted to investigate it further (and have since been busy with other things).

Thank you so much @n-kall. Sorry for the late reply; I was working on a grant and then had a conference.

I ran the analysis with uncorrelated intercepts and slopes via || in the random part of the model and, as you suggested might happen, all the prior-data conflicts disappeared. This is great but, as is usually the case, it led me to more questions.

(1) What does it actually mean when making the intercepts and slopes uncorrelated causes the prior-data conflicts to disappear?

(2) Is there any implication for inference from this model? Do you interpret the higher-order effects any differently?

Thanks @avehtari; see my comment to n-kall above: what are the implications for my model if the posterior marginals are sensitive to changes in the prior? Does it affect inference if this is the case? Do you have any best-practice suggestions for this eventuality?

Since you are using brm(), I'm assuming you didn't choose which priors to scale, and if removing the LKJ prior on the correlation makes the prior sensitivity go away, that would indicate there is only weak information about the correlation via the likelihood. Unfortunately, the implementation for selecting which priors are scaled is still a work in progress.

What is the quantity of interest? If the posterior for the quantity of interest does not show prior-likelihood conflict, then go ahead as usual. Even better, choose which priors to focus on.
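For example, if the ats_baseline effects are your quantities of interest, you could restrict the check to just those draws (brms prefixes the coefficient draws with b_):

powerscale_sensitivity(fit_zibetabinomial_quad_rs_ats,
                       variable = c("b_ats_baseline",
                                    "b_ats_baseline:polyyearsFromStart21",
                                    "b_ats_baseline:polyyearsFromStart22"))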


As usual, @avehtari, the vast gulf between what you know and what I know means that what are simple answers for you lead to more questions for me. Here is the latest round, in response to your last comments.

(1) I'm afraid I don't know what you mean by 'which priors to scale'. What does it mean to scale a prior? The coefficients for the linear and quadratic yearsFromStart terms increased dramatically when I used the poly() syntax for the polynomials. Should I have changed the priors to account for this change?

(2) What do you mean by 'only weak information about the correlation via the likelihood'? Do you mean that the data do not show a strong correlation between (in this case) the random intercepts and random slopes? I am still not clear why this would cause a conflict between prior and likelihood. Do 'weak information' and 'absence of a strong correlation' mean the same thing?

Apologies for all the questions.

See Section 2.2 in Detecting and diagnosing prior and likelihood sensitivity with power-scaling | Statistics and Computing, which discusses how sometimes we need to choose which priors to power-scale. Unfortunately, the paper and the examples there (with code at GitHub - n-kall/powerscaling-sensitivity: Code for paper "Detecting and diagnosing prior and likelihood sensitivity with power-scaling") are currently the only examples of how to choose which priors to power-scale. We should add an example to the priorsense vignettes, too. Eventually, when brms gets the additional feature, it will be easy to choose which priors to focus on.

The data probably do not have information on whether the correlation is strong or weak, but I can't be certain without looking at the posterior.

The current threshold for prior and likelihood sensitivity is quite low, and we have seen cases that would be better not labelled as "conflict": both the prior and the likelihood are weakly informative, yet both are just above the threshold. This is more likely to happen when the model has individually (or in a small group) weakly identified parameters, even when the joint effect is quite well identified.

No need for apologies. It's great that you ask, as we learn what is missing from our paper, documentation, and examples. Unfortunately, right now I'm tight on time and my answers are short, but eventually we'll have longer answers with examples to your questions.
