Tips for priors on multilevel hyperparameters in a hierarchical zero-inflated beta-binomial regression

I have a model estimating rates of change in amphetamine use during the first year of treatment among people receiving opioid agonist treatment (e.g. methadone, buprenorphine). I have been working on it for a while now, following the workflow laid out by Aki Vehtari in this case study, and learning an enormous amount as I (very slowly) go.

After much loo-ing and k-folding I arrived at a zero-inflated beta-binomial regression model with a quadratic trend term. It is specified in brms like this:

fit_zibetabinomial_quad4 <- brm(formula = bf(outcome | trials(set) ~ ats_baseline*yearsFromStart + ats_baseline*I(yearsFromStart^2) + (1 | encID)),
                                data = workingDF,
                                family = zero_inflated_beta_binomial(),
                                prior = c(prior(normal(0, 3),
                                                class = Intercept),
                                          prior(normal(0, 4),
                                                class = b),
                                          prior(cauchy(0, 3),
                                                class = sd)),
                                save_pars = save_pars(all = TRUE),
                                seed = 1234,
                                refresh = 0,
                                cores = 4)

outcome is the number of days of amphetamine use; set is the number of possible days of amphetamine use in each observation period (this is always 28); ats_baseline is the primary predictor, the number of days of amphetamine use in the 28 days prior to starting treatment; yearsFromStart is how many years from the start of treatment each measurement was made; and I(yearsFromStart^2) is the quadratic term.

This is the output from the model

Multilevel Hyperparameters:
~encID (Number of levels: 695) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.77      0.11     1.55     1.99 1.00     1054     1608

Regression Coefficients:
                               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                         -4.84      0.17    -5.19    -4.50 1.00     1566     2189
ats_baseline                       0.36      0.02     0.32     0.40 1.00     1264     1385
yearsFromStart                     2.30      0.62     1.09     3.49 1.00     2462     2817
IyearsFromStartE2                 -2.18      0.73    -3.61    -0.75 1.00     2576     2882
ats_baseline:yearsFromStart       -0.81      0.09    -0.98    -0.64 1.00     1898     2440
ats_baseline:IyearsFromStartE2     0.66      0.10     0.47     0.86 1.00     2349     2951

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     3.92      0.42     3.16     4.79 1.00     2107     2725
zi      0.10      0.03     0.06     0.16 1.00     2753     2899

All ESS and Rhat values are good, and the diagnostic plots look fine.

I wondered if a random-slopes version of the model would work, with the only change being that (1 | encID) in the formula becomes (yearsFromStart | encID). I ran that model and got the following output:

Multilevel Hyperparameters:
~encID (Number of levels: 695) 
                              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                     0.69      0.09     0.51     0.88 1.00     2319     2949
sd(yearsFromStart)                6.15      0.73     4.85     7.63 1.00      477     1260
cor(Intercept,yearsFromStart)     0.92      0.07     0.76     1.00 1.06       61      349

Regression Coefficients:
                               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                         -4.17      0.12    -4.41    -3.94 1.00     2566     3181
ats_baseline                       0.31      0.01     0.28     0.34 1.00     2887     3136
yearsFromStart                    -1.80      0.90    -3.62    -0.02 1.00     1179     2205
IyearsFromStartE2                 -4.51      0.85    -6.16    -2.93 1.00     2685     3065
ats_baseline:yearsFromStart       -0.71      0.10    -0.91    -0.51 1.00     1608     2293
ats_baseline:IyearsFromStartE2     0.86      0.11     0.64     1.08 1.00     2820     2696

Further Distributional Parameters:
    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi     5.91      0.68     4.70     7.30 1.00     1312     2542
zi      0.06      0.02     0.02     0.10 1.00     2523     2117

This came with the warning:

Warning message: Parts of the model have not converged (some Rhats are > 1.05). Be careful when analysing the results! We recommend running more iterations and/or setting stronger priors.

The obvious culprits are the multilevel hyperparameters. I am wondering if anyone has advice on how to specify priors for these hyperparameters more intelligently than I have.
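(Not from the thread, but for concreteness: one common way to stabilize hierarchical sd/correlation parameters in brms is to swap the heavy-tailed half-Cauchy for a half-normal on the sd's and to put an LKJ prior on the correlation matrix. The scales below are illustrative assumptions, not recommendations; they would need prior-predictive checking against these data.)

```r
# Hypothetical tighter priors for the random-slopes model (sketch only):
# half-normal instead of cauchy(0, 3) for the sd's, and LKJ(2) to pull
# the Intercept/slope correlation gently away from +/-1.
library(brms)

priors_rs <- c(
  prior(normal(0, 3), class = Intercept),
  prior(normal(0, 4), class = b),
  prior(normal(0, 1), class = sd),   # lighter tails than cauchy(0, 3); scale is an assumption
  prior(lkj(2), class = cor)         # mild shrinkage of correlations toward 0
)
```

Note that with sd(yearsFromStart) estimated around 6 on the logit scale, normal(0, 1) may be far too tight for that parameter; the point is the shape of the prior, not the particular scale.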

Hello @llewmills, perhaps the issue here is related to the quadratic form?

In this setup you have two terms involving the predictor yearsFromStart, but in the varying slopes model, only the first term varies by encID, right? But the systematic effect of yearsFromStart is quadratic, not linear, so what does ‘varying slope’ represent in context?

Thank you for responding @AWoodward. So are you saying that in a varying-slope model, if there is a quadratic term among the population-level effects, there should also be one among the group-level effects? If so, I was not aware of that. Does that mean the random-effects portion of the model specification should be + (yearsFromStart + I(yearsFromStart^2) | id) in my model specification?

Well, personally I think that this form of varying-slopes model with a polynomial predictor doesn’t generally make sense, because of the linear dependence of the terms. The total effect of your yearsFromStart is a simultaneous function of both terms, so the actual shape of the effect implies strong joint constraints on the coefficients, which suggests that a multivariate normal prior distribution on the id-level parameters is not reasonable. We should then expect very strong correlation between these estimates, which may lead to sampling problems.

I suspect that visualizing the conditional effects of yearsFromStart at the id level from this model may show that the variability in the coefficient for the first term doesn’t map clearly to variation in the form of the relationship that you’re intending to represent. If you’re convinced that you need some nonlinear function to capture this relationship, and that its form differs between subjects, then you might be better off with some form of smoothing.
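The linear dependence is easy to see numerically: over the first year of treatment (yearsFromStart roughly in [0, 1]), the raw linear and quadratic terms are nearly collinear as predictors. A quick check in R, using an assumed evenly spaced grid since the real data aren't shown:

```r
# Over [0, 1], x and x^2 carry almost the same information as predictors
x <- seq(0, 1, length.out = 100)
cor(x, x^2)   # ~0.97: near-collinear raw terms
```

With correlation that high, per-subject coefficients on the two terms cannot vary freely of one another, which is exactly the joint constraint described above.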


“The total effect of your yearsFromStart is a simultaneous function of both terms, so the actual shape of the effect implies strong joint constraints on the coefficients, which suggests that a multivariate normal prior distribution on the id-level parameters is not reasonable.”

That makes a lot of sense actually @AWoodward. I am not really convinced that I need a non-linear function to capture the relationship - by which I assume you mean the quadratic term - or anything else. When I treated the days-used outcome as a numeric outcome, the problem with the linear term on its own was that it would yield predictions below 0, which does not make sense for a count. The binomial, and later beta-binomial, regressions do not have that problem: predictions have a hard floor of 0, and the binomial and beta-binomial models consistently outperform the gaussian.

The reason I moved on to the quadratic model after the linear one was that it seems plausible that overall trends in drug use would be non-linear, starting with a steep slope but asymptoting as use gets closer to 0 days. But I am not certain of this. And an issue with the quadratic model is that predictions at higher levels of ats_baseline have a steeper initial downward slope, reach a minimum, and then rise again. I cannot tell whether that rise is something occurring in reality or the single turn in the quadratic ‘forcing’ the predictions back up again. Maybe an exponential model would work better.

As you may have guessed, the model with the + (yearsFromStart + I(yearsFromStart^2) | id) random portion was very unstable (I ran it today). Given the issues you mentioned, do you think just keeping the random effects at (1 | id) would be the best strategy? You mentioned ‘smoothing’. I am not really sure what that is.

As you can see I am not certain of anything.

Yes, it seems reasonable to me at face value that a simple linear effect may well be too simple in this case.

It’s not for me to say whether using the simpler varying-intercept model would be reasonable. It depends on how much variation there is between subjects in the form of this effect, and that may be difficult to know in advance. You could do some residual analysis on this model to assess whether there is evidence of any systematic lack of fit at the subject level. It seems plausible that there would be between-subject variation in the trajectory that you would want to accommodate here.

Something like a hierarchical additive model (smoothing) might be relevant here, which would relax the requirement to specify the form of the relationship in advance. There are various discussions of these models on the forum. I’ve previously recommended this paper as a good introduction (Hierarchical generalized additive models in ecology: an introduction with mgcv [PeerJ]); this uses ‘mgcv’ but there is good support for these models in brms. This may be a good option especially if you have a decent number of observations for each subject.
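For concreteness, a hierarchical additive version of this model in brms might look something like the sketch below. This is an untested illustration, not a specification from the thread: the smooth terms and the `bs = "fs"` basis choice are assumptions, following the mgcv-style factor-smooth setup from the paper above (a global wiggly trend plus per-subject wiggly deviations shrunk toward it).

```r
# Hypothetical hierarchical GAM sketch (untested): global smooth of
# yearsFromStart plus subject-level factor smooths, via mgcv-style
# terms in brms. encID must be a factor for bs = "fs" to work.
library(brms)

fit_hgam <- brm(
  bf(outcome | trials(set) ~ ats_baseline +
       s(yearsFromStart) +                     # population-level smooth trend
       s(yearsFromStart, encID, bs = "fs")),   # per-subject smooth deviations
  data = workingDF,
  family = zero_inflated_beta_binomial(),
  cores = 4
)
```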

Another option to deal with the correlation between yearsFromStart and I(yearsFromStart^2) would be to use orthogonal polynomials, e.g., poly(yearsFromStart, 2). In this case, the linear and quadratic terms would be uncorrelated.

Also (so far as I understand it), in such a model you could have random slopes only for yearsFromStart, to capture variation in the linear portion of the effect.
Or, of course, you could include both the linear and quadratic terms in the random-slopes part, also as orthogonal polynomials.
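In brms formula syntax, the version with both orthogonal terms varying by subject could be sketched as follows (an untested assumption about the intended formula, not something run on these data):

```r
# Hypothetical formula: orthogonal polynomials in both the
# population-level and group-level parts (untested sketch)
bf(outcome | trials(set) ~ ats_baseline * poly(yearsFromStart, 2) +
     (poly(yearsFromStart, 2) | encID))
```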

Thank you for the input @jverissimo. What is the difference between I(yearsFromStart^2) and poly(yearsFromStart,2)?

The difference is that poly() ensures that the linear and quadratic terms have zero correlation, which allows you to interpret them as the actual linear and quadratic parts of the yearsFromStart effect.

Note that you should use poly(yearsFromStart, 2) only, without adding the linear predictor as well, because poly() already expands into two predictors (the linear and quadratic terms) in your model.
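You can verify the zero correlation directly in R. Using an assumed grid of values, since the real yearsFromStart data aren't shown:

```r
x <- seq(0, 1, length.out = 100)
p <- poly(x, 2)        # orthogonal linear and quadratic columns
cor(x, x^2)            # raw terms: strongly correlated
cor(p[, 1], p[, 2])    # orthogonal terms: essentially zero
```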

Thank you for the tips. Much appreciated. So the specification would be formula = outcome ~ ats_baseline * poly(yearsFromStart, 2) + (yearsFromStart | encID)?