Hello! I don’t have much experience with ZINB models and am unsure about prior selection for them. My dataset consists mostly of categorical predictors, except for Year (2015–2023), which is centered on 2019. The data is highly zero-inflated, with 86% of observations being zeros.
For the zero-inflation (zi) priors, I followed this guide: I set the mean of the zi intercept prior to the share of zeros in the data (on the logit scale) and the mean of the main intercept prior to the mean of the non-zero counts (on the log scale). However, a prior predictive check on the proportion of zeros looked better after two changes: using the proportion of zeros that come from IDs that always have zeros for the zi mean, and using the mean count among IDs with at least one non-zero for the intercept mean.
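Since I cannot share the data, here is a minimal sketch of how I understand those two sets of summaries, computed on a made-up data frame. The data frame `toy`, its values, and the helper names are invented for illustration; only the column name `Applications` comes from my models, and `ID` stands in for whatever identifies a unit. Reading "zeros from IDs that always have zeros" as the share of observations that belong to always-zero IDs is an assumption.

```r
# Toy data (hypothetical): 4 IDs with 3 observations each.
toy <- data.frame(
  ID = rep(1:4, each = 3),
  Applications = c(0, 0, 0,  0, 2, 0,  0, 0, 0,  1, 0, 3)
)

# First approach: overall share of zeros and mean of the non-zero counts.
share_zero_all <- mean(toy$Applications == 0)
mean_nonzero   <- mean(toy$Applications[toy$Applications > 0])

# Adjusted approach: flag observations whose ID never has a non-zero count.
always_zero <- ave(toy$Applications, toy$ID, FUN = max) == 0

# Share of observations that belong to always-zero IDs (assumed reading).
share_always_zero <- mean(always_zero)

# Mean count among IDs with at least one non-zero (their zeros included).
mean_active <- mean(toy$Applications[!always_zero])

# Prior means on the link scales used by the model:
# logit for the zi intercept, log for the count intercept.
zi_intercept_mean    <- qlogis(share_always_zero)
count_intercept_mean <- log(mean_active)
```

On the real data, the same steps are where the values 0.64 and 0.78 in the priors below are meant to come from.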
While this adjustment reproduced the proportion of zeros well, the pp_check plots still showed implausibly large values on the x-axis. The maximum Applications value in the data is 20, yet the prior predictive draws sometimes reached the tens of thousands. Experimenting with the prior standard deviations and the shape prior gave reasonable pp_check results for the three-way interaction model. However, when I added time to the interaction, the issue reappeared, though less severely (values only in the hundreds).
I find the shape parameter particularly challenging to understand and specify correctly. Unfortunately, I cannot share the data as it is confidential. Any insights or suggestions would be greatly appreciated!
Here is the first, 3-way model:
# Sum-to-zero contrasts for the multi-level factors
contrasts(data$Grade) <- contr.sum(length(levels(data$Grade)))
contrasts(data$School) <- contr.sum(length(levels(data$School)))
model_app_gps <- brm(
  formula = bf(
    Applications ~ Gender * Grade * School + offset(log(contract_length)),
    zi ~ Gender + Grade + School + log(contract_length)),
  family = zero_inflated_negbinomial(),
  data = data,
  # Draw from the priors only (prior predictive check)
  sample_prior = "only",
  prior = c(
    # Count part (log link)
    prior(normal(log(0.78), 0.1), class = "Intercept"),
    prior(normal(0, 0.1), class = "b"),
    # Zero-inflation part (logit link)
    prior(normal(logit(0.64), 0.1), class = "Intercept", dpar = "zi"),
    prior(normal(0, 0.1), class = "b", dpar = "zi"),
    # Negative binomial shape (overdispersion)
    prior(gamma(1, 0.1), class = "shape")),
  chains = 4,
  iter = 2000,
  seed = 123,
  cores = 4)
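To see what the intercept, zi, and shape priors imply on their own, they can also be simulated outside brms. The following is a rough sketch in base R, not the model itself: it assumes every coefficient is zero and `contract_length = 1` (so the offset vanishes), and the names `n_draws`, `n_obs`, and `sim_one` are made up here. It relies on brms parameterising the negative binomial by its mean `mu` and `shape`, with variance `mu + mu^2 / shape`, which corresponds to `rnbinom(size = shape, mu = mu)`.

```r
set.seed(123)

n_draws <- 4000  # number of prior draws (assumed)
n_obs   <- 1000  # simulated observations per draw (assumed)

# Draws from the priors used in the model above.
b0    <- rnorm(n_draws, mean = log(0.78), sd = 0.1)     # count intercept
zi0   <- rnorm(n_draws, mean = qlogis(0.64), sd = 0.1)  # zi intercept
shape <- rgamma(n_draws, shape = 1, rate = 0.1)         # NB shape

# Simulate one data set from a zero-inflated negative binomial.
sim_one <- function(mu, zi, shape, n) {
  structural_zero <- rbinom(n, size = 1, prob = zi) == 1
  counts <- rnbinom(n, size = shape, mu = mu)
  counts[structural_zero] <- 0
  counts
}

# For each prior draw, record the share of zeros and the largest count.
summ <- t(mapply(function(b, z, s) {
  y <- sim_one(exp(b), plogis(z), s, n_obs)
  c(prop_zero = mean(y == 0), max_count = max(y))
}, b0, zi0, shape))

# Prior predictive summaries.
quantile(summ[, "prop_zero"], c(0.05, 0.5, 0.95))
quantile(summ[, "max_count"], c(0.5, 0.95, 1))
```

Because it leaves out the offset and all coefficients, this only isolates the effect of these three priors. Small shape draws inflate the variance through the `mu^2 / shape` term, which is one way long right tails can arise, but it is no substitute for pp_check on the full model.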
And the 4-way model:
# Sum-to-zero contrasts for the multi-level factors
contrasts(data$Grade) <- contr.sum(length(levels(data$Grade)))
contrasts(data$School) <- contr.sum(length(levels(data$School)))
model_app_gpst <- brm(
  formula = bf(
    Applications ~ Gender * Grade * School * Year_of_app_centered + offset(log(contract_length)),
    zi ~ Gender + Grade + School + log(contract_length)),
  family = zero_inflated_negbinomial(),
  data = data,
  # Draw from the priors only (prior predictive check)
  sample_prior = "only",
  prior = c(
    # Count part (log link)
    prior(normal(log(0.78), 0.1), class = "Intercept"),
    prior(normal(0, 0.1), class = "b"),
    # Zero-inflation part (logit link)
    prior(normal(logit(0.64), 0.1), class = "Intercept", dpar = "zi"),
    prior(normal(0, 0.1), class = "b", dpar = "zi"),
    # Negative binomial shape (overdispersion)
    prior(gamma(1, 0.1), class = "shape")),
  chains = 4,
  iter = 2000,
  seed = 123,
  cores = 4)
These are the pp_check plots from the two models above, in order: