I’m trying to fit a multilevel logistic regression model but encounter divergent transitions. I suspect that something is wrong with my model specification, or that the model is simply not identifiable from the data. Since I’m still working out the analysis design for an experiment, I’m currently working purely with simulated data.
The data will come from an experiment. The outcome variable is binary, and the treatment variable (intervention) is also binary. The experiment will be run across a handful of sites with different sample sizes. If available, we might also incorporate further categorical predictors, but these are currently not incorporated.
The base of the model reads like this:
base_model <- brm(outcome ~ intervention + (1 + intervention | site),
                  family = brms::bernoulli())
I’m primarily interested in the overall treatment effect across all study sites. However, it would also be interesting to investigate individual slopes for the study sites, depending on the sample size we achieve. This is why I opted for varying slopes and intercepts.
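Concretely, this is roughly how I plan to extract those quantities from a fitted model like fit1 below (a sketch using marginaleffects, which is loaded in the simulation code):

# average treatment effect on the probability scale, averaged over the sites
avg_comparisons(fit1, variables = "intervention")

# site-specific intercepts and slopes (posterior summaries of the varying effects)
coef(fit1)$site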
To avoid the divergences, I have increasingly tightened the priors. The current ones are these:
c(
  prior(normal(0, 1), class = b),
  prior(normal(0, .2), class = sd),
  prior(normal(-1, 1), class = Intercept),
  prior(lkj_corr_cholesky(4), class = L)
)
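As a sanity check that these classes match what the model actually exposes, one can compare against get_prior() (a sketch, using the simulated data defined below):

get_prior(outcome ~ intervention + (1 + intervention | site),
          data = simulated_data, family = brms::bernoulli())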
This has helped in part, but not in full. I repeated the process of simulating data from the “true” parameters and running the analysis 200 times, and the intervals and estimates seem reasonably well calibrated (13/200 = 6.5% of the 95% intervals don’t contain the true parameter value).
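The coverage check looks roughly like this (a sketch, not my full script; simulate_data() is a hypothetical wrapper around the data-generating code below, and the interval is taken on the same probability scale as true_effect):

coverage <- purrr::map_lgl(seq_len(200), function(i) {
  # simulate_data(): hypothetical wrapper around the data-generating code below
  fit <- update(fit1, newdata = simulate_data(), cores = 4, refresh = 0)
  ci <- avg_comparisons(fit, variables = "intervention")
  ci$conf.low <= true_effect & true_effect <= ci$conf.high
})
mean(!coverage)  # proportion of 95% intervals missing the true effect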
However, I still encounter divergences, especially when simulating larger datasets.
Is there something grossly wrong with how I’m trying to model the data? Or might there simply be insufficient data to identify such a model?
For reference, this is the data-generating code:
library(dplyr)
library(purrr)
library(brms)
library(marginaleffects)
# draw n Bernoulli samples with success probability p (returns a logical
# vector); note that this masks brms::bernoulli(), hence the explicit
# brms::bernoulli() in the model calls
bernoulli <- function(n, p) {
  runif(n) < p
}
parameters <- tibble(
  site = LETTERS[1:5] %>% rep(each = 2),
  intervention = rep(c(TRUE, FALSE), 5)
) %>%
  mutate(
    # total sample size per site
    n = case_when(
      site == "A" ~ 30,
      site == "B" ~ 300,
      site == "C" ~ 50,
      site == "D" ~ 20,
      site == "E" ~ 100
    ),
    # baseline outcome probability per site
    intercept = case_when(
      site == "A" ~ .1,
      site == "B" ~ .15,
      site == "C" ~ .01,
      site == "D" ~ .05,
      site == "E" ~ .1
    ),
    # additive treatment effect on the probability scale
    effect = case_when(
      site == "A" & intervention ~ 0,
      site == "B" & intervention ~ .1,
      site == "C" & intervention ~ .2,
      site == "D" & intervention ~ -.1,
      site == "E" & intervention ~ .1,
      TRUE ~ 0
    ),
    prob = intercept + effect
  )
# true effect: sample-size-weighted average of the site-level effects
# (on the probability scale, i.e. an average risk difference)
true_effect <- parameters %>%
  filter(intervention) %>%
  summarise(avg_effect = weighted.mean(effect, n)) %>%
  pull(avg_effect)
The following seed happens to work fine with the given priors, but other seeds produce divergences.
set.seed(0297346)
simulated_data <- parameters %>%
  mutate(
    # n is the total sample size per site, so we divide by two to get the
    # sample size for each condition
    outcome = map2(n, prob, ~ bernoulli(.x / 2, .y))
  ) %>%
  unnest(outcome)
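A quick sanity check on the result:

# each site should end up with n / 2 rows per condition
simulated_data %>% count(site, intervention)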
fit1 <- brm(outcome ~ intervention + (1 + intervention | site),
            data = simulated_data, family = brms::bernoulli(),
            prior = c(
              prior(normal(0, 1), class = b),
              prior(normal(0, .5), class = sd),
              prior(normal(-1, 1), class = Intercept),
              prior(lkj_corr_cholesky(4), class = L)
            ),
            cores = 4)
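For completeness, this is roughly how I inspect the divergences after fitting, and the kind of refit with a higher adapt_delta I would try next (a sketch):

# number of divergent transitions across all chains
np <- nuts_params(fit1)
sum(subset(np, Parameter == "divergent__")$Value)

# smaller step sizes; slower, but can resolve mild divergences
fit2 <- update(fit1, control = list(adapt_delta = 0.99), cores = 4)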