I have missing values on a secondlevel, i.e., grouplevel, predictor in a hierarchical model^{[1]}. I can create a simultaneous model for this variable as a response using the missing data syntax in brms
and essentially do onthefly imputation. I mainly want to confirm that, in this situation, the correct way to write the model is to use subset()
, index()
, and mi(x , idx)
to subset and index the secondlevel predictor when it appears as the response variable. I can also imagine what I think would be two misspecified models, which I will also show.
An example may help illustrate. I’m using data from the UCLA stats website of test scores. Observations are grouped within schools (schid
). The outcome is math scores (math
) predicted by homework hours (homework
), studentteacher ratio (ratio
), and school socioeconomicstatus (SES; meanses
). I introduce missingness on studentteacher ratio for one school. My model for studentteacher ratio uses school SES as a predictor.
First, the header to make it a reproducible example:
library(brms)
library(haven)
mlmdata < as.data.frame(read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta"))
set.seed(1)
missing_school_var < sample(unique(mlmdata$schid), size = 1)
mlmdata$ratio[mlmdata$'schid' == missing_school_var] < NA
mlmdata$schid < factor(mlmdata$schid)
mlmdata$unique_ob < !duplicated(mlmdata$schid)
mlmdata$unique_ob_id < cumsum(mlmdata$unique_ob)
Correct subset model:
formula_1 < bf(math ~ 1 + homework + mi(ratio, idx = unique_ob_id) + meanses + (1 + homework  schid), family = gaussian()) +
bf(ratio  mi() + subset(unique_ob) + index(unique_ob_id) ~ 1 + meanses, family = gaussian()) +
set_rescor(rescor = FALSE)
fit_1 < brm(formula = formula_1, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_1',
backend = 'cmdstanr')
Overconfident standard errors for missing variable model (leads to extremely slow sampling):
formula_2 < bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework  schid), family = gaussian()) +
bf(ratio  mi() ~ 1 + meanses, family = gaussian()) +
set_rescor(rescor = FALSE)
fit_2 < brm(formula = formula_2, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_2',
backend = 'cmdstanr')
No random intercept variance in missing variable model:
formula_3 < bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework  schid), family = gaussian()) +
bf(ratio  mi() ~ 1 + meanses + (1  schid), family = gaussian()) +
set_rescor(rescor = FALSE)
fit_3 < brm(formula = formula_3, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_3',
backend = 'cmdstanr')
Versions:
> packageVersion('brms')
[1] ‘2.17.7’
> packageVersion('cmdstanr')
[1] ‘0.5.3’
> cmdstanr::cmdstan_version()
[1] "2.30.1"

A variable that is constant across subsets of the data as defined by a groupid variable ↩︎