I have missing values on a second-level, i.e., group-level, predictor in a hierarchical model[1]. I can create a simultaneous model for this variable as a response using the missing data syntax in brms
and essentially do on-the-fly imputation. I mainly want to confirm that, in this situation, the correct way to write the model is to use subset()
, index()
, and mi(x , idx)
to subset and index the second-level predictor when it appears as the response variable. I can also imagine what I think would be two mis-specified models, which I will also show.
An example may help illustrate. I’m using data from the UCLA stats website of test scores. Observations are grouped within schools (schid
). The outcome is math scores (math
) predicted by homework hours (homework
), student-teacher ratio (ratio
), and school socioeconomic-status (SES; meanses
). I introduce missingness on student-teacher ratio for one school. My model for student-teacher ratio uses school SES as a predictor.
First, the header to make it a reproducible example:
library(brms)
library(haven)
mlmdata <- as.data.frame(read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta"))
set.seed(1)
missing_school_var <- sample(unique(mlmdata$schid), size = 1)
mlmdata$ratio[mlmdata$'schid' == missing_school_var] <- NA
mlmdata$schid <- factor(mlmdata$schid)
mlmdata$unique_ob <- !duplicated(mlmdata$schid)
mlmdata$unique_ob_id <- cumsum(mlmdata$unique_ob)
Correct subset model:
formula_1 <- bf(math ~ 1 + homework + mi(ratio, idx = unique_ob_id) + meanses + (1 + homework | schid), family = gaussian()) +
bf(ratio | mi() + subset(unique_ob) + index(unique_ob_id) ~ 1 + meanses, family = gaussian()) +
set_rescor(rescor = FALSE)
fit_1 <- brm(formula = formula_1, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_1',
backend = 'cmdstanr')
Over-confident standard errors for missing variable model (leads to extremely slow sampling):
formula_2 <- bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework | schid), family = gaussian()) +
bf(ratio | mi() ~ 1 + meanses, family = gaussian()) +
set_rescor(rescor = FALSE)
fit_2 <- brm(formula = formula_2, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_2',
backend = 'cmdstanr')
No random intercept variance in missing variable model:
formula_3 <- bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework | schid), family = gaussian()) +
bf(ratio | mi() ~ 1 + meanses + (1 | schid), family = gaussian()) +
set_rescor(rescor = FALSE)
fit_3 <- brm(formula = formula_3, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_3',
backend = 'cmdstanr')
Versions:
> packageVersion('brms')
[1] ‘2.17.7’
> packageVersion('cmdstanr')
[1] ‘0.5.3’
> cmdstanr::cmdstan_version()
[1] "2.30.1"
-
A variable that is constant across subsets of the data as defined by a group-id variable ↩︎