# Subsetting missing data models for second-level predictor

I have missing values on a second-level, i.e., group-level, predictor in a hierarchical model. I can create a simultaneous model for this variable as a response using the missing data syntax in `brms` and essentially do on-the-fly imputation. I mainly want to confirm that, in this situation, the correct way to write the model is to use `subset()`, `index()`, and `mi(x , idx)` to subset and index the second-level predictor when it appears as the response variable. I can also imagine what I think would be two mis-specified models, which I will also show.

An example may help illustrate. I’m using data from the UCLA stats website of test scores. Observations are grouped within schools (`schid`). The outcome is math scores (`math`) predicted by homework hours (`homework`), student-teacher ratio (`ratio`), and school socioeconomic-status (SES; `meanses`). I introduce missingness on student-teacher ratio for one school. My model for student-teacher ratio uses school SES as a predictor.

First, the header to make it a reproducible example:

``````library(brms)
library(haven)
set.seed(1)
missing_school_var <- sample(unique(mlmdata\$schid), size = 1)
mlmdata\$ratio[mlmdata\$'schid' == missing_school_var] <- NA
mlmdata\$schid <- factor(mlmdata\$schid)
mlmdata\$unique_ob <- !duplicated(mlmdata\$schid)
mlmdata\$unique_ob_id <- cumsum(mlmdata\$unique_ob)
``````

Correct subset model:

``````formula_1 <- bf(math ~ 1 + homework + mi(ratio, idx = unique_ob_id) + meanses + (1 + homework | schid), family = gaussian()) +
bf(ratio | mi() + subset(unique_ob) + index(unique_ob_id) ~ 1 + meanses, family = gaussian()) +
set_rescor(rescor = FALSE)
fit_1 <- brm(formula = formula_1, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_1',
backend = 'cmdstanr')
``````

Over-confident standard errors for missing variable model (leads to extremely slow sampling):

``````formula_2 <- bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework | schid), family = gaussian()) +
bf(ratio | mi() ~ 1 + meanses, family = gaussian()) +
set_rescor(rescor = FALSE)
fit_2 <- brm(formula = formula_2, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_2',
backend = 'cmdstanr')
``````

No random intercept variance in missing variable model:

``````formula_3 <- bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework | schid), family = gaussian()) +
bf(ratio | mi() ~ 1 + meanses + (1 | schid), family = gaussian()) +
set_rescor(rescor = FALSE)
fit_3 <- brm(formula = formula_3, data = mlmdata, chains = 4, cores = 4,
control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_3',
backend = 'cmdstanr')
``````

Versions:

``````> packageVersion('brms')
 ‘2.17.7’
> packageVersion('cmdstanr')
 ‘0.5.3’
> cmdstanr::cmdstan_version()
 "2.30.1"
``````

1. A variable that is constant across subsets of the data as defined by a group-id variable ↩︎