Subsetting missing data models for second-level predictor

jflournoy · February 22, 2023, 8:20pm

I have missing values on a second-level, i.e., group-level, predictor in a hierarchical model^[1]. I can create a simultaneous model for this variable as a response using the missing data syntax in brms and essentially do on-the-fly imputation. I mainly want to confirm that, in this situation, the correct way to write the model is to use subset(), index(), and mi(x , idx) to subset and index the second-level predictor when it appears as the response variable. I can also imagine what I think would be two mis-specified models, which I will also show.

An example may help illustrate. I’m using data from the UCLA stats website of test scores. Observations are grouped within schools (schid). The outcome is math scores (math) predicted by homework hours (homework), student-teacher ratio (ratio), and school socioeconomic-status (SES; meanses). I introduce missingness on student-teacher ratio for one school. My model for student-teacher ratio uses school SES as a predictor.

First, the header to make it a reproducible example:

library(brms)
library(haven)
mlmdata <- as.data.frame(read_dta("https://stats.idre.ucla.edu/stat/examples/imm/imm10.dta"))
set.seed(1)
missing_school_var <- sample(unique(mlmdata$schid), size = 1)
mlmdata$ratio[mlmdata$'schid' == missing_school_var] <- NA
mlmdata$schid <- factor(mlmdata$schid)
mlmdata$unique_ob <- !duplicated(mlmdata$schid)
mlmdata$unique_ob_id <- cumsum(mlmdata$unique_ob)

Correct subset model:

formula_1 <- bf(math ~ 1 + homework + mi(ratio, idx = unique_ob_id) + meanses + (1 + homework | schid), family = gaussian()) + 
  bf(ratio | mi() + subset(unique_ob) + index(unique_ob_id) ~ 1 + meanses, family = gaussian()) + 
  set_rescor(rescor = FALSE)
fit_1 <- brm(formula = formula_1, data = mlmdata, chains = 4, cores = 4, 
             control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_1',
             backend = 'cmdstanr')

Over-confident standard errors for missing variable model (leads to extremely slow sampling):

formula_2 <- bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework | schid), family = gaussian()) + 
  bf(ratio | mi() ~ 1 + meanses, family = gaussian()) + 
  set_rescor(rescor = FALSE)
fit_2 <- brm(formula = formula_2, data = mlmdata, chains = 4, cores = 4, 
           control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_2',
           backend = 'cmdstanr')

No random intercept variance in missing variable model:

formula_3 <- bf(math ~ 1 + homework + mi(ratio) + meanses + (1 + homework | schid), family = gaussian()) + 
  bf(ratio | mi() ~ 1 + meanses + (1 | schid), family = gaussian()) + 
  set_rescor(rescor = FALSE)
fit_3 <- brm(formula = formula_3, data = mlmdata, chains = 4, cores = 4, 
           control = list(adapt_delta = .999, max_treedepth = 20), file = 'fit_3',
           backend = 'cmdstanr')

Versions:

> packageVersion('brms')
[1] ‘2.17.7’
> packageVersion('cmdstanr')
[1] ‘0.5.3’
> cmdstanr::cmdstan_version()
[1] "2.30.1"

A variable that is constant across subsets of the data as defined by a group-id variable ↩︎

Topic		Replies	Views
Missing data in hierarchical linear model Modeling specification , performance	5	1932	December 3, 2019
Missing categorical group data Modeling brms , missing-data	1	540	January 22, 2023
Missing data with brms mi() brms missing-data	2	803	June 5, 2020
How to handle missing group ids in hierarchical model with group level effect brms hierarchical-model , brms , missing-data	3	64	October 7, 2024
Estimating missing response data in multivariate model brms	4	761	November 24, 2019

Subsetting missing data models for second-level predictor

Related topics