How to use mi() + subset() + index()

I just can’t wrap my head around how to properly do this with brms.

In the help documentation for mi() we have the example:

data("nhanes", package = "mice")
N <- nrow(nhanes)
# ... #
# 'mi' terms can also be used when some responses are subsetted
nhanes$sub <- TRUE
nhanes$sub[1:2] <- FALSE
nhanes$id <- 1:N
nhanes$idx <- sample(3:N, N, TRUE)

# this requires the addition term 'index' being specified
# in the subsetted part of the model
bform3 <- bf(bmi | mi() ~ age * mi(chl, idx)) +
  bf(chl | mi(se) + subset(sub) + index(id) ~ age) +
  set_rescor(FALSE)

But the idx column seems arbitrary to me. Why is it just a bunch of random numbers between 3 and 10? What does the model actually see?

I’m interested in this because I have a response variable (y_est in reprex below) which I know is an underestimate of the real value, because for a small subset of individuals I have a much better estimate (“y_true”), and because there is a prior expectation that the measurement method we use has a downward bias. However, for some of the individuals with y_true measures I’m not interested in how it varies with our predictors of interest (they’re different but related species). So I’d like to use all the available information to impute (read “bias correct”) y_est, and then use a subset of the imputed values to test for correlations with the x variables.

For good measure, I also have different groups for which I expect the corelations to vary. This isn’t necessary for the reprex but it’s closer to my actual data.

Here a reprex:

library(tidyverse)

set.seed(78)

N <- 100
b1_est_grpA <- -10
b1_est_grpB <- -7
b1_diff <- b1_est_grpA - b1_est_grpB
y_bias <- 5

n_exclude <- 5
n_true <- 10

group <- c(letters[1:n_exclude],
           rep("grpA",floor((n_true-n_exclude)/2)),
           rep("grpB",ceiling((n_true-n_exclude)/2)),
           rep("grpA",floor((N-n_true)/2)),
           rep("grpB",ceiling((N-n_true)/2)))

grpA_b0 <- 75
grpB_b0 <- 50
other_b0 <- mean(c(grpA_b0,grpB_b0))

b0_diff <- grpA_b0 - grpB_b0
is_grpA <- as.numeric(group == "grpA")

x <- runif(N)
y_est <- grpB_b0 + is_grpA*b0_diff +    # varying intercept by group
  (b1_est_grpB + is_grpA*b1_diff)*x +   # varying slope by group
  rnorm(N,0,1.5)                        # error
y_est[1:n_exclude] <- rnorm(n_exclude,other_b0,10) # change the excluded ones to some unrelated values
plot(y_est~x)
points(y_est[1:n_exclude] ~ x[1:n_exclude],col="red",pch=16)

# "true" y is available for some samples, with a bit of error (sigma = 10)
y_true <- c( (y_bias*y_est[1:n_true] + rnorm(n_true,0,10)),
             rep(NA,N-n_true))

d <- data.frame(x=x,y_est=y_est,y_true=y_true,group=group)
d$is_ingroup<-d$group %in% c("grpA","grpB")
d$has_true<- !is.na(d$y_true)

head(d,15)

plot(y_true ~ y_est, data = d)
points(y_true ~ y_est, data = d[!d$is_ingroup,],col="red",pch=16)
cor.test(y_est,y_true)

## Model ----
## We want to 
## 1. predict missing y_true based on y_est, then
## 2  model y_true ~ 1 + x + (1 + x | group) but 
##      *only* for the observations in the "ingroup"

library(brms)
library(cmdstanr)
set_cmdstan_path("~/.cmdstan/cmdstan-2.35.0")

# We can think of y_est as a function of y_true
# Can we use mi() and subset() to get the model to look at the data of interest?

d$idx <- 1:nrow(d)
m4 <- brm(
  bf(y_est ~ mi(y_true,idx = idx)) +
    bf(y_true | mi() + subset(is_ingroup) + index(idx) ~ x*group) +
    set_rescor(FALSE),
  d,
  backend = "cmdstanr",
  cores=4
)

# Above returns the error:
# Error: Could not match all indices in response 'ytrue'.

Thanks in advance!

  • Operating System: Windows 11
  • brms Version: 2.22.0

Unfortunately, I don’t have an answer. I’d be curious to know what the resp_index() function does, however. I’ve never seen it used, and the documentation in the brms reference manual is not very informative.

Hi Solomon, thanks for your response!

I’ve tried various other options (dumped below) but so far nothing achieves what I’d hoped. However, I wonder if this is the appropriate use of the subset() + index() syntax, specifically because the “target” of the second formula isn’t a subset of that of the first. This means that when trying to match the indices of formula 1 to those of formula 2 we have NAs.

The distributed nature of the brms source code makes it hard to do effective sleuthing (at least for me!), but I think this is relevant: brms/R/data-predictor.R at f024203a3df26206a4a281c209bbdd11ab43c94f · paul-buerkner/brms · GitHub

I keep wondering if I’m missing something blindingly obvious…

Herewith dump of things that didn’t work:

# Try something else...
d$id <- 1:nrow(d)
d$idx <- cumsum(d$is_ingroup)
m5 <- brm(
  bf(y_est ~ mi(y_true,idx = idx)) +
    bf(y_true | mi() + subset(is_ingroup) + index(id) ~ x*group) +
    set_rescor(FALSE),
  d,
  backend = "cmdstanr",
  cores=4
)
# doesn't work


## Try with weights?

d$weight_ingroup<-as.numeric(d$is_ingroup)
m6 <- brm(
  bf(y_est ~ mi(y_true)) +
    bf(y_true | mi() + weights(weight_ingroup) ~ x*group) +
    set_rescor(FALSE),
  d,
  backend = "cmdstanr",
  cores=4,
  drop_unused_levels = FALSE
)
m6
plot(conditional_effects(m6),points=T)
# This doesn't work because the model still sees the outgroup's group IDs

d$weight_has_true<-as.numeric(d$has_true)
d$group_in <- ifelse(d$group %in% c("grpA","grpB"),d$group,NA) 
m7 <- brm(
  bf(y_est | weights(weight_has_true) ~ mi(y_true)) +
    bf(y_true | mi() + weights(weight_ingroup) ~ x*group_in) +
    set_rescor(FALSE),
  d,
  backend = "cmdstanr",
  cores=4,
  drop_unused_levels = FALSE
)
m7
plot(conditional_effects(m7),points=T)

# try mixing subset and weights
m8 <- brm(
  bf(y_est | subset(weight_has_true) ~ mi(y_true,idx = id)) +
    bf(y_true | mi() + weights(weight_ingroup) ~ x*group_in) +
    set_rescor(FALSE),
  d,
  backend = "cmdstanr",
  cores=4,
  drop_unused_levels = FALSE
)
# syntax errors again (y_true needs index())