Estimating missing response data in multivariate model

Hello,
I’m attempting to fit a multivariate model with brms, with the aim of estimating (imputing) missing data for one of the response variables.
My question is, will brms use information from non-missing response variables when estimating missing values with the predict function?

Here’s my model:

bf1mi <- bf(y1 ~ (1|ID|i))
bf2mi <- bf(y2|mi() ~ (1|ID|i))

m2 <- brm(bf1mi + bf2mi,  family = gaussian(), data = dat2mi)

Where variable y1 is complete and y2 has some missing (NA) entries. variable i is just an indicator for the data rows, included so we can model correlated responses with ID (I think I’ve got that right?).

As I see it there are two ways to get estimates of the missing data.
(1) We could fit the model to all the data and estimate the random intercepts. Then use predict to estimate the missing values of y. This seems to work well in my numerical tests.
(2) We could fit the model to just to the complete observations (ie rows with both y1 and y2), then estimate the missing values of y2 based on the estimated correlation with y1. I’m attempting to do this with predict.brmsfit() but I’m not sure it is doing as I’ve intended.

I’m on:

  • Windows 10
    *brms 2.10.0

Thank you
Chris

reproducible code:

n <- 100
z <- rnorm(n, 0, 1)

dat2 <- data.frame(y1 = rnorm(n, 8*z),
                  y2 = rnorm(n, -10*z),
                  i = factor(1:n))
dat2mi <- dat2
dat2mi$y2[1:20] <- NA #make some missing entries 

bf1mi <- bf(y1 ~ (1|ID|i))
bf2mi <- bf(y2|mi() ~ (1|ID|i))

m2 <- brm(bf1mi + bf2mi, 
          family = gaussian(),
          data = dat2mi, chains = 2, cores = 2)

Hi,
on first glance, your model looks OK Also my first guess is that (1) would be generally preferable - although you won’t get any information about the correlation between y1 and y2 from the missing observations, you’ll get some information about the variability in y1. But I am no expert on this, so I’d rather check with @paul.buerkner for a second opinion.

What I am also not sure about, is whether using correlation is the best way to impute missing values. Maybe having something like y2 | mi() ~ y1 + 1||i could work better, but once again, I am not very sure about this.

Finally, you are using only two chains - is this because the model takes long to run? If so, are all the diagnostics OK?

Also, I’ve edited your post to better format source code (you can use triple backticks - ``` to do this)

Hope that helps

I would agree that (1) seems preferable.

Cheers, thanks both.
@martinmodrak 2 chains just to get a quick result for the test, I’m using more for the real data.
I like the suggestion of y2 | mi() ~ y1 + 1||i that would be a SEM right? (if we also model y1 as a second response).
Thanks,
Chris

Not sure, I’ve always found those labels confusing and never managed to learn them right, so can’t help here, sorry :-)