Predict.brms in multivariate model with imputation


  • Operating System: Windows 10
  • brms Version: 2.6.0

I’m trying to use brms to get posterior predictions for multiple variables, some of which have missing values that are imputed during model fitting. However, predict() returns NaNs when I call it without specifying the response. It does return predictions for the first imputed variable, but any row with an NA in the imputed variable gets NAs in its predictions. Is this the expected behavior?

MWE:

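library(brms)

y <- rnorm(100)
# Set roughly 20% of x to NA; rnorm(100) (not rnorm(1)) so the observed values differ
x <- ifelse(sample(c(0, 1), size = 100, replace = TRUE, prob = c(.2, .8)) == 0, NA, rnorm(100))
z <- rnorm(100)

dat <- data.frame(x, y, z)

# x is imputed from z; y uses the (partly imputed) x as a predictor
form1 <- bf(x | mi() ~ z)
form2 <- bf(y ~ mi(x))

mod <- brm(form1 + form2, data = dat)

# New data, again with roughly 20% of x missing
newdat <- data.frame(x = ifelse(sample(c(0, 1), size = 100, replace = TRUE, prob = c(.2, .8)) == 0, NA, rnorm(100)), z = rnorm(100))

predict(mod, newdata = newdat)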

#There were 19 warnings (use warnings() to see them)

warnings()

Warning messages:
1: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
2: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
3: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
4: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
5: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
6: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
7: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
8: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
9: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
10: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
11: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
12: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
13: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
14: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
15: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
16: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
17: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
18: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced
19: In rnorm(4000L, mean = c(NA_real_, NA_real_, NA_real_, … : NAs produced

This behavior is intended, although not ideal. During sampling, missing values are estimated as additional parameters, but these parameters exist only for the original data, not for new data. The current behavior is thus to just leave them as NA.

For your example it is possible to predict x by z and then use the imputed values in x to predict y, but this only works if the “missing value graph” is acyclic, that is, if it contains no cycles (such as y ~ mi(x); x ~ mi(z); z ~ mi(y)).
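Until this is automated, the two-step idea can be done by hand. The following is only a rough sketch, reusing mod and newdat from the MWE above. It assumes that predict() with resp = "x" gives usable predictions of x from z alone in your brms version, and it fills in the posterior predictive mean, which ignores the uncertainty in the imputed values. The names pred_x, newdat_filled, miss, and pred_y are made up for illustration.

```r
library(brms)

# Step 1: predict x from z only; resp selects one response of the multivariate model
pred_x <- predict(mod, newdata = newdat, resp = "x")

# Step 2: replace only the missing x values with their predicted means
# (assumption: a point estimate is good enough; this understates uncertainty)
newdat_filled <- newdat
miss <- is.na(newdat_filled$x)
newdat_filled$x[miss] <- pred_x[miss, "Estimate"]

# Step 3: predict y using the completed x
pred_y <- predict(mod, newdata = newdat_filled, resp = "y")
head(pred_y)
```

If the uncertainty in the imputed x matters, one option would be to repeat steps 2 and 3 over individual draws from posterior_predict() rather than using the mean, at the cost of more computation.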

As a result, before I implement this automatic imputation for new data, I have to add a function which checks for cycles.
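To illustrate what such a check could look like, here is a minimal base R sketch of a depth-first search for cycles. It is not the brms implementation. The function name has_mi_cycle and the input format (a named list mapping each response to the variables it uses through mi()) are assumptions made for this example.

```r
# deps: named list; each name is a response, each element is a character
# vector of the variables that appear as mi() terms in its formula.
# (Hypothetical input format, chosen for illustration only.)
has_mi_cycle <- function(deps) {
  nodes <- unique(c(names(deps), unlist(deps, use.names = FALSE)))
  state <- setNames(rep("unvisited", length(nodes)), nodes)

  visit <- function(node) {
    if (state[[node]] == "active") return(TRUE)  # reached a node already on the current path: cycle
    if (state[[node]] == "done") return(FALSE)   # already fully explored
    state[[node]] <<- "active"
    for (nxt in deps[[node]]) {                  # deps[[node]] is NULL for pure predictors
      if (visit(nxt)) return(TRUE)
    }
    state[[node]] <<- "done"
    FALSE
  }

  for (node in nodes) {
    if (visit(node)) return(TRUE)
  }
  FALSE
}

# The model from the MWE: x depends on nothing imputed, y depends on mi(x)
has_mi_cycle(list(x = character(0), y = "x"))   # FALSE

# The cyclic example: y ~ mi(x); x ~ mi(z); z ~ mi(y)
has_mi_cycle(list(y = "x", x = "z", z = "y"))   # TRUE
```

When the graph is acyclic, the responses can be predicted one after another, starting with those that depend on no imputed variables.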

Feel free to open an issue for all of this on https://github.com/paul-buerkner/brms

That makes sense. Thanks! And thanks for brms and being very responsive.