How to handle NA values in multivariate models?

Jack_A · October 21, 2020, 3:52am

Hi everyone,

I’m fairly new to Bayesian modelling and brms, and this is my first time posting. My aim is to estimate whether there are among-individual correlations between an animal’s activity when measured in Context 1 and when measured in Context 2. Each individual has 2 repeats for Context 1 and 6 repeats for Context 2. A simplified version of my dataset is as follows:

ID   Context1.Activity   Context2.Activity   Experimental.day
1            35                 NA                 1
1            NA                 51                 2
1            14                 NA                 3
1            NA                 18                 4
2            46                 NA                 1
2            NA                 50                 2
2            21                 NA                 3
2            NA                 12                 4

To estimate the among-individual correlations I am aiming to use a multivariate model (I have excluded other predictors from the example code below for simplification).

context1.act <- bf(Context1.Activity ~ Experimental.day + (1|a|ID) , family = gaussian)
context2.act <- bf(Context2.Activity ~ Experimental.day + (1|a|ID) , family = gaussian)

Model <- brm(context1.act + context2.act + set_rescor(FALSE),
             data = dat,
             cores = 4,
             chains = 4,
             warmup = 1000,
             iter = 10000)

However, due to the NAs in the data I get the following error:

Rows containing NAs were excluded from the model. Error: All rows of 'data' were removed via 'subset'. Please make sure that variables do not contain NAs even in rows unused by the subsetted model. Please also make sure that each subset variable is TRUE for at least one observation.

I understand that brms will remove any rows that contain NAs before the data reach Stan. I have seen that a possible solution to this may be using mi(). However, I’m not sure how to specify this in my case. Further, I’m still not exactly sure what mi() does here. Any help would be greatly appreciated!

Thanks in advance.

torkar · October 21, 2020, 8:22am

Hi Jack and welcome to the community!

Paul has a good vignette about this, “Handle Missing Values with brms”.

In short, for your example, the code would look like:

context1.act <- bf(Context1.Activity | mi() ~ Experimental.day + (1|a|ID) , family = gaussian)
context2.act <- bf(Context2.Activity | mi() ~ Experimental.day + (1|a|ID) , family = gaussian)

Model <- brm(context1.act + context2.act + set_rescor(FALSE),
             data = dat,
             cores = 4,
             chains = 4,
             warmup = 1000,
             iter = 10000)

So NAs will be modeled instead of excluded.
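One way to see what "modeled" means here, without compiling or fitting anything, is to generate the Stan program and look for the missing-value parameters. This is only a sketch under assumptions: dat is rebuilt from the simplified table in the first post, and the "Ymi_" prefix is how brms-generated code names imputed responses as far as I know, so check the pattern against your own brms version.

```r
library(brms)

# Toy data rebuilt from the simplified table in the first post
dat <- data.frame(
  ID = rep(1:2, each = 4),
  Context1.Activity = c(35, NA, 14, NA, 46, NA, 21, NA),
  Context2.Activity = c(NA, 51, NA, 18, NA, 50, NA, 12),
  Experimental.day = rep(1:4, times = 2)
)

f <- bf(Context1.Activity | mi() ~ Experimental.day + (1 | a | ID)) +
  bf(Context2.Activity | mi() ~ Experimental.day + (1 | a | ID)) +
  set_rescor(FALSE)

# Generate the Stan program only; nothing is compiled or sampled here
scode <- make_stancode(f, data = dat)
code_lines <- strsplit(as.character(scode), "\n")[[1]]

# Lines mentioning the (assumed) "Ymi_" prefix for imputed responses
ymi_lines <- grep("Ymi_", code_lines, value = TRUE)
print(ymi_lines)
```

If lines show up in the parameters block, the missing responses are being estimated like any other unknown rather than dropped.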

Jack_A · October 21, 2020, 9:47pm

Great! Thanks very much for your help!

Jack_A · October 26, 2020, 10:15pm

Hi Richard,

Just a quick follow-up. After reading the vignette, I’m still a little unsure of how the imputation works in brms. I’m wondering if you know how exactly brms imputes missing values (i.e. which values does it use to impute missing data?). Also, I’m wondering how imputation would affect the variance/estimates and therefore the among-individual correlation? If brms is using some method to impute the missing data, would this therefore reduce the estimates of variance in that variable? Hope that makes sense. Thanks in advance!

torkar · October 27, 2020, 6:52am

You have two options, as @paul.buerkner writes in the vignette (he might add something to this thread): (1) multiple imputation before model fitting, using mice, or (2) handling missingness during model fitting. The first approach by default uses all the data to impute the missing values; the second models the missingness according to your expertise, i.e., you decide explicitly how to model the missing values given your assumptions (this can be done in mice too, but I find it a bit cumbersome).

I see it like this: in the first case we throw everything at it, and in the second case we use some prior knowledge about how the missingness came to be. I strongly recommend reading the 2nd edition of Statistical Rethinking, where this is covered in Chapter 15 (missingness in discrete variables is covered too).

In short, for your particular case I think it’s a no-brainer: go for imputation during model fitting, since all the missingness seems to be in the outcome variables. The model then imputes the missing outcomes from the posterior predictive distribution (PPD). Handling missingness in predictors is, imho, not as easy: you then need a model that encodes your assumptions about the missingness, e.g., was it introduced completely at random?

So, your question:

which values does it use to impute missing data?

is answered with: It depends on the approach you use.

Regarding your second question, I would say that very often uncertainty increases when doing imputation, but isn’t that how it should be? I feel it’s an honest representation of what we face when dealing with uncertainty.
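To make the variance worry concrete, here is a small base-R sketch with simulated numbers (hypothetical, not the OP's data). Filling every gap with a single value such as the observed mean does shrink the variance. Model-based imputation with mi() works differently as I understand it: each missing response is treated as an unknown and sampled together with the other parameters, so every posterior draw carries a different plausible value. The rnorm() call below is only a crude stand-in for one such draw.

```r
set.seed(1)

# Simulated activity scores (hypothetical values for illustration only)
y <- rnorm(200, mean = 50, sd = 10)
missing_idx <- sample(length(y), 80)
y_obs <- y
y_obs[missing_idx] <- NA

# Single-value imputation: every gap gets the observed mean
y_mean <- y_obs
y_mean[missing_idx] <- mean(y_obs, na.rm = TRUE)

# Crude stand-in for one predictive draw: sample each gap from a normal
# with the observed mean and sd (an assumption, not what brms does exactly)
y_draw <- y_obs
y_draw[missing_idx] <- rnorm(
  length(missing_idx),
  mean = mean(y_obs, na.rm = TRUE),
  sd = sd(y_obs, na.rm = TRUE)
)

var_obs <- var(y_obs, na.rm = TRUE)
var_mean <- var(y_mean)
var_draw <- var(y_draw)

# Mean imputation is always below the observed variance;
# the predictive draw keeps the spread on roughly the same scale
print(c(observed = var_obs, mean_imputed = var_mean, predictive_draw = var_draw))
```

The mean-imputed variance is the observed variance scaled by (n_obs - 1) / (n - 1), which is why it can only go down.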

Jack_A · October 27, 2020, 9:54pm

Great! That makes it a lot clearer. I will definitely go and read Chapter 15 of Statistical Rethinking. Thanks so much for your help!

amynang · June 18, 2025, 7:45pm

Do I understand correctly that brms will exclude a row if it contains a missing response value regardless of whether that response is relevant for a specific formula?

In the OP’s example data, NAs alternate between the two responses and it looks like all rows are excluded as a result!
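A quick base-R check of that reading, using the simplified table from the first post. Here complete.cases() is just my stand-in for row-wise exclusion across both responses, not a call that brms is confirmed to use internally.

```r
# Toy data rebuilt from the simplified table in the first post
dat <- data.frame(
  ID = rep(1:2, each = 4),
  Context1.Activity = c(35, NA, 14, NA, 46, NA, 21, NA),
  Context2.Activity = c(NA, 51, NA, 18, NA, 50, NA, 12),
  Experimental.day = rep(1:4, times = 2)
)

responses <- c("Context1.Activity", "Context2.Activity")

# Rows with both responses observed (what row-wise exclusion would keep)
n_complete <- sum(complete.cases(dat[, responses]))

# Rows usable if each response were filtered on its own
n_per_response <- colSums(!is.na(dat[, responses]))

print(n_complete)      # 0: no row has both responses
print(n_per_response)  # 4 usable rows for each response
```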

Is it possible to make brms exclude rows separately for the two formulas so that it includes all non-NA focal response rows regardless of NAs in the other response and vice versa?

It never occurred to me that this is what happens; it explains why I have a hard time fitting a multivariate version of two models that fit just fine in isolation (:

Ax3man · June 18, 2025, 10:03pm

Yes, see this simple example:

library(brms)

mtcars2 <- mtcars
mtcars2$mpg[1] <- NA
mtcars2$qsec[2] <- NA

m1 <- brm(
  bf(mpg ~ 1) + bf(qsec ~ 1) + set_rescor(TRUE), 
  mtcars2, backend = 'cmdstanr'
)

Now compare nrow(mtcars2), which is 32, with the number of observations used by the model, nobs(m1), which is 30.

There are two ways around this. First, you can use the subset() addition term, but then you can no longer estimate residual correlations (not sure about group-level correlations?):

m2 <- brm(
  bf(mpg | subset(!is.na(mpg)) ~ 1) + 
    bf(qsec | subset(!is.na(qsec)) ~ 1) + 
    set_rescor(FALSE), 
  mtcars2, backend = 'cmdstanr'
)

or you can use mi() as described above:

m3 <- brm(
  bf(mpg | mi() ~ 1) + bf(qsec | mi() ~ 1) + set_rescor(TRUE), 
  mtcars2, backend = 'cmdstanr'
)

I checked the Stan code and Stan data, and there is indeed only one missing-value index passed to Stan for each outcome, so the model will include all available data. I think this is usually the way to go.
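For anyone who wants to repeat that check without compiling a model, here is a rough sketch using make_standata(). Rather than hard-coding element names, it lists whatever entries contain "mi_", since the exact names may differ between brms versions.

```r
library(brms)

mtcars2 <- mtcars
mtcars2$mpg[1] <- NA
mtcars2$qsec[2] <- NA

f <- bf(mpg | mi() ~ 1) + bf(qsec | mi() ~ 1) + set_rescor(TRUE)

# Build the data list brms would pass to Stan; no compilation or sampling
sdat <- unclass(make_standata(f, data = mtcars2))

# Entries related to missing values (name pattern is an assumption)
mi_names <- grep("mi_", names(sdat), value = TRUE)
str(sdat[mi_names])

# Entries that look like observation counts, to compare with nrow(mtcars2)
n_names <- grep("^N", names(sdat), value = TRUE)
str(sdat[n_names])
```

If the per-response counts equal nrow(mtcars2) and each response has a single missing index, all rows are being kept.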

Ax3man · June 19, 2025, 5:22pm

I should clarify that this is also true with set_rescor(FALSE).
