How to handle NA values in multivariate models?

Hi everyone,

I’m fairly new to Bayesian modelling and brms, and this is my first time posting. My aim is to estimate whether there is an among-individual correlation between an animal’s activity when measured in Context 1 and when measured in Context 2. Each individual has 2 repeats for Context 1 and 6 repeats for Context 2. A simplified version of my dataset is as follows:

ID   Context1.Activity   Context2.Activity   Repeat
1            35                 NA                 1
1            NA                 51                 2
1            14                 NA                 3
1            NA                 18                 4
2            46                 NA                 1
2            NA                 50                 2
2            21                 NA                 3
2            NA                 12                 4

To estimate the among-individual correlations I am aiming to use a multivariate model (I have excluded other predictors from the example code below for simplification).

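library(brms)

# Intercept-only sub-models; the shared label "a" in (1 | a | ID) asks brms
# to estimate the correlation between the two ID-level intercepts
context1.act <- bf(Context1.Activity ~ 1 + (1 | a | ID), family = gaussian)
context2.act <- bf(Context2.Activity ~ 1 + (1 | a | ID), family = gaussian)

Model <- brm(context1.act + context2.act + set_rescor(FALSE),
             data = dat,
             cores = 4,
             chains = 4,
             warmup = 1000,
             iter = 10000)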

However, due to the NAs in the data I get the following error:

Rows containing NAs were excluded from the model. Error: All rows of 'data' were removed via 'subset'. Please make sure that variables do not contain NAs even in rows unused by the subsetted model. Please also make sure that each subset variable is TRUE for at least one observation.

I understand that Stan will remove any rows that contain NAs. I have seen that a possible solution to this may be using mi(). However, I’m not sure how to specify this in my case. Further, I’m still not exactly sure what mi() does here. Any help would be greatly appreciated!

Thanks in advance.


Hi Jack and welcome to the community!

Paul has a good vignette about this (“Handle Missing Values with brms”).

In short, for your example, the code would look like:

library(brms)

# "| mi()" marks each response as possibly missing, so rows with NA are kept
context1.act <- bf(Context1.Activity | mi() ~ 1 + (1 | a | ID), family = gaussian)
context2.act <- bf(Context2.Activity | mi() ~ 1 + (1 | a | ID), family = gaussian)

Model <- brm(context1.act + context2.act + set_rescor(FALSE),
             data = dat,
             cores = 4,
             chains = 4,
             warmup = 1000,
             iter = 10000)

So the NAs will be modeled, i.e., estimated together with the other parameters, instead of the rows being excluded.
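The original error can be reproduced without fitting anything. Here is a minimal base R sketch, assuming the example data from the question are typed in as below (the name `Repeat` for the unlabelled fourth column is an assumption). It counts how many rows have both responses observed, which are the only rows that survive when rows with NAs are dropped.

```r
# Example data from the question; `Repeat` is an assumed name for the 4th column
dat <- data.frame(
  ID = rep(1:2, each = 4),
  Context1.Activity = c(35, NA, 14, NA, 46, NA, 21, NA),
  Context2.Activity = c(NA, 51, NA, 18, NA, 50, NA, 12),
  Repeat = rep(1:4, times = 2)
)

responses <- c("Context1.Activity", "Context2.Activity")

# Rows where both responses are observed
n_complete <- sum(complete.cases(dat[, responses]))

# Missing values per response; with mi() these are estimated, not dropped
n_missing <- colSums(is.na(dat[, responses]))

n_complete
n_missing
```

Because the two contexts are never measured in the same row, no row is complete, which is why nothing was left for the model to use.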


Great! Thanks very much for your help!

Hi Richard,

Just a quick follow-up. After reading the vignette, I’m still a little unsure of how the imputation works in brms. Do you know how exactly brms imputes missing values (i.e. which values does it use to impute missing data)? Also, how would imputation affect the variance estimates and therefore the among-individual correlation? If brms is using some method to impute the missing data, would this reduce the estimated variance of that variable? Hope that makes sense. Thanks in advance!

You have two options, as @paul.buerkner writes (he might add something to this thread): 1. multiple imputation before model fitting, using the mice package, or 2. handling missingness during model fitting. By default, the first approach uses all the data to try to infer the missing values. The second approach models the missing values according to your expertise, i.e., you decide explicitly how to model them given your assumptions (this can be done in mice too, but I find it a bit cumbersome).

The way I see it, in the first case we throw everything at the problem, and in the second case we bring in some prior knowledge about how the missingness came to be. I strongly recommend reading the 2nd edition of Statistical Rethinking, where this is covered in Chapter 15 (which also covers missingness in discrete variables).

In short, for your particular case I think it’s a no-brainer: go for imputation during model fitting, since all missingness seems(?) to be in the outcome variables. The model will then infer the missing values from the posterior predictive distribution (PPD), i.e., given the PPD you can infer missing values in the outcome variable. Handling missingness in predictors is, imho, not as easy. That requires a model encoding your assumptions about the missingness, e.g., was the missingness introduced completely at random?
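To see what the model did with the missing outcomes, one option is to pull the relevant rows out of the posterior summary. This is a sketch only: it assumes `fit` is the object returned by the `brm()` call with `mi()` (called `Model` above), that brms stores the imputed responses under parameter names starting with `Ymi`, and that the among-individual correlation is stored under a name starting with `cor_ID`. `summarise_mi_fit` is a hypothetical helper, not a brms function.

```r
# Hypothetical helper (not part of brms): split the posterior summary of a
# fitted model into imputed responses and ID-level correlations.
# The "Ymi" and "cor_ID" name prefixes are assumptions; check
# rownames(brms::posterior_summary(fit)) on your own model.
summarise_mi_fit <- function(fit) {
  s <- brms::posterior_summary(fit)  # matrix: Estimate, Est.Error, Q2.5, Q97.5
  list(
    imputed = s[grepl("^Ymi", rownames(s)), , drop = FALSE],
    id_cor  = s[grepl("^cor_ID", rownames(s)), , drop = FALSE]
  )
}

# Usage, once the model has been fitted:
# out <- summarise_mi_fit(Model)
# out$id_cor    # among-individual correlation between the two contexts
# out$imputed   # one row per imputed response value
```

Each imputed value comes with its own posterior interval, so it is an estimate with uncertainty and not a single filled-in number.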

So, your question:

“which values does it use to impute missing data?”

is answered with: It depends on the approach you use.

Regarding your second question, I would say that very often uncertainty increases when doing imputation, but isn’t that how it should be? I feel it’s an honest representation of what we face when dealing with uncertainty.
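On the worry that imputation shrinks the variance: that is what happens when every missing value is replaced by one fixed number, not when missing values are drawn from a distribution. Here is a toy base R sketch with simulated data (a stand-in, not your data and not the brms internals) comparing the two.

```r
set.seed(1)

# Simulated stand-in for one response, with 80 of 200 values set to missing
y <- rnorm(200, mean = 30, sd = 10)
y_obs <- y
y_obs[sample(200, 80)] <- NA

observed <- y_obs[!is.na(y_obs)]
n_mis <- sum(is.na(y_obs))

# (a) Single imputation: plug in the observed mean for every missing value
y_mean_fill <- y_obs
y_mean_fill[is.na(y_obs)] <- mean(observed)

# (b) Draw each missing value from a normal distribution fitted to the
#     observed values, loosely mimicking a single posterior draw
y_draw_fill <- y_obs
y_draw_fill[is.na(y_obs)] <- rnorm(n_mis, mean(observed), sd(observed))

c(observed  = var(observed),
  mean_fill = var(y_mean_fill),
  draw_fill = var(y_draw_fill))
```

Mean-filling always gives a smaller variance than the observed values alone, because it adds points with zero deviation from the mean. Drawing from a distribution does not have that built-in shrinkage, and in a full Bayesian model the draws also change from one posterior sample to the next, which is where the extra uncertainty comes from.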


Great! That makes it a lot clearer. I will definitely go and read Chapter 15 of Statistical Rethinking. Thanks so much for your help!
