How to handle NA values in multivariate models?

You have two options, as @paul.buerkner writes (he might add something to this thread): 1. multiple imputation before model fitting using MICE, or 2. imputation during model fitting. The first approach uses, by default, all the other variables in the data to impute the missing values. The second approach models the missing values according to your expertise, i.e., you decide explicitly how to model them given your assumptions (this can be done in MICE too, but I find it a bit cumbersome).

I see it like this: in the first case we throw everything at it, and in the second case we have some prior knowledge about how the missingness came to be. I strongly recommend reading the 2nd edition of Statistical Rethinking, where this is covered in Chapter 15 (missingness in discrete variables is covered there too).

In short, for your particular case I think it’s a no-brainer: go for imputation during model fitting, since all missingness seems(?) to be in the outcome variable. The model then treats the missing outcomes as unknowns and imputes them from the posterior predictive distribution (PPD), i.e., for each posterior draw of the parameters you get a plausible value for every missing outcome. Handling missingness in predictors is, imho, not as easy. There you need a model that encodes your assumptions about the missingness, e.g., was it introduced completely at random, or does it depend on other variables?
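To make the PPD idea concrete, here is a minimal sketch in plain Python. It is not what brms or Stan do internally, just the same logic in a toy setting. The assumptions are mine: a single predictor, the model y ~ Normal(a + b * (x - x_bar), sigma) with flat priors on a and b, and a known sigma, so the posterior can be sampled directly. The function name `impute_missing_outcomes` and the data are hypothetical.

```python
import random
import statistics


def impute_missing_outcomes(x, y, sigma=1.0, n_draws=2000, seed=1):
    """Draw missing outcomes (None in y) from the posterior predictive distribution.

    Assumed toy model: y ~ Normal(a + b * (x - x_bar), sigma), flat priors on
    a and b, known sigma. Only rows with an observed outcome inform a and b.
    """
    rng = random.Random(seed)
    obs_x = [xi for xi, yi in zip(x, y) if yi is not None]
    obs_y = [yi for yi in y if yi is not None]
    mis_x = [xi for xi, yi in zip(x, y) if yi is None]

    n = len(obs_y)
    x_bar = statistics.fmean(obs_x)
    y_bar = statistics.fmean(obs_y)
    sxx = sum((xi - x_bar) ** 2 for xi in obs_x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(obs_x, obs_y))
    b_hat = sxy / sxx

    draws = [[] for _ in mis_x]
    for _ in range(n_draws):
        # Posterior draw of the parameters (independent because x is centered).
        a = rng.gauss(y_bar, sigma / n ** 0.5)
        b = rng.gauss(b_hat, sigma / sxx ** 0.5)
        # Posterior predictive draw for every missing outcome.
        for j, xm in enumerate(mis_x):
            draws[j].append(rng.gauss(a + b * (xm - x_bar), sigma))
    return draws


# Hypothetical data: two outcomes are missing.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, None, 8.2, None, 12.1]
imputed = impute_missing_outcomes(x, y)
for xm, d in zip([3, 5], imputed):
    print(xm, round(statistics.fmean(d), 2), round(statistics.stdev(d), 2))
```

Each missing outcome ends up with a whole distribution of plausible values rather than a single number, and its spread is wider than sigma because the uncertainty in a and b is carried along.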

So, your question:

which values does it use to impute missing data?

is answered with: It depends on the approach you use.

Regarding your second question, I would say that uncertainty very often increases when doing imputation, but isn’t that how it should be? I feel it’s an honest representation of what we face when dealing with missing data.
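For the multiple-imputation route, one classical way to see where the extra uncertainty comes from is Rubin's rules for pooling an estimate across m imputed datasets: the total variance is the average within-imputation variance plus an inflated between-imputation variance. Below is a small sketch of that formula. The function name `pool_rubin` and the numbers are made up for illustration, and I am not claiming this is how any particular package pools results.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average squared standard error
    between = statistics.variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, within, total


# Hypothetical estimates and squared standard errors from m = 5 imputed datasets.
estimates = [0.52, 0.47, 0.55, 0.50, 0.46]
variances = [0.010, 0.011, 0.009, 0.010, 0.012]

pooled, within, total = pool_rubin(estimates, variances)
print(pooled, within, total)
```

The total variance can never be smaller than the within-imputation part, and the more the imputed datasets disagree, the larger it gets. That is the "honest" extra uncertainty.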
