Using model comparison (loo or waic) after imputation

Ruben222 · October 19, 2021, 3:56pm

To manage missing values in my dataset, I have used imputation before any model fitting by following this vignette:

https://cran.r-project.org/web/packages/brms/vignettes/brms_missings.html#compatibility-with-other-multiple-imputation-packages

I have successfully fit the models I wanted to, but I am running into issues when I try to compute WAIC scores for them.

Model <- add_criterion(Model, "waic")

R gives the following warning (and often crashes at this point):

Warning: Using only the first imputed data set. Please interpret the results with caution until a more principled approach has been implemented.

This reads as if it is not using all the imputed datasets to calculate WAIC. Is model comparison not valid with imputed data? If it is not, am I doing something wrong, and is there an alternative approach I can take?

As an aside, if there is a different vignette or source where I can read more about this and understand why model comparison is or is not valid here, I would be grateful to be pointed in that direction. Any extra guidance here would be really helpful!

Thanks very much.

avehtari · October 25, 2021, 5:10pm

It’s complicated. WAIC can be considered an approximation to leave-one-out cross-validation, and it is easier to see the complexity when thinking in terms of cross-validation.

1. Do you also want to cross-validate your imputation? That is, leave out part of the data when doing the imputation as well. This would often be the correct thing to do. It would be very difficult to do with WAIC, but trivial with K-fold-CV.
2. If you don’t cross-validate the imputation, but use all the data to make the imputation and only cross-validate the models given the imputed data, you still have the complication that multiple imputation produces many imputed data sets. You would need to do cross-validation for each data set and then average the results. Averaging the cross-validation results over all the imputed data sets would be better than using just the first imputed data set, and it might be enough for you, but it would not be the same as option 1.
3. If the imputation is stable and the differences between models are big, then just using the first imputed data set can be sufficiently accurate.
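For option 2, the averaging step could look roughly like the sketch below. It assumes you have one `loo` object per imputed data set for each model, in the same order. `average_elpd_diff` is a hypothetical helper, not part of brms or loo, and averaging the pointwise elpd values before taking the difference is just one reasonable choice. Note that the standard error here does not include the variability between imputations.

```r
# Hypothetical helper: average pointwise elpd over imputed data sets,
# then compare two models on the averaged values.
# loo_a, loo_b: lists of "loo" objects (one per imputed data set, same order).
average_elpd_diff <- function(loo_a, loo_b) {
  # n x M matrix of pointwise elpd values, averaged across the M imputations
  pw <- function(loos) {
    rowMeans(do.call(cbind, lapply(loos, function(l) l$pointwise[, "elpd_loo"])))
  }
  diff_i <- pw(loo_a) - pw(loo_b)
  n <- length(diff_i)
  c(elpd_diff = sum(diff_i), se_diff = sqrt(n * var(diff_i)))
}

# Possible usage with brms. `imp` is assumed to be a mice "mids" object and
# the formulas are placeholders taken from the brms missing-data vignette.
# combine = FALSE keeps one brmsfit per imputed data set instead of pooling.
# fits_a <- brms::brm_multiple(bmi ~ age * chl, data = imp, combine = FALSE)
# fits_b <- brms::brm_multiple(bmi ~ age, data = imp, combine = FALSE)
# average_elpd_diff(lapply(fits_a, brms::loo), lapply(fits_b, brms::loo))
```

A positive `elpd_diff` favours the first model. The fits are kept separate so that each imputed data set gets its own cross-validation.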

Ruben222 · October 26, 2021, 8:02am

Thank you for your response. In the meantime, I have opted for approach 3, following some advice I was able to glean elsewhere.

Both approach 1 and 2 definitely sound better though. Do you have any information / tutorials that I could read to learn more about how to implement this?

avehtari · October 27, 2021, 7:00pm

This is probably the closest example: the loo package vignette "Holdout validation and K-fold cross-validation of Stan programs with the loo package".
It shows how to set up the CV folds. Inside the K-fold-CV loop, you would then need to run brms with imputation and make the predictions by averaging over the multiple models. I’m not certain whether brms supports MI predictions, but it might. Then collect the results.
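The structure of that loop could be sketched in base R as below. This is only a skeleton under stated assumptions: `kfold_impute_cv`, `impute_fn`, `fit_fn` and `lpd_fn` are hypothetical names, not functions from brms or loo. You would supply your own imputation step (e.g. a wrapper around mice), your own model fit (e.g. a brms call), and a function returning the log predictive density of each held-out row, including whatever handling of missing values in the held-out rows you decide on.

```r
# Numerically stable log(mean(exp(x))).
log_mean_exp <- function(x) {
  m <- max(x)
  m + log(mean(exp(x - m)))
}

# K-fold CV with the imputation redone inside every fold.
# All three *_fn arguments are user-supplied (hypothetical) functions:
#   impute_fn(train, m)   -> list of m completed training data frames
#   fit_fn(completed)     -> a fitted model
#   lpd_fn(fit, holdout)  -> log predictive density, one value per holdout row
kfold_impute_cv <- function(data, K, m, impute_fn, fit_fn, lpd_fn, seed = 1) {
  set.seed(seed)
  n <- nrow(data)
  fold <- sample(rep(seq_len(K), length.out = n))  # random fold assignment
  elpd_i <- numeric(n)
  for (k in seq_len(K)) {
    train <- data[fold != k, , drop = FALSE]
    holdout <- data[fold == k, , drop = FALSE]
    # Impute using the training part only, so held-out rows do not leak in
    completed <- impute_fn(train, m)
    fits <- lapply(completed, fit_fn)
    # rows = held-out observations, columns = imputed data sets
    lpd <- vapply(fits, function(f) lpd_fn(f, holdout), numeric(nrow(holdout)))
    lpd <- matrix(lpd, nrow = nrow(holdout))
    # Average the predictive densities over the m fits, on the log scale
    elpd_i[fold == k] <- apply(lpd, 1, log_mean_exp)
  }
  list(elpd_kfold = sum(elpd_i), se = sqrt(n * var(elpd_i)), pointwise = elpd_i)
}
```

Comparing two models would then mean running this with the same `seed` (and therefore the same folds) for each model and differencing the `pointwise` values.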
