Projpred with multiple imputations (brm_multiple) - does this example work?

It seems that using projpred (and loo) with multiple imputation isn’t fully worked out (?), but there have been some useful suggestions (mainly by @avehtari) on these forums, e.g., Projpred - model with horseshoe prior does not converge and How would LOO with multiple imputations look like?

I’ve been applying these suggestions to some education data, but want to make sure I’m not doing anything overly stupid before writing a manuscript.

Below is a simplified example (e.g., 1 chain, 2 imputed datasets) of how I’ve been using projpred with brm_multiple. I would greatly appreciate any comments, suggestions, criticisms, etc.

To get started, fit a brm_multiple and run projpred varsel on each imputed fit:

library(projpred) #version

#make imputed datasets
imp <- mice(nhanes2, m = 2, seed = 1111)

#fit brm_multiple with combine = F
fit_imp <- brm_multiple(bmi ~ age + hyp + chl, data = imp, 
                        chains = 1, combine = FALSE, seed = 1111)

#varsel for each imputed fit
cvvs_imp1 <- cv_varsel(fit_imp[[1]], method = "forward", seed = 1111)
cvvs_imp2 <- cv_varsel(fit_imp[[2]], method = "forward", seed = 1111)

#solution term rankings aggregated across imputed fits
rank_imp <- ranking(cvvs_imp1)
rank_imp$foldwise <- rbind(ranking(cvvs_imp1)$foldwise,

The above seems straightforward. Next, following How would LOO with multiple imputations look like? - #2 by avehtari, find the imputation pooled elpd (and SE) for the projpred reference model and submodels.

#pooled reference model elpd (and SE)
ref_mean_pointwise_imp <- log( (exp(cvvs_imp1$summaries$ref$lppd) + exp(cvvs_imp2$summaries$ref$lppd)) / 2) 
ref_elpd_imp <- sum(ref_mean_pointwise_imp) 

loo_se <- function(x) sd(x)/sqrt(length(x-1))*length(x)
ref_elpd.se_imp <- loo_se(ref_mean_pointwise_imp) 

# example submodel: pooled size 2 submodel elpd (and SE)
size2_mean_pointwise_imp <- log( (exp(cvvs_imp1$summaries$sub[[3]]$lppd) + exp(cvvs_imp2$summaries$sub[[3]]$lppd)) / 2) 
size2_elpd_imp <- sum(size2_mean_pointwise_imp)
size2_elpd.se_imp <- loo_se(size2_mean_pointwise_imp)

Next, find elpd difference (and SE) between projpred reference and submodels (to pick suggested model size, assuming nothing is wrong in the above).

#example: diff and for size 2 submodel vs. ref model
diff_size2 <- size2_mean_pointwise_imp-ref_mean_pointwise_imp
diff_size2_elpd <- sum(diff_size2) <- loo_se(diff_size2)

Finally, I believe post-selection inference would look like the following (pretending that the suggested model size is 2 with solution terms age & chl).

fit_imp_combined <- combine_models(fit_imp[[1]], fit_imp[[2]], check_data = F)
prj <- project(fit_imp_combined, solution_terms = c("age", "chl"))

I’m not really familiar with multiple imputation, so I cannot say whether @avehtari’s suggestion of averaging log predictive density values is that easily transferable to projpred’s framework or if instead, you would have to perform the size selection separately for each imputed dataset, thereby generating a “distribution” of selected sizes, from which you decide for a single final size. My feeling is that this would better carry forward the uncertainty due to the multiple imputation, but I might be wrong.

In any case, how do you build the “full-data” predictor ranking across the imputed datasets?

In the code following comment #solution term rankings aggregated across imputed fits, you are combining the predictor rankings not only across imputed fits, but also across CV folds. That might be fine (depending on what you want to do with that), but I just wanted to point it out because this needs to be taken into account when interpreting results based on that double combination.

(Belated) Thanks for the comments, Frank!

My first instinct was to consider the distribution of suggested sizes (as you suggested), so maybe I will stick with that route. There are times when doing this averaging over imputations can yield a different suggested size (it winds up being larger than any of the individual imputation model sizes)

And yes, thanks for pointing out that the way I’m generating the rankings over the imputations includes the CV folds. Perhaps regular varsel would be sufficient given the multiple imputations (at least in cases with many imputed datasets), but I defaulted to cv_varsel (I don’t have the expertise to know if this is a good idea).

1 Like

I think it’s a good idea to stick with cv_varsel() because overfitting (in the search and in in-sample predictions) can also occur when using imputed datasets. cv_varsel() should guard against that overfitting as well as possible.