It seems that using projpred (and loo) with multiple imputation isn’t fully worked out (?), but there have been some useful suggestions (mainly by @avehtari) on these forums, e.g., Projpred - model with horseshoe prior does not converge and How would LOO with multiple imputations look like?
I’ve been applying these suggestions to some education data, but want to make sure I’m not doing anything overly stupid before writing a manuscript.
Below is a simplified example (e.g., 1 chain, 2 imputed datasets) of how I’ve been using projpred with brm_multiple. I would greatly appreciate any comments, suggestions, criticisms, etc.
To get started, fit a brm_multiple and run projpred varsel on each imputed fit:
library(mice)
library(brms)
library(projpred) #version 2.5.0.9000
#make imputed datasets
imp <- mice(nhanes2, m = 2, seed = 1111)
#fit brm_multiple with combine = F
fit_imp <- brm_multiple(bmi ~ age + hyp + chl, data = imp,
chains = 1, combine = FALSE, seed = 1111)
#varsel for each imputed fit
cvvs_imp1 <- cv_varsel(fit_imp[[1]], method = "forward", seed = 1111)
cvvs_imp2 <- cv_varsel(fit_imp[[2]], method = "forward", seed = 1111)
#solution term rankings aggregated across imputed fits
rank_imp <- ranking(cvvs_imp1)
rank_imp$foldwise <- rbind(ranking(cvvs_imp1)$foldwise,
ranking(cvvs_imp2)$foldwise)
plot(rank_imp)
The above seems straightforward. Next, following How would LOO with multiple imputations look like? - #2 by avehtari, find the imputation pooled elpd (and SE) for the projpred reference model and submodels.
#pooled reference model elpd (and SE)
ref_mean_pointwise_imp <- log( (exp(cvvs_imp1$summaries$ref$lppd) + exp(cvvs_imp2$summaries$ref$lppd)) / 2)
ref_elpd_imp <- sum(ref_mean_pointwise_imp)
loo_se <- function(x) sd(x)/sqrt(length(x-1))*length(x)
ref_elpd.se_imp <- loo_se(ref_mean_pointwise_imp)
# example submodel: pooled size 2 submodel elpd (and SE)
size2_mean_pointwise_imp <- log( (exp(cvvs_imp1$summaries$sub[[3]]$lppd) + exp(cvvs_imp2$summaries$sub[[3]]$lppd)) / 2)
size2_elpd_imp <- sum(size2_mean_pointwise_imp)
size2_elpd.se_imp <- loo_se(size2_mean_pointwise_imp)
Next, find elpd difference (and SE) between projpred reference and submodels (to pick suggested model size, assuming nothing is wrong in the above).
#example: diff and diff.se for size 2 submodel vs. ref model
diff_size2 <- size2_mean_pointwise_imp-ref_mean_pointwise_imp
diff_size2_elpd <- sum(diff_size2)
diff_size2_elpd.se <- loo_se(diff_size2)
Finally, I believe post-selection inference would look like the following (pretending that the suggested model size is 2 with solution terms age & chl).
fit_imp_combined <- combine_models(fit_imp[[1]], fit_imp[[2]], check_data = F)
prj <- project(fit_imp_combined, solution_terms = c("age", "chl"))