Projpred with multiple imputations (brm_multiple) - does this example work?

Andrew_Y · May 20, 2023, 1:32am

It seems that using projpred (and loo) with multiple imputation isn’t fully worked out (?), but there have been some useful suggestions (mainly by @avehtari) on these forums, e.g., Projpred - model with horseshoe prior does not converge and How would LOO with multiple imputations look like?

I’ve been applying these suggestions to some education data, but want to make sure I’m not doing anything overly stupid before writing a manuscript.

Below is a simplified example (e.g., 1 chain, 2 imputed datasets) of how I’ve been using projpred with brm_multiple. I would greatly appreciate any comments, suggestions, criticisms, etc.

To get started, fit a brm_multiple and run projpred varsel on each imputed fit:

library(mice)
library(brms)
library(projpred) #version 2.5.0.9000

#make imputed datasets
imp <- mice(nhanes2, m = 2, seed = 1111)

#fit brm_multiple with combine = F
fit_imp <- brm_multiple(bmi ~ age + hyp + chl, data = imp, 
                        chains = 1, combine = FALSE, seed = 1111)

#varsel for each imputed fit
cvvs_imp1 <- cv_varsel(fit_imp[[1]], method = "forward", seed = 1111)
cvvs_imp2 <- cv_varsel(fit_imp[[2]], method = "forward", seed = 1111)

#solution term rankings aggregated across imputed fits
rank_imp <- ranking(cvvs_imp1)
rank_imp$foldwise <- rbind(ranking(cvvs_imp1)$foldwise,
                           ranking(cvvs_imp2)$foldwise)
plot(rank_imp)

The above seems straightforward. Next, following How would LOO with multiple imputations look like? - #2 by avehtari, find the imputation pooled elpd (and SE) for the projpred reference model and submodels.

#pooled reference model elpd (and SE)
ref_mean_pointwise_imp <- log( (exp(cvvs_imp1$summaries$ref$lppd) + exp(cvvs_imp2$summaries$ref$lppd)) / 2) 
ref_elpd_imp <- sum(ref_mean_pointwise_imp) 

loo_se <- function(x) sd(x)/sqrt(length(x-1))*length(x)
ref_elpd.se_imp <- loo_se(ref_mean_pointwise_imp) 

# example submodel: pooled size 2 submodel elpd (and SE)
size2_mean_pointwise_imp <- log( (exp(cvvs_imp1$summaries$sub[[3]]$lppd) + exp(cvvs_imp2$summaries$sub[[3]]$lppd)) / 2) 
size2_elpd_imp <- sum(size2_mean_pointwise_imp)
size2_elpd.se_imp <- loo_se(size2_mean_pointwise_imp)

Next, find elpd difference (and SE) between projpred reference and submodels (to pick suggested model size, assuming nothing is wrong in the above).

#example: diff and diff.se for size 2 submodel vs. ref model
diff_size2 <- size2_mean_pointwise_imp-ref_mean_pointwise_imp
diff_size2_elpd <- sum(diff_size2)
diff_size2_elpd.se <- loo_se(diff_size2)

Finally, I believe post-selection inference would look like the following (pretending that the suggested model size is 2 with solution terms age & chl).

fit_imp_combined <- combine_models(fit_imp[[1]], fit_imp[[2]], check_data = F)
prj <- project(fit_imp_combined, solution_terms = c("age", "chl"))

fweber144 · May 22, 2023, 1:02pm

I’m not really familiar with multiple imputation, so I cannot say whether @avehtari’s suggestion of averaging log predictive density values is that easily transferable to projpred’s framework or if instead, you would have to perform the size selection separately for each imputed dataset, thereby generating a “distribution” of selected sizes, from which you decide for a single final size. My feeling is that this would better carry forward the uncertainty due to the multiple imputation, but I might be wrong.

In any case, how do you build the “full-data” predictor ranking across the imputed datasets?

In the code following comment #solution term rankings aggregated across imputed fits, you are combining the predictor rankings not only across imputed fits, but also across CV folds. That might be fine (depending on what you want to do with that), but I just wanted to point it out because this needs to be taken into account when interpreting results based on that double combination.

Andrew_Y · June 14, 2023, 7:23pm

(Belated) Thanks for the comments, Frank!

My first instinct was to consider the distribution of suggested sizes (as you suggested), so maybe I will stick with that route. There are times when doing this averaging over imputations can yield a different suggested size (it winds up being larger than any of the individual imputation model sizes)

And yes, thanks for pointing out that the way I’m generating the rankings over the imputations includes the CV folds. Perhaps regular varsel would be sufficient given the multiple imputations (at least in cases with many imputed datasets), but I defaulted to cv_varsel (I don’t have the expertise to know if this is a good idea).

fweber144 · June 15, 2023, 12:00pm

I think it’s a good idea to stick with cv_varsel() because overfitting (in the search and in in-sample predictions) can also occur when using imputed datasets. cv_varsel() should guard against that overfitting as well as possible.

Topic		Replies	Views
Brm_multiple object and projpred brms projpred	0	545	November 18, 2021
A few projpred questions General	1	664	May 19, 2018
Issues with c++ and brms_multiple brms	31	1545	March 9, 2021
Brm_multiple warning with fitted() brms	4	733	September 2, 2019
Brm_multiple not converging though separate brm models do brms	3	1270	May 7, 2019

Projpred with multiple imputations (brm_multiple) - does this example work?

Related topics