Accounting for measurement error during variable selection with projpred (possibly with rstanarm or brms?)

You might be able to resolve this by using the most recent GitHub version and setting options(projpred.export_to_workers = c("loo::nlist")), or, if that does not work, by assigning nlist_glob <- loo::nlist and then setting options(projpred.export_to_workers = c("nlist_glob")).
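In code, the two variants of this workaround would look roughly as follows (a minimal sketch; the option name projpred.export_to_workers is taken from the suggestion above, and the snippet assumes the error occurs during a parallelized cv_varsel() run):

```r
# Variant 1: export loo::nlist to the parallel workers directly.
options(projpred.export_to_workers = c("loo::nlist"))

# Variant 2 (only if variant 1 does not work): create a global copy of
# loo::nlist and export that instead.
nlist_glob <- loo::nlist
options(projpred.export_to_workers = c("nlist_glob"))
```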

I’ll check that asap.

This works, thanks!

Is this not still making predictions on data that neither the (I)SPCA nor the brms model has seen before? Edit: Related to this, is there a difference in the way varsel() handles this compared to cv_varsel(), given that it works even if you don’t supply a custom ref_predfun?

Yes, I’ve spent a long time thinking about this, but it does seem equivalent to a standard machine learning workflow, where one would (assuming no hyperparameters are being tuned): (i) select features within each fold of a K-fold CV loop; (ii) build a model for each fold using the selected feature subset (on the same training data); and then (iii) predict on the left-out data. I assume that, as long as one predicts on data that neither the feature selection process nor the model has ever seen, the K-fold CV performance estimate should be unbiased?
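A minimal sketch of that workflow in generic R (not projpred-specific; select_features() and fit_model() are hypothetical placeholders for whatever feature-selection and model-fitting procedures are used):

```r
# K-fold CV with the feature selection nested inside each fold, so that the
# held-out fold is never seen by either the selection step or the model fit.
kfold_cv_with_selection <- function(dat, K = 10) {
  folds <- sample(rep(seq_len(K), length.out = nrow(dat)))
  preds <- numeric(nrow(dat))
  for (k in seq_len(K)) {
    train <- dat[folds != k, , drop = FALSE]
    test  <- dat[folds == k, , drop = FALSE]
    sel <- select_features(train)                       # (i) selection on the training fold only
    fit <- fit_model(train, features = sel)             # (ii) model fit on the same training fold
    preds[folds == k] <- predict(fit, newdata = test)   # (iii) predictions for the held-out fold
  }
  preds
}
```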

The out-of-sample performance of feature selection is explored in Pavone et al. (2023) for the bodyfat data (code here). They seem to nest an exact LOO-CV within a bootstrap to estimate the predicted outcome value (yfit) for each observation using a reference model constructed from all the other data. I’m not sure how this could be done within projpred – I suppose one could perform an inner K-fold CV loop within each iteration of cvfun?

I’d also welcome @avehtari’s thoughts on this!

This was indeed a deficiency in the internal default extract_model_data function, which is defined in init_refmodel(). I fixed this in the current GitHub version (branch master; see also PR #523). Thanks for the report and the good reprex!

It depends a bit on what you consider as data: (i) only response values, (ii) response and original predictor values, or (iii) response and ((I)S)PC-constructed predictor values.

Above, I had (iii) in mind when I mentioned the dependency between training and test data. For (iii), the brms model has indeed seen parts of this data before, because the ((I)S)PC-constructed predictor values for the left-out observations include information from the training data. However, now that you are asking this, I’m realizing that this dependency might not be an issue because the theory behind projpred conditions on the predictor data (just like many other types of analyses do). So it was probably not correct of me to think of (iii) as “data” in this case (I didn’t do this on purpose; I was simply thinking of “data” as the dataset and didn’t think it through).

For (i) and (ii), you would be right that this data has not been seen by the trained model when making predictions for the left-out observations.

Yes, there is a difference, and it explains this: since varsel() only performs in-sample predictions, newdata is never non-NULL in ref_predfun, so the adjustment I suggested above for K-fold CV is irrelevant. Hence, the internal default ref_predfun (which is very similar to your custom_ref_predfun) does not throw an error in the varsel() case.
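For illustration, a custom ref_predfun handling both cases could look roughly like this (a rough sketch only: compute_ispc_scores() is a hypothetical helper that would project new observations onto the (I)SPCA loadings estimated from the training data, and the expected orientation of the returned matrix should be double-checked against the ref_predfun requirements documented in ?init_refmodel):

```r
custom_ref_predfun <- function(fit, newdata = NULL) {
  if (!is.null(newdata)) {
    # cv_varsel() with K-fold CV: newdata holds the left-out fold, so the
    # ((I)S)PC-constructed predictor values for these observations have to be
    # computed first (hypothetical helper).
    newdata <- compute_ispc_scores(newdata)
  }
  # varsel(): newdata stays NULL and posterior_linpred() (from brms) falls
  # back to the original (in-sample) data.
  t(posterior_linpred(fit, newdata = newdata))
}
```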

Yes, for a given submodel size, the K-fold CV performance estimate should be unbiased in that case.

I don’t think this is necessary for typical projpred use-cases. In principle, an inner CV nested in an outer CV could help against the problem that the selection of the final submodel size is not cross-validated (for this, the size selection would have to be performed in each iteration of the outer CV loop, which is not straightforward if the size selection is a manual one), but I’m not sure how a CV nested within a bootstrap would help. Such a CV nested within a bootstrap could probably be used to assess the sampling distribution of a predictive performance estimator (e.g., for the ELPD) at each submodel size separately, but that is also rarely needed in practice. It has been a while since I read Pavone et al. (2023), so I currently cannot say more about this, sorry.
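For what it’s worth, here is a rough sketch of what cross-validating an automated size selection could look like in principle, using projpred’s cv_varsel(), suggest_size(), project(), and proj_linpred(). Note that ref_fit (a brms reference model fit), dat, and the use of update() to re-fit the reference model on each training fold are assumptions for illustration, not a recommended recipe:

```r
K <- 5
folds <- sample(rep(seq_len(K), length.out = nrow(dat)))
lpd_heldout <- numeric(nrow(dat))
for (k in seq_len(K)) {
  train <- dat[folds != k, , drop = FALSE]
  test  <- dat[folds == k, , drop = FALSE]
  refit_k <- update(ref_fit, newdata = train)  # re-fit the reference model on the training fold
  cvvs_k <- cv_varsel(refit_k)                 # inner CV, using the training fold only
  size_k <- suggest_size(cvvs_k)               # automated size selection within the fold
  prj_k <- project(cvvs_k, nterms = size_k)    # projection onto the selected submodel
  # Pointwise log predictive densities for the held-out fold (proj_linpred()
  # returns an lpd element when the response is included in newdata):
  lpd_heldout[folds == k] <- proj_linpred(prj_k, newdata = test, integrated = TRUE)$lpd
}
# ELPD estimate for the whole procedure, including the size selection:
elpd_with_size_selection <- sum(lpd_heldout)
```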

Thanks for your response above, @fweber144, much appreciated.

Somewhat tangentially related to this: say the cvfits list is a huge object because the individual model fits are large. Does it make sense to run several separate cv_varsel() runs, each with the same reference model fit but with a different subset of the cvfits list, and then combine the results afterwards?