Accounting for measurement error during variable selection with projpred (possibly with rstanarm or brms?)

You might be able to resolve this by using the most recent GitHub version and setting options(projpred.export_to_workers = c("loo::nlist")), or, if that does not work, by assigning nlist_glob <- loo::nlist and then setting options(projpred.export_to_workers = c("nlist_glob")).
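In code, the two variants of this workaround would look roughly as follows (a minimal sketch; the option name projpred.export_to_workers is taken from the suggestion above, and the snippet assumes the error occurs during a parallelized cv_varsel() run):

```r
# Variant 1: export loo::nlist to the parallel workers directly.
options(projpred.export_to_workers = c("loo::nlist"))

# Variant 2 (only if variant 1 does not work): create a global copy of
# loo::nlist and export that instead.
nlist_glob <- loo::nlist
options(projpred.export_to_workers = c("nlist_glob"))
```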

I’ll check that asap.

This works, thanks!

Is this not still making predictions on data that neither the (I)SPCA nor the brms model has seen before? Edit: Related to this, is there a difference in the way varsel() handles this compared to cv_varsel(), given that it works even if you don’t supply a custom ref_predfun?

Yes, I’ve spent a long time thinking about this, but it does seem equivalent to a standard machine learning workflow, where one would (assuming no hyperparameters are being tuned): (i) select features within each fold of a K-fold CV loop; (ii) build a model for each fold using the selected feature subset (on the same training data); and then (iii) predict on the left-out data. I assume that, as long as one predicts on data that neither the feature selection process nor the model has ever seen, the K-fold CV performance estimate should be unbiased?
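A minimal sketch of that workflow in generic R (not projpred-specific; select_features() and fit_model() are hypothetical placeholders for whatever feature-selection and model-fitting procedures are used):

```r
# K-fold CV with the feature selection nested inside each fold, so that the
# held-out fold is never seen by either the selection step or the model fit.
kfold_cv_with_selection <- function(dat, K = 10) {
  folds <- sample(rep(seq_len(K), length.out = nrow(dat)))
  preds <- numeric(nrow(dat))
  for (k in seq_len(K)) {
    train <- dat[folds != k, , drop = FALSE]
    test  <- dat[folds == k, , drop = FALSE]
    sel <- select_features(train)                       # (i) selection on the training fold only
    fit <- fit_model(train, features = sel)             # (ii) model fit on the same training fold
    preds[folds == k] <- predict(fit, newdata = test)   # (iii) predictions for the held-out fold
  }
  preds
}
```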

The out-of-sample performance of feature selection is explored in Pavone et al. (2023) for the bodyfat data (code here). They seem to nest an exact LOO-CV within a bootstrap to estimate the predicted outcome value (yfit) for each observation using a reference model constructed from all the other data. I’m not sure how this could be done within projpred – I suppose one could perform an inner K-fold CV loop within each iteration of cvfun?

I’d also welcome @avehtari’s thoughts on this!

This was indeed a deficiency in the internal default extract_model_data function, which is defined in init_refmodel(). I fixed this in the current GitHub version (branch master; see also PR #523). Thanks for the report and the good reprex!

It depends a bit on what you consider as data: (i) only response values, (ii) response and original predictor values, or (iii) response and ((I)S)PC-constructed predictor values.

Above, I had (iii) in mind when I mentioned the dependency between training and test data. For (iii), the brms model has indeed seen parts of this data before, because the ((I)S)PC-constructed predictor values for the left-out observations include information from the training data. However, now that you are asking this, I’m realizing that this dependency might not be an issue because the theory behind projpred conditions on the predictor data (just like many other types of analyses do). So it was probably not correct of me to think of (iii) as “data” in this case (I didn’t do this on purpose; I was simply thinking of “data” as the dataset and didn’t think it through).

For (i) and (ii), you would be right that this data has not been seen by the trained model when making predictions for the left-out observations.

Yes, there is a difference, and it explains this: since varsel() only performs in-sample predictions, newdata is never non-NULL in ref_predfun, so the adjustment I suggested above for K-fold CV is irrelevant. Hence, the internal default ref_predfun (which is very similar to your custom_ref_predfun) does not throw an error in the varsel() case.
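For illustration, a custom ref_predfun handling both cases could look roughly like this (a rough sketch only: compute_ispc_scores() is a hypothetical helper that would project new observations onto the (I)SPCA loadings estimated from the training data, and the expected orientation of the returned matrix should be double-checked against the ref_predfun requirements documented in ?init_refmodel):

```r
custom_ref_predfun <- function(fit, newdata = NULL) {
  if (!is.null(newdata)) {
    # cv_varsel() with K-fold CV: newdata holds the left-out fold, so the
    # ((I)S)PC-constructed predictor values for these observations have to be
    # computed first (hypothetical helper).
    newdata <- compute_ispc_scores(newdata)
  }
  # varsel(): newdata stays NULL and posterior_linpred() (from brms) falls
  # back to the original (in-sample) data.
  t(posterior_linpred(fit, newdata = newdata))
}
```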

Yes, for a given submodel size, the K-fold CV performance estimate should be unbiased in that case.

I don’t think this is necessary for typical projpred use-cases. In principle, an inner CV nested in an outer CV could help against the problem that the selection of the final submodel size is not cross-validated (for this, the size selection would have to be performed in each iteration of the outer CV loop, which is not straightforward if the size selection is a manual one), but I’m not sure how a CV nested within a bootstrap would help. Such a CV nested within a bootstrap could probably be used to assess the sampling distribution of a predictive performance estimator (e.g., for the ELPD) at each submodel size separately, but that is also rarely needed in practice. It has been a while since I read Pavone et al. (2023), so I currently cannot say more about this, sorry.
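For what it’s worth, here is a rough sketch of what cross-validating an automated size selection could look like in principle, using projpred’s cv_varsel(), suggest_size(), project(), and proj_linpred(). Note that ref_fit (a brms reference model fit), dat, and the use of update() to re-fit the reference model on each training fold are assumptions for illustration, not a recommended recipe:

```r
K <- 5
folds <- sample(rep(seq_len(K), length.out = nrow(dat)))
lpd_heldout <- numeric(nrow(dat))
for (k in seq_len(K)) {
  train <- dat[folds != k, , drop = FALSE]
  test  <- dat[folds == k, , drop = FALSE]
  refit_k <- update(ref_fit, newdata = train)  # re-fit the reference model on the training fold
  cvvs_k <- cv_varsel(refit_k)                 # inner CV, using the training fold only
  size_k <- suggest_size(cvvs_k)               # automated size selection within the fold
  prj_k <- project(cvvs_k, nterms = size_k)    # projection onto the selected submodel
  # Pointwise log predictive densities for the held-out fold (proj_linpred()
  # returns an lpd element when the response is included in newdata):
  lpd_heldout[folds == k] <- proj_linpred(prj_k, newdata = test, integrated = TRUE)$lpd
}
# ELPD estimate for the whole procedure, including the size selection:
elpd_with_size_selection <- sum(lpd_heldout)
```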

Thanks for your response above, @fweber144, much appreciated.

Somewhat tangentially related to this: say the cvfits list is a huge object because the individual model fits are large. Does it make sense to run several separate cv_varsel() runs, each with the same reference model fit but with a different subset of the cvfits list, and then combine the results afterwards?