Projpred: selection of submodel that doesn't replicate predictive performance of reference model

Hi everyone,

First of all, thank you to the developers of this great package - projpred is great, as are all of the support and materials provided.

I would like to ask your opinion on the selection of submodels. In my case, no submodel achieves the predictive performance of the reference model, as seen below.

Here are the MLPD values of the submodels:

[plots of MLPD vs. submodel size, not reproduced here]

The suggested submodel size is also the same as the number of predictors in the reference model. However, if I wanted a sparser model, say with 3 predictors, could I justify that decision with some kind of metric or test statistic?

I have seen examples where the Bayes R2 is compared between the reference model and chosen submodel to justify the decision for a certain submodel. However, my model is categorical and therefore Bayes R2 cannot be used.

I thought of conducting posterior predictive checks (PPCs) on the submodels and then manually comparing them to the reference model. However, it seems that these checks cannot be performed on the submodels, because those objects are of class ‘projection’.

Thanks so much!

Hi Melissa,

Does your reference model have 8 predictor terms?

I don’t quite understand how the two plots you show are connected: Is the upper one from cv_varsel() (and plotted with deltas = TRUE) whereas the lower is from varsel() (and plotted with deltas = FALSE)?

If my assumption above (plots coming from cv_varsel() and varsel()) is correct, then I would prefer the upper (cv_varsel()) plot for the decision for a submodel size. And if my assumption about the reference model having 8 predictors is correct, then that upper plot indeed does not indicate that any actual submodel achieves an MLPD that is comparable to the reference model’s MLPD (although for submodel size 7, it would be advantageous to see a “zoomed” plot or to look at the actual values underlying the plot, which can be obtained from summary.vsel()).

With “comparable”, I mean “reference model MLPD estimate within 1 standard error (of the MLPD difference) from the submodel MLPD estimate”. However, this does not need to be the decision rule you may want to choose. For example, if you have a large number of observations, then the standard error (of the MLPD difference) can get quite small. In that case, you may want to use a different decision rule, based for example on thres_elpd = -4 or on the relative (i.e., deltas = TRUE) and perhaps also the absolute (i.e., deltas = FALSE) GMPD value (stats = "gmpd" in plot.vsel()). The GMPD makes sense here because you have a categorical response, so the GMPD is actually the geometric mean predictive probability and hence bounded by 0 and 1.
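For illustration, a rough sketch of how you could inspect these quantities (assuming your cv_varsel() output is stored in an object called cvvs; that name is just a placeholder):

```r
library(projpred)

## Actual values underlying the plot, on the relative (deltas = TRUE) scale:
print(summary(cvvs, stats = "mlpd", deltas = TRUE))

## Relative GMPD, i.e., the geometric mean predictive probability
## (bounded by 0 and 1 for a categorical response):
plot(cvvs, stats = "gmpd", deltas = TRUE)

## Size suggestion based on an ELPD-difference threshold instead of the
## default one-standard-error rule:
suggest_size(cvvs, stat = "elpd", thres_elpd = -4)
```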

Posterior predictive checks based on a submodel resulting from a projection are described in the main projpred vignette (“Projection predictive feature selection”). The proj_predict() function is the key function for this. However, for a response that is discrete and has finite support (like a categorical response), I would regard the observation-wise values underlying the GMPD (i.e., the predictive probabilities for the observed response categories) as a sufficient posterior(-projection) predictive check. Unfortunately, projpred does not (yet) provide an accessor function for retrieving such observation-wise performance values. If you want to take that approach, I can write some example code.
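As a starting point, here is a rough sketch of the proj_predict() route (the object name cvvs for the cv_varsel() output, the data frame dat, the response column y, and the submodel size 3 are placeholders):

```r
library(projpred)

## Project the reference model's posterior onto, e.g., the 3-term submodel:
prj <- project(cvvs, nterms = 3)

## Draws from the projected posterior predictive distribution
## (one row per projected draw, one column per observation):
prj_pred <- proj_predict(prj, .seed = 123)

## These draws can then be compared to the observed response dat$y,
## e.g., with bayesplot::ppc_bars() after coding the categories as integers.
```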