First of all, thank you to the developers of this great package - projpred is great, as are all of the support and materials provided.
I would like to ask your opinion on the selection of submodels. In my case, no submodel achieves the predictive performance of the reference model, as seen below.
The suggested number of predictors is also the same as in the reference model. However, if I wanted a sparser model, let's say with 3 predictors, could I justify this decision with a metric or test statistic of some kind?
I have seen examples where the Bayes R2 is compared between the reference model and chosen submodel to justify the decision for a certain submodel. However, my model is categorical and therefore Bayes R2 cannot be used.
I thought of conducting posterior predictive checks (PPC) on the submodels, and then manually comparing them to the reference model. However, it seems that these checks cannot be performed on the submodels, because they are in the class "projection".
I don't quite understand how the two plots you show are connected: Is the upper one from cv_varsel() (and plotted with deltas = TRUE) whereas the lower is from varsel() (and plotted with deltas = FALSE)?
If my assumption above (plots coming from cv_varsel() and varsel()) is correct, then I would prefer the upper (cv_varsel()) plot for the decision on a submodel size. And if my assumption about the reference model having 8 predictors is correct, then that upper plot indeed does not indicate that any actual submodel achieves an MLPD that is comparable to the reference model's MLPD (although for submodel size 7, it would be advantageous to see a "zoomed" plot or to look at the actual values underlying the plot; these can be obtained from summary.vsel()). By "comparable", I mean "reference model MLPD estimate within 1 standard error (of the MLPD difference) from the submodel MLPD estimate". However, this does not have to be the decision rule you choose. For example, if you have a large number of observations, then the standard error (of the MLPD difference) can get quite small. In that case, you may want to use a different decision rule, based for example on thres_elpd = -4 or on the relative (i.e., deltas = TRUE) and perhaps also absolute (i.e., deltas = FALSE) GMPD value (stats = "gmpd" in plot.vsel()). The GMPD makes sense here because you have a categorical response, so the GMPD is actually the geometric mean predictive probability and hence bounded by 0 and 1.
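For illustration, a minimal sketch of how these alternatives could be inspected (assuming `cvvs` denotes the cv_varsel() output underlying your upper plot):
# Relative (submodel vs. reference model) and absolute GMPD plots:
plot(cvvs, stats = "gmpd", deltas = TRUE)
plot(cvvs, stats = "gmpd", deltas = FALSE)
# Actual values underlying the plots (e.g., for "zooming in" on submodel size 7):
summary(cvvs, stats = c("mlpd", "gmpd"), type = c("mean", "se", "diff", "diff.se"))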
Posterior predictive checks based on a submodel resulting from a projection are described in projpred: Projection predictive feature selection • projpred. The proj_predict() function is the key function for this. However, for a response that is discrete and has finite support (like a categorical response), I would regard the observation-wise values underlying the GMPD (i.e., the predictive probabilities for the observed response categories) as a sufficient posterior(-projection) predictive check. Unfortunately, projpred does not (yet) provide an accessor function for retrieving such observation-wise performance values. If you want to take that approach, I can write some example code.
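For example, such a check could look roughly like this (a sketch only; `cvvs` is assumed to be your cv_varsel() output, `y` your observed categorical response, and the submodel size of 3 is just a placeholder):
prj <- project(cvvs, nterms = 3)  # projection onto the submodel of size 3
prj_draws <- proj_predict(prj)    # draws from the posterior-projection predictive distribution
# Compare predicted and observed category frequencies "by hand":
round(proportions(table(prj_draws)), 2)
round(proportions(table(y)), 2)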
Thank you very much for your quick and helpful reply.
If my assumption above (plots coming from cv_varsel() and varsel()) is correct, then I would prefer the upper (cv_varsel()) plot for the decision on a submodel size. And if my assumption about the reference model having 8 predictors is correct, then that upper plot indeed does not indicate that any actual submodel achieves an MLPD that is comparable to the reference model's MLPD.
Your assumptions are exactly right. Thank you for your suggestions about using different decision rules for choosing the submodel. If you don't mind writing the code, I would love to try out retrieving the observation-wise values for the GMPD.
How are those two plots different from each other? The top one has two sets of labels on the horizontal axis and three titles, the first of which matches the second plot's.
It shouldn't be too surprising that the best model under cross-validation uses all the covariate information you have available. What people often do in this situation is penalize model complexity, so that the optimum choice might be below the full number of covariates even if the full set of covariates is better predictively.
I don't understand the details of any of these software packages, but I agree with @fweber (and the loo package documentation) that you also need to consider uncertainty. Here, there's not much statistical difference among 6, 7, and 8 predictors, and only small differences with 5. If you're thinking of this like a test, then you can choose the smallest number of predictors that's not "significantly" different from the best model for some arbitrary significance threshold. Here that'd probably be in the 2–4 range depending on how much you penalize adding more covariates. Or you can make this a decision rule and try to figure out how much better your predictions get with more covariates vs. how much more costly they are to implement (probably not a big deal with a tiny number of covariates like 10).
The paper Advances in Projection Predictive Inference aims to be suitable as a first introduction to the projpred software. See Section 5 for an explanation of the plot and Section 5.5 for the details on how the uncertainty is taken into account.
The first plot shows the uncertainty in the expected predictive performance difference to the reference model (uncertainty in the difference goes to 0 when we compare the reference model to itself), and the second plot shows the uncertainty in the expected predictive performance.
Unfortunately, in the first plot, the tick labels overlap. I think that the default arguments in the plotting should be text_angle = 45, alpha = 0.1, size_position = "primary_x_top", show_cv_proportions = FALSE. With text_angle = 45, the predictor names are unlikely to overlap, and as the CV proportions are not easy to interpret, it would be better not to show them (maybe some day @fweber144 agrees with me).
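Until (or unless) these become the defaults, they can be set explicitly; a sketch, assuming `cvvs` is the cv_varsel() output:
plot(cvvs, text_angle = 45, alpha = 0.1, size_position = "primary_x_top",
     show_cv_proportions = FALSE)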
I don't remember having objected to these things. To be more precise:
For text_angle = 45 and alpha = 0.1, I don't remember that you asked for these to be the defaults (I also checked Slack, but I could not find any conversation where you asked for this). Sorry if you asked for this (e.g., during a video call) and I just didn't realize. I'm fully open to having these as defaults.
For size_position = "primary_x_top", you approved PR #484 that added argument size_position: Add argument `size_position` to `plot.vsel()` by fweber144 · Pull Request #484 · stan-dev/projpred · GitHub. I don't remember you asking to change the default size_position to "primary_x_top" before, during or after PR #484 (and I also could not find any conversation on Slack concerning this, apart from the discussion that led to PR #484 and a few messages following it on December 8, 2023). Sorry if I missed something. Again, I'm fully open to having this as the default.
For show_cv_proportions = FALSE: Argument show_cv_proportions was implemented in PR #470, following a short conversation on Slack from November 6, 2023, where you said that you would like to have an option to not show those (percentages). I did not understand the word "option" in the way that show_cv_proportions = FALSE should be the default (and I still do not understand it this way). Our Slack conversation concerning this feature ends the same day with a message from me where I informed you that this has been implemented in PR #470 (to which you reacted with "thumbs up"). I'm open to having show_cv_proportions = FALSE as the default.
To update, I tried a few different approaches. For example, here are the absolute (i.e., with deltas = FALSE) GMPD values for the submodels. As we can see here, no submodel achieves the performance of the full reference model.
When using suggest_size() to determine the number of predictors for the preferred model, setting thres_elpd = -4, as suggested by @fweber144 and in Advances in Projection Predictive Inference, only results in a submodel with 7 predictors, as opposed to the full reference model with 8 predictors.
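A sketch of such a call (assuming `cvvs` is the cv_varsel() output):
suggest_size(cvvs, stat = "elpd", thres_elpd = -4)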
As @fweber144 said, it may be that, because I have a larger dataset (1739 observations), the differences in predictive performance between the submodels are relatively small.
So to sum up, can I, for example, justify choosing a model with 3 predictors rather than 8?
For the relative ELPD values, a model with 3 predictors does approximately half as well as the full model.
For the absolute GMPD values, the normal-approximation interval for a 3-predictor submodel overlaps with that of the full reference model.
In the context of the question I am trying to answer, achieving 50% performance of the full reference model is already very attractive.
Based on the absolute-scale GMPD plot, I would say that 3 predictors with a GMPD of ca. 0.419 are not bad compared to the reference model GMPD of ca. 0.430 (\frac{0.419}{0.430} \approx 97\,\%; note that I read 0.419 and 0.430 from the plot, so you should use more precise values from summary.vsel() or performances() for final results). However, the relative-scale GMPD plot should be inspected, too. This is particularly important because statements about intervals overlapping a certain value such as
should typically be based on a relative-scale predictive performance plot, not an absolute-scale one. We have a pending PR on GitHub (#511, although it is not clear yet whether #511 or #508 is going to be merged) that combines the two scales. Perhaps such a deltas = "mixed" plot would be a good choice for you.
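A sketch for retrieving those more precise values (assuming `cvvs` is the cv_varsel() output; performances() may require a recent projpred version):
summary(cvvs, stats = "gmpd", deltas = FALSE)  # absolute GMPD values
summary(cvvs, stats = "gmpd", deltas = TRUE)   # GMPD ratios vs. the reference model
performances(cvvs)                             # accessor for the performance values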
In your case, another argument in favor of 3 predictors might be that the predictive performance (at least measured by ELPD and GMPD, same should hold for MLPD if you plot that again) at submodel size 4 is slightly worse than at size 3.
The ELPD has no absolute zero, so in general, statements such as "half as well" are not possible when using the ELPD. You might have taken the intercept-only model (submodel size 0) as the lower bound, but this would require a different formulation of your statement.
P.S.: I have the code for retrieving the observation-wise values for the GMPD still on my to-do list. I just haven't had the time yet to work on that.
data(VA, package = "MASS")
bfit <- brms::brm(
  formula = cell ~ treat + age + Karn,
  data = VA,
  family = brms::categorical(),
  ### Only for the sake of speed (not recommended in general):
  iter = 500,
  chains = 2,
  ###
  seed = 25569
)
library(projpred)
refm <- get_refmodel(bfit, brms_seed = 996644)
cvvs <- cv_varsel(
  refm,
  cv_method = "kfold", # or `cv_method = "loo"` (possibly in combination with `nloo` when using projpred's current GitHub version 2.8.0.9000)
  ### Only for the sake of speed (not recommended in general):
  K = 2,
  nclusters = 1,
  refit_prj = FALSE,
  ###
  seed = 1008792368
)
## Cross-validated predictive probabilities for the observed response categories
## (here abbreviated by "CV-PP"; these are the observation-wise contributions to
## the GMPD):
# For plots (if desired; this is just a suggestion):
library(ggplot2)
theme_set(theme_bw())
gg_cvpps <- function(cvpps_vec) {
  x_label_model <- if (identical(deparse(substitute(cvpps_vec)), "cvpps_ref")) {
    "reference model"
  } else if (identical(deparse(substitute(cvpps_vec)), "cvpps_sub")) {
    paste0("submodel of size ", nterms_sub)
  } else {
    stop("Unexpected `deparse(substitute(cvpps_vec))` value.")
  }
  return(ggplot(data = data.frame(cvpp = cvpps_vec), mapping = aes(x = cvpp)) +
           scale_x_continuous(labels = scales::label_percent()) +
           coord_cartesian(xlim = c(0, 1)) +
           labs(x = paste0("CV-PP from ", x_label_model)) +
           geom_dotplot(method = "histodot", binwidth = 0.018))
}
# Reference model:
cvlpps_ref <- cvvs[["summaries"]][["ref"]][["lppd"]]
cvpps_ref <- exp(cvlpps_ref)
print(gg_cvpps(cvpps_ref))
# Submodel of size `nterms_sub` (here, `nterms_sub <- 2` is an arbitrary choice;
# it is not based on a plot.vsel() inspection or any similar reasoning):
nterms_sub <- 2
cvlpps_sub <- cvvs[["summaries"]][["sub"]][[nterms_sub + 1]][["lppd"]]
cvpps_sub <- exp(cvlpps_sub)
print(gg_cvpps(cvpps_sub))
The objects cvpps_ref and cvpps_sub hold the desired observation-wise contributions to the GMPD for the reference model and a submodel (of size nterms_sub), respectively (I was not sure whether you need them for the reference model or a submodel, so I've demonstrated both cases).
It probably makes sense to investigate the observation-wise contributions to the GMPD also for simulated data (using your number of observations, your number of response categories etc.) to judge which values are good and which are not.
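Such a simulation could look roughly as follows (a sketch; the category probabilities are made-up placeholders, only the number of observations matches your real data set):
set.seed(123)
n_obs <- 1739                             # number of observations (as in the real data set)
probs <- c(0.35, 0.25, 0.20, 0.12, 0.08)  # hypothetical true category probabilities
y_sim <- sample(length(probs), size = n_obs, replace = TRUE, prob = probs)
cvpps_oracle <- probs[y_sim]              # predictive probabilities of an "oracle" model
exp(mean(log(cvpps_oracle)))              # corresponding GMPD (an upper benchmark)
1 / length(probs)                         # GMPD of uniform guessing (a naive benchmark)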
I wanted to report back with the plots generated from my data for others to use. To recap, these are the cross-validated observation-wise contributions to the GMPD (i.e., the predictive probabilities for the observed response categories), which can be considered as a posterior predictive check for models of the class "projection". I have a categorical reference model. Please note, I tried this code with an ordinal reference model and it doesn't work, so this code would need to be customised to whatever model type you have.
Here is the plot for the categorical reference model with all eight predictors:
For this model, the suggested size, based on a threshold of -4 for the ELPD difference between the submodel and the reference model (thres_elpd = -4), is seven predictors, as shown in the plot below.
Thanks for clarifying the interpretation of metrics such as the expected log predictive density (ELPD). Having (belatedly, sorry!) read some literature, I now understand why the ELPD has no absolute zero. This is because it is an estimate of how well the model predicts new data, rather than being related to the presence or absence of data (no data would mean an ELPD of 0, if I understand correctly).
To recap, in my case, we have an intercept-only model where the difference in the ELPD to the reference model is 144.79. With submodel size 3, the difference is 43.66 (43.66/144.79 ≈ 30%). Would a better formulation of a statement concerning the difference between the intercept-only model and submodel size 3 be:
Using the intercept-only model as a baseline and a reference model with eight predictors, submodel size 3 achieves approximately 70% of the performance of the reference model.
I understand the decision rules defined in your papers and the rationale behind them, and apologies if these questions are mangling the stats and maths behind these great packages. I'm just curious about how we can best interpret and use these tools in practical applications, where (unfortunately) scarce resources often mean that sparsity is preferred over accuracy.
Sorry, what do you mean by "no data"? Furthermore, in case of a discrete response, the best possible ELPD is 0 (which corresponds to a GMPD of 1 because \mathrm{GMPD} = \exp(\mathrm{MLPD}) and \mathrm{MLPD} = \frac{\mathrm{ELPD}}{N}).
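To illustrate that relationship with the numbers reported above (the ELPD difference of -43.66 for the size-3 submodel vs. the reference model and N = 1739 observations):
exp(-43.66 / 1739)  # GMPD ratio of the size-3 submodel vs. the reference model, ca. 0.975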
Guessing from your plots above, I think that by "difference", you are referring to the absolute difference here (at least when adhering to the projpred convention of comparing a submodel to the reference model, not the other way round). Otherwise, you would have to put a minus sign in front of these differences.
In my opinion, this formulation could be misunderstood (and as a sidenote, the term "baseline" could be confused with projpred's argument baseline which has a different meaning). In your case, I think I would report the GMPD of the submodel at size 3, the GMPD of the reference model (i.e., both from deltas = FALSE) and then the GMPD ratio of the submodel at size 3 vs. the reference model (i.e., from deltas = TRUE). If you want, you can also report the GMPD of the intercept-only submodel and the GMPD ratio of this model vs. the reference model.
Sorry, what do you mean by "no data"? Furthermore, in case of a discrete response, the best possible ELPD is 0 (which corresponds to a GMPD of 1 because \mathrm{GMPD} = \exp(\mathrm{MLPD}) and \mathrm{MLPD} = \frac{\mathrm{ELPD}}{N}).
Apologies for my wrangling of these terms - I am neither a statistician nor a mathematician, as you can no doubt tell - what I meant is that because the ELPD is an estimate of the predictive performance of the model on new data, the only way the absolute ELPD would equal zero would be if we had a model with no data (which would obviously never happen).
Guessing from your plots above, I think that by "difference", you are referring to the absolute difference here (at least when adhering to the projpred convention of comparing a submodel to the reference model, not the other way round). Otherwise, you would have to put a minus sign in front of these differences.
Yes, I was referring to the differences, and should have written these values with a minus sign.
In my opinion, this formulation could be misunderstood (and as a sidenote, the term "baseline" could be confused with projpred's argument baseline which has a different meaning). In your case, I think I would report the GMPD of the submodel at size 3, the GMPD of the reference model (i.e., both from deltas = FALSE) and then the GMPD ratio of the submodel at size 3 vs. the reference model (i.e., from deltas = TRUE). If you want, you can also report the GMPD of the intercept-only submodel and the GMPD ratio of this model vs. the reference model.
Thank you for this clarification, and for all your help - much appreciated!