Projpred: selection of submodel that doesn't replicate predictive performance of reference model

Hi everyone,

First of all, thank you to the developers of this great package - projpred is great, as is all of the support and materials provided.

I would like to ask your opinion on the selection of submodels. In my case, no submodel achieves the predictive performance of the reference model, as seen below.

Here are the MLPD values of the submodels:

The suggested number of predictors is also the same as in the reference model. However, if I wanted a sparser model, let’s say with 3 predictors, could I justify this decision with a metric or test statistic of some kind?

I have seen examples where the Bayes R2 is compared between the reference model and chosen submodel to justify the decision for a certain submodel. However, my model is categorical and therefore Bayes R2 cannot be used.

I thought of conducting posterior predictive checks (PPC) on the submodels, and then manually comparing them to the reference model. However, it seems that these checks cannot be performed on the submodels, because they are in the class ‘projection’.

Thanks so much!

Hi Melissa,

Does your reference model have 8 predictor terms?

I don’t quite understand how the two plots you show are connected: Is the upper one from cv_varsel() (and plotted with deltas = TRUE) whereas the lower is from varsel() (and plotted with deltas = FALSE)?

If my assumption above (plots coming from cv_varsel() and varsel()) is correct, then I would prefer the upper (cv_varsel()) plot for the decision for a submodel size. And if my assumption about the reference model having 8 predictors is correct, then that upper plot indeed does not indicate that any actual submodel achieves an MLPD that is comparable to the reference model’s MLPD (although for submodel size 7, it would be advantageous to see a "zoomed" plot or to provide the actual values underlying the plot; these can be obtained from summary.vsel()). With "comparable", I mean "reference model MLPD estimate within 1 standard error (of the MLPD difference) from the submodel MLPD estimate". However, this does not need to be the decision rule you may want to choose. For example, if you have a large number of observations, then the standard error (of the MLPD difference) can get quite small. In that case, you may want to use a different decision rule, based for example on thres_elpd = -4 or on the relative (i.e., deltas = TRUE) and perhaps also the absolute (i.e., deltas = FALSE) GMPD value (stats = "gmpd" in plot.vsel()). The GMPD makes sense here because you have a categorical response, so the GMPD is actually the geometric mean predictive probability and hence bounded by 0 and 1.
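
In code, these GMPD plots would be created roughly as follows (just a sketch; here, cvvs stands for the object returned by cv_varsel()):

# Relative scale (difference of each submodel to the reference model):
plot(cvvs, stats = "gmpd", deltas = TRUE)
# Absolute scale (geometric mean predictive probability, bounded by 0 and 1):
plot(cvvs, stats = "gmpd", deltas = FALSE)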

Posterior predictive checks based on a submodel resulting from a projection are described in projpred: Projection predictive feature selection • projpred. The proj_predict() function is the key function for this. However, for a response that is discrete and has finite support (like a categorical response), I would regard the observation-wise values underlying the GMPD (i.e., the predictive probabilities for the observed response categories) as a sufficient posterior(-projection) predictive check. Unfortunately, projpred does not (yet) provide an accessor function for retrieving such observation-wise performance values. If you want to take that approach, I can write some example code.
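
As a rough sketch of what such a check could look like (placeholder names: cvvs is a vsel object from cv_varsel(), y_obs is the vector of observed response categories, and size 3 is an arbitrary submodel size):

prj <- project(cvvs, nterms = 3)  # projection onto the submodel of size 3
yrep_prj <- proj_predict(prj)     # posterior-projection predictive draws of the response
# Simple numerical check: compare observed and predicted category frequencies:
prop.table(table(observed = y_obs))
prop.table(table(predicted = as.vector(yrep_prj)))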

Hi Frank,

Thank you very much for your quick and helpful reply.

If my assumption above (plots coming from cv_varsel() and varsel()) is correct, then I would prefer the upper (cv_varsel()) plot for the decision for a submodel size. And if my assumption about the reference model having 8 predictors is correct, then that upper plot indeed does not indicate that any actual submodel achieves an MLPD that is comparable to the reference model’s MLPD.

Your assumptions are exactly right. Thank you for your suggestions about using different decision rules for choosing the submodel. If you don’t mind writing the code, I would love to try out retrieving the observation-wise values for the GMPD.

How are those two plots different from each other? The top one has two sets of labels on the horizontal axis and three titles, the first of which matches the second plot’s.

It shouldn’t be too surprising that the best model under cross-validation uses all the covariate information you have available. What people often do in this situation is penalize model complexity, so that the optimum choice might be below the full number of covariates even if the full set of covariates is better predictively.

I don’t understand the details of any of these software packages, but I agree with @fweber (and the loo package documentation) that you also need to consider uncertainty. Here, there’s not much statistical difference among 6, 7, and 8 predictors, and only small differences with 5. If you’re thinking of this like a test, then you can choose the smallest number of predictors that’s not ā€œsignificantlyā€ different from the best model for some arbitrary significance threshold. Here that’d probably be in the 2–4 range depending on how much you penalize adding more covariates. Or you can make this a decision rule and try to figure out how much better your predictions get with more covariates vs. how much more costly they are to implement (probably not a big deal with a tiny number of covariates like 10).

The paper Advances in Projection Predictive Inference is intended to be suitable as a first introduction to the projpred software. See Section 5 for an explanation of the plot and Section 5.5 for the details on how the uncertainty is taken into account.

The first plot shows the uncertainty in the expected predictive performance difference to the reference model (uncertainty in the difference goes to 0 when we compare the reference model to itself), and the second plot shows the uncertainty in the expected predictive performance.

Unfortunately, in the first plot, the tick labels overlap. I think that the default arguments in the plotting should be text_angle = 45, alpha = 0.1, size_position = "primary_x_top", and show_cv_proportions = FALSE. With a text angle of 45, the predictor names are unlikely to overlap, and as the CV proportions are not easy to interpret, it would be better not to show them (maybe some day @fweber144 agrees with me).
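
In code, that suggestion would correspond to something like this (a sketch; cvvs stands for a vsel object from cv_varsel(), and it assumes a projpred version that already has size_position and show_cv_proportions):

plot(cvvs,
     text_angle = 45,
     alpha = 0.1,
     size_position = "primary_x_top",
     show_cv_proportions = FALSE)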

I don’t remember having objected to these things. To be more precise:

  • For text_angle = 45 and alpha = 0.1, I don’t remember that you asked for these to be the defaults (I also checked Slack, but I could not find any conversation where you asked for this). Sorry if you asked for this (e.g., during a video call) and I just didn’t realize. I’m fully open to having these as the defaults.
  • For size_position = "primary_x_top", you approved PR #484 that added argument size_position: Add argument `size_position` to `plot.vsel()` by fweber144 · Pull Request #484 · stan-dev/projpred · GitHub. I don’t remember you asking to change the default size_position to "primary_x_top" before, during, or after PR #484 (and I also could not find any conversation on Slack concerning this, apart from the discussion that led to PR #484 and a few messages following it on December 8, 2023). Sorry if I missed something. Again, I’m fully open to having this as the default.
  • For show_cv_proportions = FALSE: Argument show_cv_proportions was implemented in PR #470, following a short conversation on Slack from November 6, 2023, where you said that you would like to have an option to not show those (percentages). I did not understand the word "option" to mean that show_cv_proportions = FALSE should be the default (and I still do not understand it this way). Our Slack conversation concerning this feature ends the same day with a message from me informing you that this had been implemented in PR #470 (to which you reacted with "thumbs up"). I’m open to having show_cv_proportions = FALSE as the default.

I made an issue about the plot.vsel defaults: plot.vsel defaults · Issue #517 · stan-dev/projpred · GitHub

Thanks, everyone, for your input.

To update, I tried a few different approaches. For example, here are the absolute (i.e., with deltas = F) GMPD values for the submodels. As we can see here, no submodel achieves the performance of the full reference model.

Here are the ELPD values with deltas = T. Again, no submodel achieves the reference model’s performance.

When using suggest_size to determine the number of predictors for the preferred model, setting thres_elpd = -4 as suggested by @fweber144 and in Advances in Projection Predictive Inference only results in a submodel with 7 predictors, as opposed to the full reference model with 8 predictors.
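
For reference, that call looks roughly like this (a sketch, assuming a projpred version in which suggest_size() accepts thres_elpd; cvvs is the cv_varsel() output):

suggest_size(cvvs, stat = "elpd", thres_elpd = -4)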

As @fweber144 said, it may be that, because I have a larger dataset (1739 observations), the differences in predictive performance between the submodels are relatively small.

So to sum up, can I, for example, justify choosing a model with 3 predictors rather than 8?

  • For the relative ELPD values, a model with 3 predictors does approximately half as well as the full model.
  • For the absolute GMPD values, the normal-approximation interval for a 3-predictor submodel overlaps with that of the full reference model.

In the context of the question I am trying to answer, achieving 50% performance of the full reference model is already very attractive.

Thanks for the additional plots and information.

Based on the absolute-scale GMPD plot, I would say that 3 predictors with a GMPD of ca. 0.419 are not bad compared to the reference model GMPD of ca. 0.430 (\frac{0.419}{0.430} \approx 97\,\%; note that I read 0.419 and 0.430 from the plot, so you should use more precise values from summary.vsel() or performances() for final results). However, the relative-scale GMPD plot should be inspected, too. This is particularly important because statements about intervals overlapping a certain value, such as

For the absolute GMPD values, the normal-approximation interval for a 3-predictor submodel overlaps with that of the full reference model.

should typically be based on a relative-scale predictive performance plot, not an absolute-scale one. We have a pending PR on GitHub (#511, although it is not clear yet whether #511 or #508 is going to be merged) that combines the two scales. Perhaps such a deltas = "mixed" plot would be a good choice for you.
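
For those more precise values, something like the following should work (a sketch; cvvs is your cv_varsel() output):

# Absolute-scale GMPD values (geometric mean predictive probabilities):
summary(cvvs, stats = "gmpd", deltas = FALSE)
# Relative-scale GMPD values (GMPD ratios of the submodels vs. the reference model):
summary(cvvs, stats = "gmpd", deltas = TRUE)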

In your case, another argument in favor of 3 predictors might be that the predictive performance (at least measured by ELPD and GMPD, same should hold for MLPD if you plot that again) at submodel size 4 is slightly worse than at size 3.

In general, the ELPD has no absolute zero, so in general, statements such as ā€œhalf as wellā€ are not possible when using the ELPD. You might have taken the intercept-only model (submodel size 0) as the lower bound, but this would require a different formulation of your statement.

P.S.: I have the code for retrieving the observation-wise values for the GMPD still on my to-do list. I just haven’t had the time yet to work on that.

Here is an example:

data(VA, package = "MASS")
bfit <- brms::brm(
  formula = cell ~ treat + age + Karn,
  data = VA,
  family = brms::categorical(),
  ### Only for the sake of speed (not recommended in general):
  iter = 500,
  chains = 2,
  ###
  seed = 25569
)

library(projpred)
refm <- get_refmodel(bfit, brms_seed = 996644)
cvvs <- cv_varsel(
  refm,
  cv_method = "kfold", # or `cv_method = "loo"` (possibly in combination with `nloo` when using projpred's current GitHub version 2.8.0.9000)
  ### Only for the sake of speed (not recommended in general):
  K = 2,
  nclusters = 1,
  refit_prj = FALSE,
  ###
  seed = 1008792368
)

## Cross-validated predictive probabilities for the observed response categories
## (here abbreviated by "CV-PP"; these are the observation-wise contributions to
## the GMPD):

# For plots (if desired; this is just a suggestion):
library(ggplot2)
theme_set(theme_bw())
gg_cvpps <- function(cvpps_vec) {
  x_label_model <- if (identical(deparse(substitute(cvpps_vec)), "cvpps_ref")) {
    "reference model"
  } else if (identical(deparse(substitute(cvpps_vec)), "cvpps_sub")) {
    paste0("submodel of size ", nterms_sub)
  } else {
    stop("Unexpected `deparse(substitute(cvpps_vec))` value.")
  }
  return(ggplot(data = data.frame(cvpp = cvpps_vec), mapping = aes(x = cvpp)) +
           scale_x_continuous(labels = scales::label_percent()) +
           coord_cartesian(xlim = c(0, 1)) +
           labs(x = paste0("CV-PP from ", x_label_model)) +
           geom_dotplot(method = "histodot", binwidth = 0.018))
}

# Reference model:
cvlpps_ref <- cvvs[["summaries"]][["ref"]][["lppd"]]
cvpps_ref <- exp(cvlpps_ref)
print(gg_cvpps(cvpps_ref))

# Submodel of size `nterms_sub` (here, `nterms_sub <- 2` is an arbitrary choice;
# it is not based on a plot.vsel() inspection or any similar reasoning):
nterms_sub <- 2
cvlpps_sub <- cvvs[["summaries"]][["sub"]][[nterms_sub + 1]][["lppd"]]
cvpps_sub <- exp(cvlpps_sub)
print(gg_cvpps(cvpps_sub))

The objects cvpps_ref and cvpps_sub hold the desired observation-wise contributions to the GMPD for the reference model and a submodel (of size nterms_sub), respectively (I was not sure whether you need them for the reference model or a submodel, so I’ve demonstrated both cases).

It probably makes sense to investigate the observation-wise contributions to the GMPD also for simulated data (using your number of observations, your number of response categories etc.) to judge which values are good and which are not.
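
A minimal sketch of such a simulation (the number of categories and their probabilities are just assumptions here; plug in your own values): draw a categorical response from known probabilities and look at the predictive probabilities that the true data-generating model assigns to the realized categories.

set.seed(123)
n_obs <- 1739                            # number of observations (as in your dataset)
probs_true <- c(0.45, 0.30, 0.15, 0.10)  # assumed "true" category probabilities
y_sim <- sample(seq_along(probs_true), size = n_obs, replace = TRUE, prob = probs_true)
pp_true <- probs_true[y_sim]  # predictive probabilities for the observed categories
exp(mean(log(pp_true)))       # GMPD achieved by the true data-generating model
hist(pp_true)                 # distribution of the observation-wise contributions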

Thank you very much @fweber144, much appreciated.

I wanted to report back with the plots generated from my data for others to use. To recap, these are the cross-validated observation-wise contributions to the GMPD (i.e., the predictive probabilities for the observed response categories), which can be regarded as a posterior predictive check for models of class ‘projection’. I have a categorical reference model. Please note that I tried this code with an ordinal reference model and it doesn’t work, so this code would need to be customised to whatever model type you have.

Here is the plot for the categorical reference model with all eight predictors:

For this model, the suggested size based on a maximum difference in the ELPD between the reference model and submodel of -4 is seven predictors, as shown in the plot below.

We can see minimal difference between the two plots, which is not the case for a submodel with three predictors, as shown below.

Thanks for clarifying the interpretation of metrics such as the expected log predictive density (ELPD). Having (belatedly, sorry!) read some literature, I now understand why the ELPD has no absolute zero. This is because it is an estimate of how well the model predicts new data, rather than being related to the presence or absence of data (no data would mean an ELPD of 0, if I understand correctly).

To recap, in my case, we have an intercept-only model where the difference in the ELPD to the reference model is 144.79. With submodel size 3, the difference is 43.66 (43.66 / 144.79 ≈ 30%). Would a better formulation of a statement concerning the difference between the intercept-only model and submodel size 3 be:

Using the intercept-only model as a baseline and a reference model with eight predictors, submodel size 3 achieves approximately 70% of the performance of the reference model.

I understand the decision rules defined in your papers and the rationale behind them, and apologies if these questions are mangling the stats and maths behind these great packages. I’m just curious about how we can best interpret and use these tools in practical applications, where (unfortunately) scarce resources often mean that sparsity is preferred over accuracy.

Sorry, what do you mean by "no data"? Furthermore, in case of a discrete response, the best possible ELPD is 0 (which corresponds to a GMPD of 1 because \mathrm{GMPD} = \exp(\mathrm{MLPD}) and \mathrm{MLPD} = \frac{\mathrm{ELPD}}{N}).
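
As a tiny numerical illustration of these relations (the ELPD value is made up, merely in a plausible range given your plots):

N <- 1739
elpd <- -1465        # hypothetical ELPD
mlpd <- elpd / N     # MLPD = ELPD / N
gmpd <- exp(mlpd)    # GMPD = exp(MLPD); for a discrete response, 0 < GMPD <= 1
c(mlpd = mlpd, gmpd = gmpd)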

Guessing from your plots above, I think that by "difference", you are referring to the absolute difference here (at least when adhering to the projpred convention of comparing a submodel to the reference model, not the other way round). Otherwise, you would have to put a minus sign in front of these differences.

In my opinion, this formulation is mistakable (and as a sidenote, the term "baseline" could be confused with projpred’s argument baseline, which has a different meaning). In your case, I think I would report the GMPD of the submodel at size 3, the GMPD of the reference model (i.e., both from deltas = FALSE), and then the GMPD ratio of the submodel at size 3 vs. the reference model (i.e., from deltas = TRUE). If you want, you can also report the GMPD of the intercept-only submodel and the GMPD ratio of this model vs. the reference model.

Sorry, what do you mean by "no data"? Furthermore, in case of a discrete response, the best possible ELPD is 0 (which corresponds to a GMPD of 1 because \mathrm{GMPD} = \exp(\mathrm{MLPD}) and \mathrm{MLPD} = \frac{\mathrm{ELPD}}{N}).

Apologies for my wrangling of these terms - I am neither a statistician nor a mathematician, as you can no doubt tell - what I meant is that, because the ELPD is an estimate of the predictive performance of the model on new data, the only way the absolute ELPD would equal zero would be if we had a model with no data (which would obviously never happen).

Guessing from your plots above, I think that by "difference", you are referring to the absolute difference here (at least when adhering to the projpred convention of comparing a submodel to the reference model, not the other way round). Otherwise, you would have to put a minus sign in front of these differences.

Yes, I was referring to the differences, and should have written these values with a minus sign.

In my opinion, this formulation is mistakable (and as a sidenote, the term "baseline" could be confused with projpred’s argument baseline, which has a different meaning). In your case, I think I would report the GMPD of the submodel at size 3, the GMPD of the reference model (i.e., both from deltas = FALSE), and then the GMPD ratio of the submodel at size 3 vs. the reference model (i.e., from deltas = TRUE). If you want, you can also report the GMPD of the intercept-only submodel and the GMPD ratio of this model vs. the reference model.

Thank you for this clarification, and for all your help - much appreciated!
