Usage and Interpretation of the cv_varsel Function

@fweber144 I wanted to thank you for the informative video you shared about how to use the projpred package. I found it very helpful. Moreover, I read the following paper, which includes an example and R code: Projection predictive variable selection – A review and recommendations for the practicing statistician
However, I have a few questions that I was hoping you could clarify:

  1. cv_varsel and Plot Function: When using cv_varsel and the corresponding plot function, the line under the graphic provides information about the corresponding predictor from the full-data predictor ranking and the corresponding main diagonal element from the CV ranking proportions matrix. I don’t quite understand what this information conveys. My initial thought is that with a fixed number of variables, the best model is identified through projection, ensuring it includes this specific number of variables. Using cross-validation (CV), this process is repeated across different datasets, each missing one observation. The output then shows the proportion of times each variable was selected across these datasets. Could you please confirm whether this is correct or provide further clarification?
  2. Interactions in Default Search: Are interactions included in the default search when using cv_varsel?
  3. Final Model in rstanarm: For the final model, if I select, for example, five predictors, is there any issue with estimating the final model with these five predictors in rstanarm? I am considering this approach to see if I can improve predictive accuracy by choosing a more flexible functional form.

Thank you in advance for your help and guidance.

Hi Mariana,

Thank you; I hope I can clarify these questions.

  1. Yes, your description is essentially correct. Just to formulate a few things more precisely:
    • With a fixed number of variables (which is not possible in case of L1 search, btw), the best model is identified by first projecting the reference model onto each candidate model of that size and then selecting the candidate model with minimal “expected Kullback-Leibler divergence” from the reference model (i.e., selecting the candidate model with minimal \delta_{\mathrm{KL}}^{\mathbb{E}} = \sum_{c = 1}^{C} w_{c} \frac{1}{N} \sum_{i = 1}^{N} D_{\mathrm{KL}}\!\left( p(\tilde{y}_i|\mathcal{I}^{*}_{c}) \:\middle\|\: p(\tilde{y}_i|\boldsymbol{\theta}_{c}) \right), using the notation from Projection predictive variable selection for discrete response families with finite support | Computational Statistics and letting w_{c} denote the weight of cluster c).

    • Your statement

      Using cross-validation (CV), this process is repeated across different datasets, each missing one observation.

      is correct if “this process” means the whole search procedure (i.e., from model size 0 up to model size nterms_max), not the selection of the best model at a fixed number of predictors. Btw, “each missing one observation” refers to LOO CV, but there is also K-fold CV in projpred. You were probably aware of that, but I want to highlight this for other readers.

    • In

      The output then shows the proportion of times each variable was selected across these datasets.

      I would add “at this specific position in the predictor ranking” at the end. But this only holds for cumulate = FALSE (see Plot predictive performance — plot.vsel • projpred). In case of cumulate = TRUE, this would have to be changed to “up to this specific position in the predictor ranking” (the bottom line of the x-axis label would then also mention “cumulated” in front of “CV ranking proportions matrix”). For a minimal call illustrating cumulate, see the short sketch after this list.

  2. Interactions are included in the default search if they are included in the reference model formula. (More precisely, interactions are included in the default search if they are included in argument formula of init_refmodel(), but since this usually coincides with the reference model formula (users rarely call init_refmodel() themselves), I simplified this to the reference model formula.)
  3. In general, this is undesired, as the projected posterior for a given submodel (in particular, for the final submodel) retains uncertainty from the reference model’s posterior. This uncertainty would be lost when working with the posterior of a refitted (final) submodel. Only by performing post-selection inference with the projected posterior of the final submodel do you get valid post-selection inference (more precisely, “approximately valid”, because minor overfitting could occur in the selection of the final submodel size). In your case, if you are concerned about the functional form, you could consider projpred’s experimental support for GAMs (or GAMMs, if you also have a multilevel structure). For this, see additive models in projpred: Projection predictive feature selection • projpred.
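
For other readers, here is a minimal sketch of the workflow from point 1. The data frame dat, the response y, and the predictors x1 to x3 are hypothetical; cv_method, stats, and cumulate are the cv_varsel() and plot.vsel() arguments discussed above:

  # Hypothetical reference model fit with rstanarm:
  refm_fit <- rstanarm::stan_glm(y ~ x1 + x2 + x3, data = dat, refresh = 0)
  # Variable selection with LOO CV (K-fold CV would be cv_method = "kfold"):
  cvvs <- projpred::cv_varsel(refm_fit)
  # Performance plot; with cumulate = TRUE, the bottom line of the x-axis label
  # shows cumulated CV ranking proportions instead of the main diagonal elements:
  plot(cvvs, stats = "elpd", cumulate = TRUE)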

Dear Frank,

Thank you for the helpful clarifications. I still have a few questions concerning the allowed functional forms in the reference model. You mentioned that it can include interactions; is it also possible to use the poly function to add quadratic terms? Furthermore, if I add interactions or quadratic terms, are the main effects or linear parts automatically included if the interaction or quadratic term is included?

Additionally, how should interactions be included in the formula? Should I use the interaction function, or simply use * between the terms?

Best regards, Mariana

I’ll start with the last question:

Additionally, how should interactions be included in the formula? Should I use the interaction function, or simply use * between the terms?

Simply use *. You can also use :, but then it usually makes sense to include all lower-order interaction terms (including the involved main effect terms) as well (* does this automatically).
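
To illustrate with hypothetical variable names (this is standard R formula handling, not projpred-specific):

  f1 <- y ~ x1 * x2
  f2 <- y ~ x1 + x2 + x1:x2
  # Both expand to the terms "x1", "x2", "x1:x2":
  identical(attr(terms(f1), "term.labels"), attr(terms(f2), "term.labels"))
  #> [1] TRUE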

is it also possible to use the poly function to add quadratic terms?

Yes, it is. There have been some issues with such terms, but they should be fixed now; see PR #409.

Furthermore, if I add interactions or quadratic terms, are the main effects or linear parts automatically included if the interaction or quadratic term is included?

There are two possibilities for what you could mean by “included”:

  • “included” in the reference model formula (more precisely, argument formula of init_refmodel(), see my last post),
  • “included” in a submodel formula (during projpred’s search).

If “included” relates to the reference model formula, this is a general question about such formula terms in R, but I hope I can answer it accurately:

A poly() term always includes all lower-degree terms (see ?poly). This is not the case when constructing the degree-wise terms manually (e.g., I(x^2)). For interactions, see my comment on * and : above.
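
A small illustration with hypothetical variables x and z: poly(x, 2) enters the formula as a single term covering both the linear and the quadratic part, whereas the manual construction yields separate terms.

  f1 <- y ~ poly(x, 2) + z
  f2 <- y ~ x + I(x^2) + z
  attr(terms(f1), "term.labels")
  #> [1] "poly(x, 2)" "z"
  attr(terms(f2), "term.labels")
  #> [1] "x" "I(x^2)" "z"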

If “included” relates to projpred’s search, note the following on poly() terms from the NEWS entry corresponding to PR #409:

Note that just like step() and MASS::stepAIC(), projpred’s search algorithms do not split up a poly() or polym() term into its lower-degree polynomial terms (which would be helpful, for example, if the linear part of a poly() term with degrees = 2 was relevant but the quadratic part not). Such a split-up of a poly() or polym() term needs to be performed manually (if desired).

So a poly() term (including all lower-degree terms) is either fully included or fully excluded during the search. You would need to construct the degree-wise terms manually if you want to allow them to be included / excluded separately. Manually constructed degree-wise terms (e.g., I(x^2)) are handled by projpred just like any other formula terms. For an interaction term, projpred ensures that all of its lower-order interaction terms (including the involved main effect terms) are selected first (in other words, an interaction term is only considered for selection if all of its lower-order interaction—or main effect—terms have already been selected).
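
As a rough sketch of what this means in practice (the data frame dat and the variables y, x1, x2 are hypothetical; the reference model is fitted with rstanarm as in your question). Here the quadratic part is constructed manually so that the search can drop it while keeping the linear part:

  library(rstanarm)
  library(projpred)

  # Reference model with a manually constructed quadratic term and an interaction:
  refm_fit <- stan_glm(y ~ x1 + I(x1^2) + x2 + x1:x2, data = dat, refresh = 0)

  # Forward search; I(x1^2) can be selected or dropped independently of x1,
  # whereas x1:x2 is only considered once x1 and x2 have been selected:
  vs <- varsel(refm_fit, method = "forward")
  summary(vs)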


Thank you very much!


Thank you again for your support. I have two questions regarding the standardization of predictors. Specifically, I’m referring to the use of the scale() function, as demonstrated in Projection predictive variable selection – A review and recommendations for the practicing statistician. (1) Do you recommend standardizing the predictors? If so, why? (2) Does standardization affect the results from a theoretical perspective?

In https://avehtari.github.io/modelselection/bodyfat.html#2_Bodyfat_data, @avehtari writes:

Load data and scale it. Heinze, Wallisch, and Dunkler (2018) used unscaled data, but we scale it for easier comparison of the effect sizes. In theory this scaling should not have detectable difference in the predictions and I did run the results also without scaling and there is no detectable difference in practice.

Does that help?

Thanks again, Frank, for your helpful reply! I realized my initial question wasn’t entirely clear, so I’d like to quickly follow up.

In Projective Inference in High-Dimensional Problems: Prediction and Feature Selection, I found the L1-penalized search and relaxed projection approach particularly compelling — especially the results shown in Figure 7. I’m planning to apply L1-search in my own analysis, and I’m still uncertain about one detail: The paper cites Hastie et al. (2015), which emphasizes the importance of standardizing predictors in Lasso-based methods. Since the projection framework builds on L1-penalized search, I was wondering whether standardization is also recommended or necessary in this context — even though it’s not explicitly mentioned in the paper.

Could you briefly elaborate on that? And if there’s literature that explores this point in more detail, I’d be grateful for any recommendations.


Based on my own experience, L1-search seems to work well regardless of whether the predictors are standardized — the results haven’t differed much either way. Still, I’d feel more confident relying on more than just anecdotal evidence.


Thanks for the clarification. Unfortunately, I cannot answer your question. Perhaps @jpiironen, Markus Paasiniemi (I don’t know if he’s on The Stan Forums), or @avehtari can?

Normalizing the variables to have the same scale is in general very important for L1-penalized methods. Otherwise the selection will favor the variables with large native scale (because you need only a small regression coefficient for those). This applies similarly to Lasso and L1-search for projection.
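
To spell out the scaling argument in generic Lasso notation (this is a general property of the L1 penalty, not anything projpred-specific): the L1-penalized estimate is

\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i = 1}^{N} \Big( y_i - \sum_{j = 1}^{p} \beta_j x_{ij} \Big)^{2} + \lambda \sum_{j = 1}^{p} |\beta_j| \right\}.

If a predictor x_{j} is rescaled to c \, x_{j} with c > 1, the same fit is obtained with the coefficient \beta_{j} / c, so its contribution to the penalty shrinks by the factor c; predictors with a large native scale are therefore penalized less and tend to enter the model earlier. Normalizing all predictors to a common scale removes exactly this artifact.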

I have not been involved in the development of projpred for many years, but when I coded the core parts around L1-search I was normalizing the variables automatically behind the scenes by default. The package has changed quite a bit from those days, but I had a very brief look at the current implementation, and to me it seems the normalization is still applied automatically (see the normalize=TRUE argument of the glm_elnet function: projpred/R/glmfun.R at 4efe82cf7b0ef78ce87b52045fc10208575d5f2d · stan-dev/projpred · GitHub). So unless you disable this (not sure if it’s possible), you will probably not see any difference whether you normalize the variables yourself or not (because the package does it anyway).


Thanks for the hint, @jpiironen. As you can see from my ignorance (sorry for that!), I have not worked much on glm_elnet() and glm_ridge(), so essentially, they should not have changed since you worked on them (although I cannot speak for Alejandro, I don’t think he changed anything there either).

This normalization is still applied automatically. It can be disabled by passing normalize = FALSE through the ellipsis (...) of varsel() and cv_varsel() (see also here in the docs).
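
For example, something along these lines (refm_fit is a hypothetical reference model fit; whether disabling the normalization is advisable is a separate question, see @jpiironen’s answer above):

  # L1 search with the internal predictor normalization disabled
  # (normalize is forwarded to the internal elastic-net routine via `...`):
  cvvs <- projpred::cv_varsel(refm_fit, method = "L1", normalize = FALSE)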


Thank you both for the clarification and the helpful sources!
