Projection predictive variable and structure selection for GLMMs and GAMMs

avehtari · October 16, 2020, 10:03am

New paper on variable and structure selection for GLMMs and GAMMs by @AlejandroCatalina, @paul.buerkner and I: Projection predictive inference for generalized linear and additive multilevel models

Projection predictive variable selection has been shown to be stable and able to find small models with good predictive performance

Our projpred package supported before just generalized linear models. @AlejandroCatalina did great work in extending the methods to
handle also GLMMs and GAMMs and running many experiments to show that the extended method really works.

The reference can be defined with R formula syntax for GLMMs and GAMMs, forward search through the projected models finds a simpler model structure that has similar predictive performance as the full model, and we get a gnew formula.

The benefit of finding automatically a simpler model structure and formula is is that the modeller can focus more quickly on what is relevant. The forward search (or any search through the model space) is usually associated with overfitting. The projection predictive approach avoids the overfitting by examining only the projected submodels and not fitting each model independently to the data.

To be able to support R formula syntax for GLMMs and GAMMs, @AlejandroCatalina refactored the projpred. The new code supporting GLMMs and GAMMs is not yet in CRAN, but is available in master branch at https://github.com/stan-dev/projpred. See also a vignette showing examples how to use projpred for GLMMs
https://mc-stan.org/projpred/articles/quickstart_glmm.html

EDIT: correct link for the vignette

wds15 · October 16, 2020, 10:41am

Looks like great work - I am looking forward to have this on CRAN!

One Q: Is it possible to define some predictors which are always part of the model? This is useful to include domain knowledge into the models as I find. In case this is not yet there, would this be interesting and be worth an issue to follow this along?

paul.buerkner · October 16, 2020, 11:08am

I am not sure it is possible already (@AlejandroCatalina knows better) but in any case, this should not be too complicated to add.

avehtari · October 16, 2020, 11:09am

See option
search_terms: User defined list of terms to consider for selection.
(@AlejandroCatalina the help text could be more informative)

AlejandroCatalina · October 16, 2020, 11:47am

Exactly! As Aki points out, the user can set serch_terms as argument to varsel and cv_varsel. I do realise as well that the text could be more informative so I’ll work on that. This argument is expected to be a vector of small formula components that can be added. For instance, for a simple model y ~ x1 + x2 + x3, the default search_terms is ["x1", "x2", "x3"], if you want to have x1 fixed for all projections the search_terms arguments should be ["x1", "x1 + x2", "x1 + x3"]. You can do the same for multilevel models, taking into account that a term (u | g) requires the main effect u to be added so the right expression should be u + (u | g). Does this clear things up a bit?

This argument is more flexible than just fixing a component as it allows expressing any kind of search terms, so we can have a more niche argument only for fixed terms across projections that could be simpler to set up, something like fixed_terms = ["x1"] that is automatically added to every search step. Would this make more sense?

mcol · October 16, 2020, 12:31pm

Something like fixed_terms would be extemely valuable in the case when there are a large number of possible predictors (say tens or hundreds) but we want to keep some covariates fixed, as I can’t see from your description how one could concisely list all of them with search_terms.

AlejandroCatalina · October 16, 2020, 12:38pm

Yes, I realised while writing my answer that search_terms is very flexible but tedious for this particular task. Nonetheless, something like lapply(terms, function(t) paste(t, "+ ", paste(fixed_terms, collapse = " + "))) could possibly work in the meantime, assuming terms is already a vector of individual components to add to the projection.

mcol · October 16, 2020, 12:52pm

Are you saying that for y ~ f1+ x2 + x3 + x4, where we want to keep f1 fixed, it’s enough for search_term to contain ["f1", "f1 + x2", "f1 + x3", "f1 + x4"], to also allow for f1 + x2 + x3 and f1 + x2 + x4 and f1 + x3 + x4? Otherwise you can see the combinatorial explosion with the growing number of predictors.

AlejandroCatalina · October 16, 2020, 1:24pm

Yes it would be enough as search_terms are considered minimal components of the model that can be further combined, so the forward search tries all search_terms over and over, so to say, removing from it the chosen term at each iteration. Does that make sense?

When I say minimal it’s only in the case of normal variable selection, of course in the case of ["f1 + x1", "f1 + x2", ...] it’s no longer minimal as f1 can be split from those terms, but the principle is the same.

Topic		Replies	Views
Projpred 1.0.0 and paper "Projective Inference in High-dimensional Problems: Prediction and Feature Selection" Publicity	0	616	October 10, 2018
New projpred is now on CRAN Publicity projpred	7	1123	November 9, 2020
Projpred: Projection Predictive Feature Selection now in CRAN Publicity	2	803	February 20, 2018
How are sub-models fitted in Projection predictive inference (projpred)? General projpred	4	306	March 20, 2024
Variable selection with CAR (conditional autoregressive) component using the projpred package Developers gaussian-process , projpred , brms	9	664	October 2, 2022

Projection predictive variable and structure selection for GLMMs and GAMMs

Related topics