New paper on variable and structure selection for GLMMs and GAMMs by @AlejandroCatalina, @paul.buerkner and I: Projection predictive inference for generalized linear and additive multilevel models

Projection predictive variable selection has been shown to be stable and able to find small models with good predictive performance

Our projpred package supported before just generalized linear models. @AlejandroCatalina did great work in extending the methods to
handle also GLMMs and GAMMs and running many experiments to show that the extended method really works.

The reference can be defined with R formula syntax for GLMMs and GAMMs, forward search through the projected models finds a simpler model structure that has similar predictive performance as the full model, and we get a gnew formula.

The benefit of finding automatically a simpler model structure and formula is is that the modeller can focus more quickly on what is relevant. The forward search (or any search through the model space) is usually associated with overfitting. The projection predictive approach avoids the overfitting by examining only the projected submodels and not fitting each model independently to the data.

Looks like great work - I am looking forward to have this on CRAN!

One Q: Is it possible to define some predictors which are always part of the model? This is useful to include domain knowledge into the models as I find. In case this is not yet there, would this be interesting and be worth an issue to follow this along?

Exactly! As Aki points out, the user can set serch_terms as argument to varsel and cv_varsel. I do realise as well that the text could be more informative so I’ll work on that. This argument is expected to be a vector of small formula components that can be added. For instance, for a simple model y ~ x1 + x2 + x3, the default search_terms is ["x1", "x2", "x3"], if you want to have x1 fixed for all projections the search_terms arguments should be ["x1", "x1 + x2", "x1 + x3"]. You can do the same for multilevel models, taking into account that a term (u | g) requires the main effect u to be added so the right expression should be u + (u | g). Does this clear things up a bit?

This argument is more flexible than just fixing a component as it allows expressing any kind of search terms, so we can have a more niche argument only for fixed terms across projections that could be simpler to set up, something like fixed_terms = ["x1"] that is automatically added to every search step. Would this make more sense?

Something like fixed_terms would be extemely valuable in the case when there are a large number of possible predictors (say tens or hundreds) but we want to keep some covariates fixed, as I can’t see from your description how one could concisely list all of them with search_terms.

Yes, I realised while writing my answer that search_terms is very flexible but tedious for this particular task. Nonetheless, something like lapply(terms, function(t) paste(t, "+ ", paste(fixed_terms, collapse = " + "))) could possibly work in the meantime, assuming terms is already a vector of individual components to add to the projection.

Are you saying that for y ~ f1+ x2 + x3 + x4, where we want to keep f1 fixed, it’s enough for search_term to contain ["f1", "f1 + x2", "f1 + x3", "f1 + x4"], to also allow for f1 + x2 + x3 and f1 + x2 + x4 and f1 + x3 + x4? Otherwise you can see the combinatorial explosion with the growing number of predictors.

Yes it would be enough as search_terms are considered minimal components of the model that can be further combined, so the forward search tries all search_terms over and over, so to say, removing from it the chosen term at each iteration. Does that make sense?

When I say minimal it’s only in the case of normal variable selection, of course in the case of ["f1 + x1", "f1 + x2", ...] it’s no longer minimal as f1 can be split from those terms, but the principle is the same.