New projpred is now on CRAN

The latest version of projpred is now available on CRAN!

The latest release includes the following changes.

We have fully rewritten the internals in several ways. Most importantly, we now delegate maximum likelihood estimation to third-party packages, depending on the reference model’s family. This allows a lot of flexibility and extensibility for various models. Functionality-wise, the major updates since the last release are:

  • Added support for GLMMs and GAMMs via lme4 and gamm4.
  • Internal formula-syntax support, which makes it easier to build upon projections.
  • Thanks to the above, we save computation by considering only sensible projections during the forward search instead of fitting every possible submodel.
  • We have added a new argument, search_terms, that allows the user to specify custom building blocks for the projections. This can be used, for instance, to include fixed terms across all projections. A new vignette is coming up.
  • We have fully changed the way custom reference models are defined. The user now provides projection fitting and prediction functions (more information in an upcoming vignette).
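As a rough illustration of the search_terms idea, one could force a fixed term into every candidate submodel by listing it in each search term. This is a hypothetical sketch, not from the release notes: the model `fit`, the data `dat`, and the exact term syntax are assumptions.

```r
library(rstanarm)
library(projpred)

# Reference model with several predictors; `dat` is assumed to exist
fit <- stan_glm(y ~ x1 + x2 + x3, data = dat, refresh = 0)

# Force x1 into every projection by including it in each search term
vs <- varsel(
  fit,
  search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
)
```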

More documentation and vignettes to come.


I was wondering (neither @avehtari nor @paul.buerkner knew): can the new projpred project onto a linear model if the original formula is y ~ s(x), or does the original formula need to be more like y ~ x + s(x)?

I haven’t personally tried it, but my guess is it would work by passing search_terms = c("x") to either varsel or cv_varsel. It might also work to hack the formula slot of the refmodel object, setting ref$formula <- y ~ x, even though the reference model is a GAM.
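In R syntax, the two suggested workarounds would look roughly like this. This is an untested sketch; `fit` is assumed to be the fitted GAM reference model with formula y ~ s(x).

```r
library(projpred)

# Workaround 1: restrict the forward search to the linear term
vs <- varsel(fit, search_terms = c("x"))

# Workaround 2: overwrite the formula slot of the reference model
# object, even though the underlying fit is a GAM
ref <- get_refmodel(fit)
ref$formula <- y ~ x
vs <- varsel(ref)
```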


It seems like a linear (sub)model is an important special case of a GAMM, so should we recommend the y ~ x + s(x) form so that projpred can find linearity without any additional arguments or hacks?

Wouldn’t that suffer from high correlations as well? I’ll try projecting y ~ x + s(x) and see, but some tests I ran resulted in bad maximum likelihood estimates, I believe because they are harder to find. If the reference model incorporates and properly identifies both terms, then both would be projected. Maybe what happens is that fitting y ~ s(x) already fits x + s(x)?


Hmm. It seems that y ~ x + s(x) does not sample well, even though mgcv is producing a rank-deficient spline basis. But varsel(..., search_terms = list("x")) on a model with y ~ s(x) yields

Error in sub["kl", i] : incorrect number of dimensions

Am I doing it wrong? There does not seem to be an example of using the search_terms argument.

Yes, that was my concern; my experiments with models of the type y ~ x + s(x) didn’t sample well either. My guess is that s(x) already includes the unpenalized linear term in libraries like mgcv or gamm4, but I haven’t dug into that.

I’ll confirm tomorrow how to properly set the search_terms argument for varsel; it’s a parameter that we haven’t really used yet, so its usage is a bit unexplored. I will add more examples and documentation for it.

Thanks for reporting!


I finally came back to this. It turns out that search_terms must include the intercept, so the correct syntax would be search_terms = c("1", "x"). This is not very intuitive for the user, so I’ll change it to detect whether the intercept is included and add it automatically otherwise.
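A minimal sketch of the corrected call, assuming a fitted reference model `fit` with formula y ~ s(x) (untested):

```r
library(projpred)

# The intercept "1" must currently be listed explicitly
# alongside the linear term
vs <- varsel(fit, search_terms = c("1", "x"))
```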

Sorry for the late response.