Wishlist for projpred

As a user of projpred who would like to be able to use it more, I’ve been thinking about a few things. I hope they are relevant to the design of the next version of projpred, or to the discussion in a projpred session at StanCon (I won’t be there!).

1. Definition of the starting model for selection

Currently the search always starts from a null model and builds up from there. This is not a good setup in cases in which one wishes to enrich an existing baseline model (which may contain already known confounders or covariates of value).

One way to achieve this is the penalty argument: one could set a penalty of 0 for the variables that should always be included, and these will be picked first. Unfortunately, at the moment this works only with the L1 search, not with forward selection.

At some point I managed to cook up a patch that allowed setting a penalty of 0 for forward selection, but it was not an elegant solution. One point to understand (and perhaps that’s why it was not implemented in the first place, @avehtari?) is the following: is there any meaning for a non-zero (and non-infinite) penalty in forward selection?
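To make the request concrete, here is a minimal sketch in Python of a forward search that starts from a fixed baseline set instead of the null model. It is not projpred code: `forward_search`, the `score` callback and the toy loss are hypothetical names made up for illustration, and the real search scores submodels through the projection rather than through a lookup table.

```python
from typing import Callable, Iterable, List, Optional, Sequence


def forward_search(
    candidates: Iterable[str],
    score: Callable[[Sequence[str]], float],
    baseline: Sequence[str] = (),
    max_terms: Optional[int] = None,
) -> List[str]:
    """Greedy forward selection starting from `baseline`, not the empty model.

    `score` returns a loss for a list of terms (lower is better).
    The baseline terms are always kept and always come first in the path.
    """
    selected = list(baseline)
    remaining = [c for c in candidates if c not in selected]
    limit = len(selected) + len(remaining) if max_terms is None else max_terms
    while remaining and len(selected) < limit:
        # Add the candidate that gives the lowest loss when joined to the current set.
        best = min(remaining, key=lambda term: score(selected + [term]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy loss (an assumption for the demo): each term reduces the loss by a fixed amount.
weights = {"age": 0.1, "sex": 0.2, "x1": 3.0, "x2": 2.0, "x3": 1.0}


def toy_loss(terms: Sequence[str]) -> float:
    return 10.0 - sum(weights[t] for t in terms)


path_with_baseline = forward_search(weights, toy_loss, baseline=["age", "sex"])
path_from_null = forward_search(weights, toy_loss)
```

With the baseline, the known covariates `age` and `sex` lead the search path even though they contribute least; starting from the null model, they would be picked last.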

2. Opportunities for parallelism

I don’t think anything in projpred is parallelised. There are spots in which it’s almost trivial to parallelise computation, such as when projecting each of the posterior samples for the non-gaussian case (project_nongaussian in projfun.R), or running multiple cross-validation folds (kfold_varsel() in cv_varsel.R). The looping over the candidate variables for forward selection happens in the C++ code, so perhaps that’s less straightforward.
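As an illustration of why the cross-validation folds are an easy target, here is a small Python sketch in which each fold is an independent task handed to a worker pool. Everything in it is a stand-in: `make_folds`, `run_fold` and `kfold_parallel` are hypothetical names, the "model" is just a mean, and a thread pool is used only to keep the example self-contained (for CPU-bound work one would normally reach for a process pool or, in R, something like the future package).

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import List, Sequence, Tuple

Fold = Tuple[List[int], List[int]]


def make_folds(n: int, k: int) -> List[Fold]:
    """Split indices 0..n-1 into k (train, test) pairs by striding."""
    folds = []
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        folds.append((train, test))
    return folds


def run_fold(y: Sequence[float], fold: Fold) -> float:
    """Stand-in for one fold: 'fit' a mean on train, return the test MSE."""
    train, test = fold
    fit = mean(y[j] for j in train)
    return mean((y[j] - fit) ** 2 for j in test)


def kfold_parallel(y: Sequence[float], k: int, workers: int = 4) -> List[float]:
    """Run the k folds concurrently; results come back in fold order."""
    folds = make_folds(len(y), k)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda fold: run_fold(y, fold), folds))


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
parallel_results = kfold_parallel(y, k=3)
serial_results = [run_fold(y, fold) for fold in make_folds(len(y), 3)]
```

The folds share no state, so the parallel and serial results are identical; only the wall-clock time changes.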

3. Feedback to users (but also testing and coverage)

This last point is of less importance, but in some way it’s the easiest to address. The package is very flexible, which allows a user to fine-tune different aspects of the algorithm. Unfortunately, this means that inputs must be checked carefully and errors should be reported to the user sooner with a helpful message rather than after a long computation with a crash or an impenetrable error.
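A minimal sketch of what "fail early with a helpful message" could look like, written in Python for illustration only. The function `check_search_args`, its argument names and the method labels are hypothetical and do not correspond to projpred's actual interface; the last rule simply mirrors the current limitation mentioned under point 1.

```python
from typing import Optional, Sequence

# Hypothetical labels for the two search strategies discussed above.
VALID_METHODS = ("l1", "forward")


def check_search_args(
    n_terms: int,
    method: str,
    penalty: Optional[Sequence[float]] = None,
) -> None:
    """Validate inputs before any expensive computation starts.

    Raises ValueError with a specific message on the first problem found.
    """
    if not isinstance(n_terms, int) or n_terms < 1:
        raise ValueError(f"n_terms must be a positive integer, got {n_terms!r}")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    if penalty is not None:
        if len(penalty) != n_terms:
            raise ValueError(
                f"penalty has length {len(penalty)} but there are {n_terms} terms"
            )
        if any(p < 0 for p in penalty):
            raise ValueError("penalty values must be non-negative")
        if method == "forward":
            raise ValueError("penalty is only supported with the 'l1' search")


# A valid call returns silently; an invalid one fails before the search begins.
check_search_args(3, "l1", penalty=[0.0, 1.0, 1.0])
```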

One way to try to get a handle on that is to expand the test suite, so that a larger portion of the possible code paths is covered. So, in my mind, addressing this has an added benefit for developers too, as having more tests and coverage gives more confidence when making changes.

Hope this helps,

Marco

Thanks @mcol for the feedback, and I’m happy to know you have found the package useful. We are in the process of refactoring projpred, and the work will continue after @AlejandroCatalina comes back from his vacation.

  1. The new refactoring makes this easier. Also the new code understands e.g. formulas, factor type variables and multilevel models.
  2. After the refactoring, we’ll look into parallelization using an approach similar to the one in the brms and loo packages.
  3. The refactoring should fix many problems, and as soon as we get an alpha version out, we’ll invite you to test it and report any remaining problems.

Thanks, Aki, this is all exciting news! I’m not too familiar with how brms handles parallelization through the future package, but rstanarm’s approach of handling cores when there are possibly nested parallel regions is nice.

Back to work again, and sorry for the long wait.

  1. As Aki mentioned, the new design should make this a lot easier. Allowing formula syntax makes it almost straightforward to set a starting list of included terms. All this will make a lot more sense when we release the first alpha version and you can take a direct look at some examples. That said, we haven’t implemented this option yet, although it wouldn’t take much.
  2. Parallelisation is indeed a hot spot for improvements as a lot of different models could be explored at the same time.
  3. Indeed, we are allowing a lot of flexibility because we are aware that not all models can be covered under the same umbrella, but of course this means that with some very customised models we don’t have a lot of information to report. I hope that the refactoring and formula syntax can also help in this regard, as the formula object already contains a lot of properties of the model. We’ll look into this.

Thanks, Alejandro!

  1. That’s cool! A starting formula seems a good approach, and even if it’s not there from the beginning, I’m happy to hear that your design is accommodating that.

  2. Parallelization is nice to have, but correctness comes first. I raised this point because there was some low-hanging fruit in the current projpred.

  3. This is indeed hard work for sometimes little visible benefit. We’ll join forces to get this done as well as possible.

Happy development!
