Projpred: Fixing Group Effects in Search Terms and Tips for Speed?

Hi Léo,

Your questions are definitely not daft :)

First of all, 30 000 observations are quite a lot, at least compared to the datasets I have used projpred for so far. But that doesn’t mean projpred shouldn’t support such big datasets. Are you using the most recent CRAN version of projpred (2.6.0)? It comes with reduced peak memory usage in K-fold CV (see the projpred changelog).

Do I understand your question about the multilevel terms correctly, in that you want (1 | village) and (1 | country) to be included in all submodels? In other words, should these terms be forced to be selected first? If yes, then you can achieve this through argument search_terms, as you guessed. However, there is still an open issue ("Fixing certain terms to be included in all submodels in cv_varsel in a high dimensional setting", issue #346 in stan-dev/projpred on GitHub) that unfortunately we haven’t managed to fix yet.
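To illustrate the search_terms idea, here is a minimal sketch in R. The predictor names x1, x2, and x3 and the object refm are hypothetical placeholders, not taken from your code. The idea is that every element of search_terms contains the forced terms, so the search can only visit submodels that include them. If I recall correctly, projpred also exports force_search_terms(forced_terms, optional_terms), which builds such a vector for you; treat that as something to check in the documentation.

```r
# Terms that should be part of every submodel (from the question).
forced_terms <- c("(1 | village)", "(1 | country)")
# Hypothetical population-level predictors; replace with your own.
optional_terms <- c("x1", "x2", "x3")

# First element: the forced terms alone. Remaining elements: the forced
# terms plus exactly one optional term each.
forced <- paste(forced_terms, collapse = " + ")
search_terms <- c(forced, paste(forced, optional_terms, sep = " + "))
print(search_terms)

# Assumed usage (`refm` is a hypothetical reference model object created
# by get_refmodel()); not run here:
# cvvs <- projpred::cv_varsel(refm, search_terms = search_terms)
```

Keep in mind that the open issue mentioned above may still affect this approach in cv_varsel().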

Secondly, is there anything I might be missing that could help this run more efficiently?

There are many ways to speed up projpred, but you need to be careful not to become too approximate. In general, such speed-ups get you results quickly, but those results are rough and should be treated as preliminary. Some speed-up possibilities are:

  1. You could try cv_method = "LOO" with validate_search = FALSE. This has a runtime comparable to varsel(), but accounts for some overfitting, namely that induced by varsel()'s in-sample predictions during the predictive performance evaluation (the predictor ranking is the same as in varsel()). However, for multilevel models, the Pareto $\hat{k}$ values are often high, meaning that the PSIS-LOO CV (which is used by projpred in the cv_method = "LOO" case) might not be reliable.

  2. You could try reducing nterms_max, but then you need to check the predictive performance plot afterwards to ensure that you are not terminating the search too early.

  3. Arguments ndraws (default NULL), nclusters (default 20), ndraws_pred (default 400), and nclusters_pred (default NULL) impact the speed, so you could try to reduce nclusters below 20 and/or set nclusters_pred to some non-NULL (and smaller than 400) value (which will then cause ndraws_pred to be ignored). Note that if you want to set nclusters_pred as low as 20, you can instead set refit_prj to FALSE (which will then be even faster), see below.

  4. You could try setting argument refit_prj to FALSE. This is essentially equivalent to setting ndraws_pred = ndraws and nclusters_pred = nclusters, but in a more efficient (i.e., faster) way.

  5. (In general, L1 search would be a faster alternative to forward search, but you have a multilevel reference model, so L1 search is not supported in your case.)
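As a rough sketch of how items 1 to 4 could be combined in a single call: the helper name quick_cv_varsel and the values nterms_max = 10 and nclusters = 10 are illustrative assumptions, not recommendations, and refm is assumed to be the reference model object you created with get_refmodel().

```r
# Hypothetical helper for a quick, rough cv_varsel() run whose results
# should only be treated as preliminary.
# `refm` is assumed to be a reference model object from get_refmodel().
quick_cv_varsel <- function(refm, nterms_max = 10, nclusters = 10) {
  projpred::cv_varsel(
    refm,
    cv_method = "LOO",
    validate_search = FALSE,  # item 1: search is not repeated per fold
    nterms_max = nterms_max,  # item 2: stop the search earlier
    nclusters = nclusters,    # item 3: fewer clusters in the search
    refit_prj = FALSE         # item 4: reuse the search projections
  )
}

# Assumed usage (not run here):
# cvvs <- quick_cv_varsel(refm)
# plot(cvvs, stats = "elpd", deltas = TRUE)
```

With refit_prj = FALSE there is no need to set ndraws_pred or nclusters_pred. After such a run, check the Pareto $\hat{k}$ diagnostics and the predictive performance plot before trusting the results.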

Finally, just a minor remark: In your code, you are specifying projpred:: for the get_refmodel() call, but not for the cv_varsel() call. Does that mean you call library(projpred) beforehand?
