Hi Léo,
Your questions are definitely not daft :)
First of all, 30 000 observations are quite a lot, at least compared to the datasets I have been using projpred for so far. But that doesn’t mean such big datasets shouldn’t be supported by projpred. Are you using the most recent CRAN version of projpred (2.6.0)? It comes with a reduced peak memory usage in K-fold CV (see the projpred changelog).
Do I understand your question concerning the multilevel terms correctly, in that you want (1 | village) and (1 | country) to be included in all submodels? In other words, should these terms be forced to be selected first? If yes, then you can achieve this through argument search_terms, as you guessed. However, there is still an open issue ("Fixing certain terms to be included in all submodels in cv_varsel in a high dimensional setting", Issue #346 in stan-dev/projpred on GitHub) that unfortunately we haven’t managed to fix yet.
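As a rough illustration of what such a search_terms value could look like, here is a minimal base-R sketch. The optional predictor names x1, x2, and x3 are placeholders I made up, not taken from your model. I believe recent projpred versions also export a helper, force_search_terms(), that builds the same kind of vector, but please check the documentation of your installed version before relying on it.

```r
# Terms that should be part of every submodel (taken from the question).
forced <- c("(1 | village)", "(1 | country)")

# Hypothetical optional predictors; replace with the terms of your own model.
optional <- c("x1", "x2", "x3")

# First element: the forced terms alone. Remaining elements: the forced terms
# plus exactly one optional term each, so the search can only move through
# submodels that contain all forced terms.
forced_block <- paste(forced, collapse = " + ")
search_terms <- c(forced_block, paste(forced_block, optional, sep = " + "))

print(search_terms)

# The vector would then be passed on like this (not run here; `refm` is
# assumed to be the object returned by projpred::get_refmodel()):
# cvvs <- projpred::cv_varsel(refm, search_terms = search_terms)
```

Given the open issue mentioned above, I would verify on a small subset of the data that the resulting predictor ranking really starts with the forced terms.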
Secondly, is there anything I might be missing that could help this run more efficiently?
There are many ways to speed up projpred, but you need to be careful not to become too approximate. In general, with such speed-ups, you can quickly get some results, but they are only rough and should only be considered as preliminary results. Some speed-up possibilities are:
- You could try cv_method = "LOO" with validate_search = FALSE (this has comparable runtime to varsel(), but accounts for some overfitting, namely that induced by varsel()'s in-sample predictions during the predictive performance evaluation; the predictor ranking is the same as in varsel()). However, for multilevel models, the Pareto-$\hat{k}$ values are often high, meaning that the PSIS-LOO CV (which is used by projpred in the cv_method = "LOO" case) might not be reliable.
- You could try reducing nterms_max, but then you need to check the predictive performance plot afterwards to ensure that you are not terminating the search too early.
- Arguments ndraws (default NULL), nclusters (default 20), ndraws_pred (default 400), and nclusters_pred (default NULL) impact the speed, so you could try to reduce nclusters below 20 and/or set nclusters_pred to some non-NULL (and smaller than 400) value (which will then cause ndraws_pred to be ignored). Note that if you want to set nclusters_pred as low as 20, you can instead set refit_prj to FALSE (which will then be even faster), see the next point.
- You could try to set argument refit_prj to FALSE. This basically means to set ndraws_pred = ndraws and nclusters_pred = nclusters, but in a more efficient (i.e., faster) way.
- (In general, L1 search would be a faster alternative to forward search, but you have a multilevel reference model, so L1 search is not supported in your case.)
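To make the options above more concrete, here is a hedged sketch of a quick preliminary run that combines several of them. The object name refm (assumed to come from projpred::get_refmodel()) and the values nterms_max = 10 and nclusters = 10 are placeholders for illustration, not recommendations. The call is guarded so that the snippet does nothing if projpred or refm is not available.

```r
# Quick, rough run: PSIS-LOO CV without repeating the search in each fold,
# a shortened search, fewer clusters, and no refitting of the projections.
# `refm` is assumed to exist already, e.g. refm <- projpred::get_refmodel(fit).
if (requireNamespace("projpred", quietly = TRUE) && exists("refm")) {
  cvvs_fast <- projpred::cv_varsel(
    refm,
    cv_method = "LOO",
    validate_search = FALSE,
    nterms_max = 10,   # placeholder; must not cut the search off too early
    nclusters = 10,    # placeholder; below the default of 20
    refit_prj = FALSE  # reuse the search projections for the evaluation
  )
  # Predictive performance plot, to check that the curve has levelled off
  # before nterms_max is reached.
  plot(cvvs_fast, stats = "elpd")
}
```

Results from such a run should only be treated as preliminary. For a multilevel model I would also look at the Pareto-$\hat{k}$ warnings before trusting the LOO-based numbers.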
Finally, just a minor remark: In your code, you are specifying projpred:: for the get_refmodel() call, but not for the cv_varsel() call. Does that mean you call library(projpred) beforehand?