Projpred: Fixing Group Effects in Search Terms and Tips for Speed?

Firstly, thanks so much for all of the work on BRMS, STAN, and projpred, these are all fantastic tools!!! And apologies in advance if my question is daft, I am still learning about these tools and I have likely overlooked something really trivial!

Main Issues

I am running a hierarchical model, and trying to perform variable selection, I hope to use projpred. cv_varsel is taking a very long time to run, and I am always running out of memory before jobs have finished (I am using my university’s HPC, and have a memory limit of 100GB). My main questions are:

  1. Am I asking too much of projpred? My dataset contains 30,000 observations, and I am looking at 10 potential explanatory variables.
  2. Is there anything I can do, for example specifying search_terms to help things run more efficiently.
  3. How do I ensure that certain variables are always retained throughout the search (see below for more details).

Details

I have a dataset of 30,000 households. Each household is nested within a village. Each village nested within a country, i.e:

y ~ 1 + (1|village) + (1|country)

Keeping this hierarchical structure is key. I now want to now include some explanatory variables, say education level, amount of land owned etc. I have about 10 variables that I am considering, I was hoping to use projpred for variable selection. I have fitted a reference model in brms:

model <- brm(
   formula= bf(y ~ (1|village) + (1|country) + x1 + x2  + x3... x10),
   ...
)

I then take this model, load it, and run projpred. At the moment, these are the only arguments I am using:

seed <- 123
ref_model <- projpred::get_refmodel(model)

cv_varsel_res <- cv_varsel(ref_model,
                          method = 'forward', 
                          cv_method = 'kfold', 
                          K = 5,
                          verbose = TRUE, 
                          seed = seed)

save(cv_varsel_res,file="..."))

Firstly, I want to make sure that group effects are considered at every step of the search: 1 + (1|village) + (1|country). Is there any way I can do this using search_terms.

Secondly, is there anything I might be missing that could help this run more efficiently?

Thanks so much in advance for looking at this! Please let me know if you have any follow up questions!

Hi Léo,

Your questions are definitely not daft :)

First of all, 30 000 observations are quite a lot—at least compared to the datasets I have been using projpred for so far. But that doesn’t mean such big datasets shouldn’t be supported by projpred. Are you using the most recent CRAN version of projpred (2.6.0)? Because it comes with a reduced peak memory usage in K-fold CV (see Changelog • projpred).

Do I get your question concerning the multilevel terms correctly, in that you want (1 | village) and (1 | country) to be included in all submodels? In other words, should these terms be forced to be selected first? If yes, then you can achieve this through argument search_terms, as you guessed. However, there is still an open issue Fixing certain terms to be included in all submodels in cv_varsel in a high dimensional setting · Issue #346 · stan-dev/projpred · GitHub that unfortunately we haven’t made to fix yet.

Secondly, is there anything I might be missing that could help this run more efficiently?

There are many ways to speed up projpred, but you need to be careful not to become too approximate. In general, with such speed-ups, you can quickly get some results, but they are only rough and should only be considered as preliminary results. Some speed-up possibilities are:

  1. You could try cv_method = "LOO" with validate_search = FALSE (this has comparable runtime to varsel(), but accounts for some overfitting, namely that induced by varsel()'s in-sample predictions during the predictive performance evaluation—the predictor ranking is the same as in varsel()). However, for multilevel models, the Pareto-\hat{k} values are often high, meaning that the PSIS-LOO CV (which is used by projpred in the cv_method = "LOO" case) might not be reliable.

  2. You could try reducing nterms_max, but then you need to check the predictive performance plot afterwards to ensure that you are not terminating the search too early.

  3. Arguments ndraws (default NULL), nclusters (default 20), ndraws_pred (default 400), and nclusters_pred (default NULL) impact the speed, so you could try to reduce nclusters below 20 and/or set nclusters_pred to some non-NULL (and smaller than 400) value (which will then cause ndraws_pred to be ignored). Note that if you want to set nclusters_pred as low as 20, you can instead set refit_prj to FALSE (which will then be even faster), see below.

  4. You could try to set argument refit_prj to FALSE. This basically means to set ndraws_pred = ndraws and nclusters_pred = nclusters, but in a more efficient (i.e., faster) way.

  5. (In general, L1 search would be a faster alternative to forward search, but you have a multilevel reference model, so L1 search is not supported in your case.)

Finally, just a minor remark: In your code, you are specifying projpred:: for the get_refmodel() call, but not for the cv_varsel() call. Does that mean you call library(projpred) beforehand?

3 Likes

Hi Frank,

Thanks so much for your response! I’ve always been so impressed seeing how active this community is!!! I really appreciate you taking the time to write such a helpful response!

No I wasn’t actually, I have now updated the package, let’s see how that effects things. Sounds like this would be a big help, and exactly what I need.

Yes that’s exactly right. I was looking at this post. The impression I got was that I would need to do something like this:

my_search_terms <- c(
    "1",
    "(1 | iso_country_code) + (1 | iso_country_code:village)",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + x1",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + x2",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + ...",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + x10"
)

cv_varsel_res <- cv_varsel(ref_model,
                          method = 'forward', 
                          cv_method = 'kfold', 
                          K = 5,
                          verbose = TRUE, 
                          seed = seed,
                          search_terms = my_search_terms
)

Is that correct, or have I misunderstood? This seems to line up with the issue you linked to?

Thanks so much for the suggestions, I will have play around with the different options! I will let you know how I get on.

Ah yes, I see the inconsistency! Yes I do call library(projpred) beforehand, I don’t know why I have presented it like this!

Thanks again,

Léo

1 Like

This is correct only if you want to terminate the search after the selection of the first x... term (which in turn takes place after the forced selection of (1 | iso_country_code) + (1 | iso_country_code:village)). If I understood your original question correctly, you want to continue the search after that first x... term. For that, you can use the code from #346 (I have just posted a fixed and slightly modified example there).

2 Likes

Hi Frank,

Thanks so much, I see now. I have modified your solution on GitHub very slightly to account for named variables, posting here just in case it is of use to any one in the future:

get_search_terms <- function(fixed_terms, other_predictors) {
  search_terms <- unlist(lapply(1:length(other_predictors), function(m_predictors) {
    lapply(combn(other_predictors, m = m_predictors, simplify = FALSE),
           function(idxs_predictors) {
             paste0(idxs_predictors, collapse = " + ")
           })
  }))
  search_terms <- c(fixed_terms, paste(fixed_terms, "+", search_terms))
  return(search_terms)
}

variables_to_search <- c("x1", "x2","some_named_variable"...)

fixed_variables <- "(1 | iso_country_code) + (1 | iso_country_code:village)"

search_terms <- get_search_terms(fixed_variables,variables_to_search)

cv_varsel(ref_model,
                          method = 'forward', 
                          cv_method = 'kfold', 
                          K = 5,
                          verbose = TRUE, 
                          seed = seed,
                          search_terms=search_terms) 

Thanks again for all the help. I will let you know how I get on!

2 Likes

Issue #346 has now been solved on GitHub (by the addition of a helper function called force_search_terms()).

3 Likes