Projpred: Fixing Group Effects in Search Terms and Tips for Speed?

Firstly, thanks so much for all of the work on brms, Stan, and projpred; these are all fantastic tools!!! And apologies in advance if my question is daft: I am still learning these tools and have likely overlooked something really trivial!

Main Issues

I am running a hierarchical model and want to perform variable selection, for which I hope to use projpred. cv_varsel() is taking a very long time to run, and I keep running out of memory before jobs finish (I am using my university's HPC, with a memory limit of 100 GB). My main questions are:

  1. Am I asking too much of projpred? My dataset contains 30,000 observations, and I am looking at 10 potential explanatory variables.
  2. Is there anything I can do, for example specifying search_terms, to help things run more efficiently?
  3. How do I ensure that certain variables are always retained throughout the search (see below for more details).

Details

I have a dataset of 30,000 households. Each household is nested within a village, and each village is nested within a country, i.e.:

y ~ 1 + (1|village) + (1|country)

Keeping this hierarchical structure is key. I now want to include some explanatory variables, say education level, amount of land owned, etc. I have about 10 variables under consideration and was hoping to use projpred for variable selection. I have fitted a reference model in brms:

model <- brm(
   formula = bf(y ~ (1 | village) + (1 | country) + x1 + x2 + x3 + ... + x10),
   ...
)

I then take this model, load it, and run projpred. At the moment, these are the only arguments I am using:

seed <- 123
ref_model <- projpred::get_refmodel(model)

cv_varsel_res <- cv_varsel(ref_model,
                          method = 'forward', 
                          cv_method = 'kfold', 
                          K = 5,
                          verbose = TRUE, 
                          seed = seed)

save(cv_varsel_res, file = "...")

Firstly, I want to make sure that the group-level effects are considered at every step of the search: 1 + (1|village) + (1|country). Is there any way I can do this using search_terms?

Secondly, is there anything I might be missing that could help this run more efficiently?

Thanks so much in advance for looking at this! Please let me know if you have any follow up questions!

Hi Léo,

Your questions are definitely not daft :)

First of all, 30,000 observations are quite a lot, at least compared to the datasets I have been using projpred for so far. But that doesn't mean such big datasets shouldn't be supported by projpred. Are you using the most recent CRAN version of projpred (2.6.0)? It comes with a reduced peak memory usage in K-fold CV (see the projpred changelog).

Do I understand your question concerning the multilevel terms correctly, in that you want (1 | village) and (1 | country) to be included in all submodels? In other words, should these terms be forced to be selected first? If yes, then you can achieve this through argument search_terms, as you guessed. However, there is still an open GitHub issue about this ("Fixing certain terms to be included in all submodels in cv_varsel in a high dimensional setting", stan-dev/projpred#346) that unfortunately we haven't managed to fix yet.

Secondly, is there anything I might be missing that could help this run more efficiently?

There are many ways to speed up projpred, but you need to be careful not to become too approximate. In general, with such speed-ups, you can quickly get some results, but they are only rough and should only be considered as preliminary results. Some speed-up possibilities are:

  1. You could try cv_method = "LOO" with validate_search = FALSE. This has a runtime comparable to varsel(), but accounts for some overfitting, namely that induced by varsel()'s in-sample predictions during the predictive performance evaluation (the predictor ranking is the same as in varsel()). However, for multilevel models, the Pareto-$\hat{k}$ values are often high, meaning that the PSIS-LOO CV (which projpred uses in the cv_method = "LOO" case) might not be reliable.

  2. You could try reducing nterms_max, but then you need to check the predictive performance plot afterwards to ensure that you are not terminating the search too early.

  3. Arguments ndraws (default NULL), nclusters (default 20), ndraws_pred (default 400), and nclusters_pred (default NULL) impact the speed, so you could try to reduce nclusters below 20 and/or set nclusters_pred to some non-NULL (and smaller than 400) value (which will then cause ndraws_pred to be ignored). Note that if you want to set nclusters_pred as low as 20, you can instead set refit_prj to FALSE (which will then be even faster), see below.

  4. You could try to set argument refit_prj to FALSE. This basically means to set ndraws_pred = ndraws and nclusters_pred = nclusters, but in a more efficient (i.e., faster) way.

  5. (In general, L1 search would be a faster alternative to forward search, but you have a multilevel reference model, so L1 search is not supported in your case.)
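For concreteness, a rough preliminary run combining several of these speed-ups might look like the following sketch (the object name and argument values are illustrative, not recommendations):

```r
library(projpred)

# Quick-and-dirty preliminary run: PSIS-LOO CV without cross-validating
# the search itself, a capped search length, heavier clustering of the
# posterior draws, and no re-projection for the performance evaluation.
cv_varsel_quick <- cv_varsel(
  ref_model,
  method = "forward",
  cv_method = "LOO",        # instead of K-fold CV
  validate_search = FALSE,  # search is run only once (faster, but optimistic)
  nterms_max = 5,           # check the performance plot before trusting this cap
  nclusters = 10,           # cluster the draws more aggressively (default: 20)
  refit_prj = FALSE,        # reuse the search's projections for evaluation
  seed = 123
)

plot(cv_varsel_quick)       # inspect before drawing any conclusions
```

Again, treat the results of such a run only as preliminary; once a candidate submodel size emerges, it is worth re-running with the more expensive settings.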

Finally, just a minor remark: In your code, you are specifying projpred:: for the get_refmodel() call, but not for the cv_varsel() call. Does that mean you call library(projpred) beforehand?


Hi Frank,

Thanks so much for your response! I’ve always been so impressed seeing how active this community is!!! I really appreciate you taking the time to write such a helpful response!

No, I wasn't actually; I have now updated the package, so let's see how that affects things. Sounds like this would be a big help, and exactly what I need.

Yes, that's exactly right. I was looking at this post. The impression I got was that I would need to do something like this:

my_search_terms <- c(
    "1",
    "(1 | iso_country_code) + (1 | iso_country_code:village)",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + x1",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + x2",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + ...",
    "(1 | iso_country_code) + (1 | iso_country_code:village) + x10"
)

cv_varsel_res <- cv_varsel(ref_model,
                          method = 'forward', 
                          cv_method = 'kfold', 
                          K = 5,
                          verbose = TRUE, 
                          seed = seed,
                          search_terms = my_search_terms
)

Is that correct, or have I misunderstood? This seems to line up with the issue you linked to?

Thanks so much for the suggestions, I will have a play around with the different options! I will let you know how I get on.

Ah yes, I see the inconsistency! Yes, I do call library(projpred) beforehand; I don't know why I presented it like this!

Thanks again,

Léo


This is correct only if you want to terminate the search after the selection of the first x... term (which in turn takes place after the forced selection of (1 | iso_country_code) + (1 | iso_country_code:village)). If I understood your original question correctly, you want to continue the search after that first x... term. For that, you can use the code from #346 (I have just posted a fixed and slightly modified example there).


Hi Frank,

Thanks so much, I see now. I have modified your solution on GitHub very slightly to account for named variables; posting it here just in case it is of use to anyone in the future:

get_search_terms <- function(fixed_terms, other_predictors) {
  # For every subset size m, build all combinations of the candidate
  # predictors and collapse each combination into one "+"-separated term.
  search_terms <- unlist(lapply(seq_along(other_predictors), function(m) {
    lapply(combn(other_predictors, m = m, simplify = FALSE),
           function(combo) {
             paste0(combo, collapse = " + ")
           })
  }))
  # Prepend the forced (fixed) terms to every combination, and also allow
  # the fixed terms on their own.
  search_terms <- c(fixed_terms, paste(fixed_terms, "+", search_terms))
  return(search_terms)
}

variables_to_search <- c("x1", "x2","some_named_variable"...)

fixed_variables <- "(1 | iso_country_code) + (1 | iso_country_code:village)"

search_terms <- get_search_terms(fixed_variables, variables_to_search)

cv_varsel_res <- cv_varsel(ref_model,
                           method = 'forward',
                           cv_method = 'kfold',
                           K = 5,
                           verbose = TRUE,
                           seed = seed,
                           search_terms = search_terms)

Thanks again for all the help. I will let you know how I get on!


Issue #346 has now been solved on GitHub (by the addition of a helper function called force_search_terms()).
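For readers landing here later: with that helper, a manual workaround like the get_search_terms() function above should no longer be necessary. A sketch of its usage (variable names are illustrative; see ?force_search_terms for the exact signature):

```r
library(projpred)

# Force the group-level terms into every submodel; the x terms remain
# optional candidates for the forward search.
my_search_terms <- force_search_terms(
  forced_terms = c("(1 | iso_country_code)", "(1 | iso_country_code:village)"),
  optional_terms = paste0("x", 1:10)
)

cv_varsel_res <- cv_varsel(ref_model,
                           method = "forward",
                           cv_method = "kfold",
                           K = 5,
                           search_terms = my_search_terms)
```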


Hi @fweber144 ,

Related to this thread’s topic, in what instances does it make sense to either include the varying effects as forced search terms at the beginning of the search or “integrate them out” using projpred.mlvl_pred_new (and potentially projpred.mlvl_proj_ref_new)?

From my (albeit limited) understanding, if one is including varying effects (e.g. (1 | group)) in the reference model in order to account for some known structure/clustering/correlation within the data, then one would choose the following depending on the prediction task:

  1. You wish to predict on future data that will have the same (or some of the same) levels of (1 | group), and therefore forcing the search through (1 | group) will ensure its influence on the other predictors is taken into account.

  2. You wish to predict on future data that will always have new levels of (1 | group) (or not have any of the structure that was in the training data to begin with, e.g. experimental batches) and thus “integrating-out” the varying intercepts using projpred.mlvl_pred_new makes sense.

I guess the question really boils down to: if you have varying effects in your reference model, why would you ever not want to force their inclusion in the search path?

Hi @rtnliqry ,

The two approaches you mention in your first question:

  1. forcing the group-level terms to be selected at the beginning of the search
  2. integrating out group-level effects using projpred.mlvl_pred_new (and potentially projpred.mlvl_proj_ref_new)

are not mutually exclusive, so I’ll treat them separately.

The first approach makes sense if you want to ensure that the group-level effects are included in all of the submodels. You would do this by using custom search_terms. However, just like other (e.g., population-level) predictor terms, group-level terms may or may not be relevant (here, “relevant” means “relevant in terms of improving predictive performance sufficiently”). So the default is not to force group-level terms to be selected first, but instead to treat them like other predictor terms. If you want to deviate from this default, you would need a justification (at least that’s how I see it).

Whether the second approach makes sense depends on your prediction task (and—for projpred.mlvl_proj_ref_new—whether you want the submodels to fit to the best possible reference-model predictions or not). Given your detailed question, I guess you have already read the main projpred help page ("Projection predictive feature selection", ?`projpred-package`) as a reference for the second approach, but I mention it here for other readers nonetheless. In particular, the following part from that page may be helpful for your question:

Setting projpred.mlvl_pred_new to TRUE makes sense, e.g., when the prediction task is such that any group level will be treated as a new one. Typically, setting projpred.mlvl_proj_ref_new to TRUE only makes sense when projpred.mlvl_pred_new is already set to TRUE. In that case, the default of FALSE for projpred.mlvl_proj_ref_new ensures that at projection time, the submodels fit to the best possible fitted values from the reference model, and setting projpred.mlvl_proj_ref_new to TRUE would make sense if the group-level effects should be integrated out completely.

Note that integrating out group-level effects using projpred.mlvl_pred_new (and potentially projpred.mlvl_proj_ref_new) does not mean that the group-level terms are excluded from the search. In principle, group-level terms could still be found among the most relevant predictor terms, even when setting projpred.mlvl_pred_new = TRUE (and potentially projpred.mlvl_proj_ref_new = TRUE). However, I guess this will be rarely the case.
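In code, these are global R options, so a minimal sketch of the second approach would be (option values chosen for illustration; see the projpred package documentation for details):

```r
# Treat all group levels as new at prediction time, i.e., integrate the
# group-level effects out of the reference model's predictions:
options(projpred.mlvl_pred_new = TRUE)

# Optionally also integrate them out of the reference model's fitted
# values that the submodels are projected onto:
options(projpred.mlvl_proj_ref_new = TRUE)

# Subsequent projpred calls such as cv_varsel(ref_model, ...) will then
# pick up these options.
```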

In your statement

I am not sure whether by “forcing the search through (1 | group)” you mean to force (1 | group) to be selected first (by using custom search_terms), so taking approach 1 from above. If you mean it this way, I would say that your statement is not correct. If your future data will have the same (or some of the same) levels of group, then the “standard” (default) approach would still be to treat associated group-level terms (such as (1 | group)) like other predictor terms, so (1 | group) should not be forced to be selected first. However, if your future data will have the same (or some of the same) levels of group, then it usually does not make sense to integrate out associated group-level terms by setting projpred.mlvl_pred_new = TRUE (or projpred.mlvl_proj_ref_new = TRUE).

In your final question

I am not sure whether by “force their inclusion in the search path” you mean “treat them as candidate predictors” or “force their inclusion at a specific position in the search path”:

  1. If it is “treat them as candidate predictors”, then just like you, I currently cannot imagine a situation where you would want to use custom search_terms which exclude a group-level term. However, there may be situations where this makes sense—I just cannot think of them at the moment. In any case, you would need to justify custom search_terms, so you would also need to justify custom search_terms which exclude a group-level term from the search.
  2. If it is “force their inclusion at a specific position in the search path“, I would say that the justification needs to be the other way round: The “standard” (default) behavior should be to treat group-level terms like other predictor terms and hence include them in the search (i.e., treat them as candidate predictors), but not at a specific position. For any deviation from this default behavior (e.g., for forcing a group-level term to be selected first by using custom search_terms), you would need a justification.
2 Likes