I am hoping to use projpred for variable selection on a phylogenetic multilevel model fit with brms. However, [I think] I’ve run up against some of the edges of projpred's current functionality, and am just wondering if there is a viable way forward for me at this point or if I should consider using a different means of identifying relevant predictors for now.
The reference model is for a Bernoulli outcome H and has the following brmsformula:
H ~ sex + longevity + nest + stratum + group_size + movement + centrality + (1|gr(species, cov=A))
I fit this 100 times (using 100 phylogenetic covariance matrices A), then combine the posterior samples using brms::combine_models(). At this point, I would like to identify relevant population terms (keeping the group intercepts in all candidate models). However, based on discussions here and here it seems that this capability isn’t yet implemented. I’m using projpred v2.0.4.9000, and I haven’t found anything in the documentation to suggest otherwise. I’ve been considering a couple of alternatives as potential workarounds, but (as described below) have either encountered additional errors or am not sure whether they represent viable/defensible approaches.
Alternative 1: Search using all terms
First, I would consider performing the forward search using all terms (including the group intercept), but I encounter the following error when calling cv_varsel(br_phylo, method="forward", cv_method="LOO") on the brmsfit object created by combine_models():
Error in model.frame.default(data = data, weights = weights, drop.unused.levels = TRUE, :
invalid type (list) for variable 'gr(species, cov = A)'
In addition: Warning message:
Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
Based on the call stack, this appears to be coming from a call to fit_glmer_callback(), and gets tripped at the start of the forward search. It seems to be related to the covariance specification in the group term. (As a side note, if I fit the model without specifying a phylogenetic covariance matrix, i.e. with group intercept term (1|species) and then call cv_varsel() without specifying search_terms, I get errors about pwrssUpdate() failing to converge.)
Alternative 2: Search population terms only, then re-fit
If I perform the variable selection on the full phylogenetic model, specifying only the population effects as search terms:
st <- c("1", "sex", "longevity", "nest", "stratum", "group_size", "movement", "centrality")
br_cvvs <- cv_varsel(br_phylo, method="forward", cv_method="LOO", search_terms=st)
Then the operation does complete. However, none of the submodels manages to achieve the same predictive performance as the reference model, which I assume is because none includes the group intercept term (whereas the reference model does):
Moreover, I imagine if I were to try to project to one of these submodels (e.g.,
project(br_cvvs, nterms=5)), the resulting projection would not take the group intercept term into account. Instead, it appears from the marginal posteriors of the resulting projection that more of the variance is simply assigned to the population effects. I’m wondering, then, if it is sensible to perform the search as indicated (i.e., specifying search_terms to include just the population terms), then to re-fit the model (including group intercept term) using the identified subset of variables. Thus, after examining the plots above and solution terms for br_cvvs, I’m inclined to select 5 terms, and re-fit the submodel with another (100) calls to brm() using the reduced formula.
@AlejandroCatalina is this a reasonable workaround, or do you have any other recommendations? I’d appreciate any guidance or insight you might have to offer. I apologize if I have misunderstood or overlooked anything in the documentation.
