I am hoping to use
projpred for variable selection on a phylogenetic multilevel model fit with
brms. However, [I think] I’ve run up against some of the edges of
projpred's current functionality, and am just wondering if there is a viable way forward for me at this point or if I should consider using a different means of identifying relevant predictors for now.
The reference model is for a Bernoulli outcome
H and has the following
H ~ sex + longevity + nest + stratum + group_size + movement + centrality + (1|gr(species, cov=A))
I fit this 100 times (using 100 phylogenetic covariance matrices
A), then combine the posterior samples using
brms::combine_models(). At this point, I would like to identify relevant population terms (keeping the group intercepts in all candidate models). However, based on discussions here and here it seems that this capability isn’t yet implemented. I’m using
projpred v22.214.171.12400, and I haven’t found anything in the documentation to suggest otherwise. I’ve been considering a couple of alternatives as potential workarounds, but (as described below) have either encountered additional errors or am not sure whether they represent viable/defensible approaches.
Alternative 1: Search using all terms
First, I would consider performing the forward search using all terms (including the group intercept), but I encounter the following error when calling
cv_varsel(br_phylo, method="forward", cv_method="LOO") on the
brmsfit object created by
Error in model.frame.default(data = data, weights = weights, drop.unused.levels = TRUE, : invalid type (list) for variable 'gr(species, cov = A)' In addition: Warning message: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
Based on the call stack, this appears to be coming from a call to
fit_glmer_callback(), and gets tripped at the start of the forward search. It seems to be related to the covariance specification in the group term. (As a side note, if I fit the model without specifying a phylogenetic covariance matrix, i.e. with group intercept term
(1|species) and then call
cv_varsel() without specifying
search_terms, I get errors about
pwrssUpdate() failing to converge.)
Alternative 2: Search population terms only, then re-fit
If I perform the variable selection on the full phylogenetic model, specifying only the population effects as search terms:
st <- c("1", "sex", "longevity", "nest", "stratum", "group_size", "movement", "centrality") br_cvvs <- cv_varsel(br_phylo, method="forward", cv_method="LOO", search_terms=st)
Then the operation does complete. However, none of the submodels manages to achieve the same predictive performance as the reference model, which I assume is because none includes the group intercept term (whereas the reference model does):
Moreover, I imagine if I were to try to project to one of these submodels (e.g.,
project(br_cvvs, nterms=5)), the resulting projection would not take the group intercept term into account. Instead, it appears from the marginal posteriors of the resulting projection that more of the variance is simply assigned to the population effects. I’m wondering, then, if it is sensible to perform the search as indicated (i.e., specifying
search_termsto include just the population terms), then to re-fit the model (including group intercept term) using the identified subset of variables. Thus, after examining the plots above and solution terms for
br_cvvs, I’m inclined to select 5 terms, and re-fit the submodel with another (100) calls to
brm()using the reduced formula.
@AlejandroCatalina is this a reasonable workaround, or do you have any other recommendations? I’d appreciate any guidance or insight you might have to offer. I apologize if I have misunderstood or overlooked anything in the documentation.