Projection predictive feature selection for multilevel phylogenetic models

I am hoping to use projpred for variable selection on a phylogenetic multilevel model fit with brms. However, [I think] I’ve run up against some of the edges of projpred's current functionality, and am just wondering if there is a viable way forward for me at this point or if I should consider using a different means of identifying relevant predictors for now.

The reference model is for a Bernoulli outcome H and has the following brmsformula:

H ~ sex + longevity + nest + stratum + group_size + movement + centrality + (1|gr(species, cov=A))

I fit this 100 times (using 100 phylogenetic covariance matrices A), then combine the posterior samples using brms::combine_models(). At this point, I would like to identify relevant population terms (keeping the group intercepts in all candidate models). However, based on discussions here and here it seems that this capability isn’t yet implemented. I’m using projpred v2.0.4.9000, and I haven’t found anything in the documentation to suggest otherwise. I’ve been considering a couple of alternatives as potential workarounds, but (as described below) have either encountered additional errors or am not sure whether they represent viable/defensible approaches.

Alternative 1: Search using all terms

First, I would consider performing the forward search using all terms (including the group intercept), but I encounter the following error when calling cv_varsel(br_phylo, method="forward", cv_method="LOO") on the brmsfit object created by combine_models():

Error in model.frame.default(data = data, weights = weights, drop.unused.levels = TRUE,  : 
  invalid type (list) for variable 'gr(species, cov = A)'
In addition: Warning message:
Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.

Based on the call stack, this appears to be coming from a call to fit_glmer_callback(), and gets tripped at the start of the forward search. It seems to be related to the covariance specification in the group term. (As a side note, if I fit the model without specifying a phylogenetic covariance matrix, i.e. with group intercept term (1|species) and then call cv_varsel() without specifying search_terms, I get errors about pwrssUpdate() failing to converge.)

Alternative 2: Search population terms only, then re-fit

If I perform the variable selection on the full phylogenetic model, specifying only the population effects as search terms:

st <- c("1", "sex", "longevity", "nest", "stratum", "group_size", "movement", "centrality")
br_cvvs <- cv_varsel(br_phylo, method="forward", cv_method="LOO", search_terms=st)

Then the operation does complete. However, none of the submodels manages to achieve the same predictive performance as the reference model, which I assume is because none includes the group intercept term (whereas the reference model does):


Moreover, I imagine if I were to try to project to one of these submodels (e.g., project(br_cvvs, nterms=5)), the resulting projection would not take the group intercept term into account. Instead, it appears from the marginal posteriors of the resulting projection that more of the variance is simply assigned to the population effects. I’m wondering, then, if it is sensible to perform the search as indicated (i.e., specifying search_terms to include just the population terms), then to re-fit the model (including group intercept term) using the identified subset of variables. Thus, after examining the plots above and solution terms for br_cvvs, I’m inclined to select 5 terms, and re-fit the submodel with another (100) calls to brm() using the reduced formula.

@AlejandroCatalina is this a reasonable workaround, or do you have any other recommendations? I’d appreciate any guidance or insight you might have to offer. I apologize if I have misunderstood or overlooked anything in the documentation.

1 Like

Hi!

First of all, thanks for using projpred!

As you say, what happens if you only project the population coefficients is that there is a inherent mismatch between projections and reference model so it’s very unlikely that the projections achieve similar predictive performance, assuming the multilevel structure has an impact in this model, which it appears to.

The problem when solving the projection for your model specification is the custom argument to your multilevel terms, namely (1 | gr(species, cov = A)). We don’t support custom covariances for the moment but you can fit (1 | species) without problems in the current projpred, which might be a good indicator of its performance. You don’t need to change your model fit, you can simply add "(1 | species)" to your search_terms list and it should run. But let me know if it doesn’t!

Is this helpful?

3 Likes

Hello—and thanks for your reply!

That makes sense, and is very helpful. Unfortunately, it doesn’t seem work in my case, as I now run into the following error:

Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev,  : 
  pwrssUpdate did not converge in (maxit) iterations
In addition: Warning message:
Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.

This is thrown at 0% in the forward search.

I am afraid this error happens inside lme4 and it can be due to many possible reasons. For one, Bernoulli models are known to be a bit unstable when it comes to the projection. Are you trying varsel or cv_varsel now? varsel may be easier to fit at first as it works with the full data and is quicker. If you are willing to share these data I can take a look personally and see where this happens. I’m sure it probably happens when trying to fit the group intercept-only model. Another possibility is to modify search_terms so that 1 | species is projected only after all other population coefficients are projected.

That’s very kind—but please only take a look if it is also helpful for you/projpred! The brmsfit is too large to attach here, but I’ve uploaded the data and one of the covariance matrices. I’m getting the same errors using a single matrix as when running the full 100 and combining the posteriors.
A.r (1.6 KB) br_data.r (7.4 KB)

To fit the model and run variable selection:

library(brms)
load("br_data.r"); load("A.r")

m_priors <- prior(student_t(6,0,1.5),b) + prior(normal(0,0.5),sd)

m_phylo <- brm( H ~ sex + longevity + nest + stratum + group_size + movement + centrality + (1|gr(species, cov=A)), 
                data=br_data, data2=list(A=A), family=bernoulli(), prior=m_priors, 
                chains=4, iter=2e3, warmup=1e3, cores=4, seed=1138 )

st <- c("1", "sex", "longevity", "nest", "stratum", "group_size", "movement", "centrality", "(1 | species)")

vs <- varsel(m_phylo, method="forward", search_terms=st)
cvvs <- cv_varsel(m_phylo, method="forward", cv_method="LOO")

I have mostly been using cv_varsel(), which is where I get the lme4 error mentioned above. However, varsel() throws the same pwrssUpdate() error (at 60% of terms selected).

Okay, thanks!

Best,
Alejandro

Sorry to open this up again—but how would I go about ensuring that the group intercept is projected only after the population coefficients? I’ve tried re-arranging the order in which search_terms is specified, but that doesn’t seem to have an effect (regardless of the placement of (1 | species) and 1, varsel failes at 60% when all terms are included).

And, I’m now getting a different error from varsel. Rather than pwrssUpdate failing to converge, the procedure fails with

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

I assume I must be doing something differently, but I’m not sure what it is.

So, to alter the order in which a term is projected you have to make sure that all other terms need to go into the projection before the term you want to delay. You achieve that by setting partially correct “subformulas” as elements in search_terms. When you write search_terms <- c("1", "x"), what you are implicitly saying is search_terms <- c("1", "1 + x"). So you require the intercept to be added before x can be considered. If you want to do the same with another term you’d write search_terms <- c("1", "x", "x + z"), so that x needs to go before z.

In your case, this translates to search_terms <- c("1", all_population_terms, ..., paste0(paste(all_population_terms, collapse = " + "), " + (1 | species)")).

Did I explain myself? I hope to have time soon to check your issue, sorry for the delay!

1 Like

No need to apologize! Thanks very much for looking into it.
And yes, I think I understand now how to modify the search_terms argument.

1 Like