Errors running cv_varsel , and a search terms question

Hi all,

I am a PhD student using brms to run mixed models to understand the effect of parental factors on the fertility of eggs. My model currently looks like this - understanding the effect ion maternal and paternal ages, ages as quadratic effects, inbreeding coefficients and interactions between age and inbreeding coefficient for each parent.

model_dev <- brm(Fertile ~ 
	#linear fixed effects   
                Pair.Inbreeding + 
	#quadratic fixed effects
                   I(Dam.age.years^2) + I(Sire.age.years^2) +
	#interactive fixed effects
                   Sire.Inbreeding*Sire.age.years + Dam.Inbreeding*Dam.age.years +
	#random effects
                   (1|Sire.Studbook.ID) + (1|Dam.Studbook.ID)+
                   (1|Lay.Year) + (1|Location),
                 data = data_dev_sc,
                 family = "bernoulli",
                 prior = priors_dev,
                 iter = 30000,
                 control=list(adapt_delta=0.99),
                 cores = 8,
                 sample_prior = TRUE,
                 seed = 13579)

Based on other research and with the help of other forum members, I have now successfully set my priors are as follows:

priors_dev <- c(set_prior("normal(0,5)", class = "b", coef = "Dam.age.years"),
            set_prior("normal(-5,5)", class = "b", coef = "Sire.age.years"),
            set_prior("normal(0,10)", class = "b", coef = "Sire.Inbreeding"),
            set_prior("normal(0,10)", class = "b", coef = "Dam.Inbreeding"),
            set_prior("normal(-40,20)", class = "b", coef = "Pair.Inbreeding"),
            set_prior("normal(0,5)", class = "b", coef = "IDam.age.yearsE2"),
            set_prior("normal(0.5,5)", class = "b", coef = ā€œISire.age.yearsE2"))

After running the model, I am wanting to do variable selection using projpred’s cv_varsel, but have been hitting the following error:

Error in updateGlmerDevfun(devfun, glmod$reTrms, nAGQ = nAGQ) : 
  nAGQ > 1 is only available for models with a single, scalar random-effects term

I am following the methods of a paper which uses ā€˜search_terms’ argument of cv_varsel to ensure that the fixed effects are explored before the random effects and are included in every model. As a result my search_terms currently look like this:

search_terms<-c(
"1",
"Pair.Inbreeding",
"Dam.age.years",
"Sire.age.years",
"Dam.Inbreeding",
"Sire.Inbreeding",
"Dam.age.years+Dam.Inbreeding+Dam.age.years:Dam.Inbreeding",
"Sire.age.years+Sire.Inbreeding+Sire.age.years:Sire.Inbreeding",
paste0(
"Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
"Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
"(1|Sire.Studbook.ID)"
),
paste0(
"Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
"Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
"(1|Dam.Studbook.ID)"
),
paste0(
"Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
"Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
"(1|Lay.Year)"
),
paste0(
"Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
"Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
"(1|Location)"
)
)

Is there a better way to do this? My aim is simply to prioritise the fixed terms over the random terms when the variable selection is happening.

My cv_varsel code is as follows:

cvvs_dev <- cv_varsel( 
  ref_dev,
  validate_search = TRUE,
### prioritise fixed effects first
  search_terms = search_terms,
  refit_prj=FALSE,
  nclusters_pred = 20,
  seed = 13579
)

Please see full code below to simulate equivalent data to recreate the error.

Many thanks!

library(tidyverse)
library(brms)
library(projpred)

set.seed(13579)
predictor_mean <- 0
predictor_sd <- .5
Pair.Inbreeding <- Dam.age.years <- Sire.age.years <- Dam.Inbreeding <- Sire.Inbreeding <- rnorm(539, mean = predictor_mean, sd = predictor_sd)
Fertile <- as.factor(sample(c("Y","N"), size = 539, replace = T))
Dam.Studbook.ID <- as.factor(sample(c(1:52), size=539, replace = T))
Sire.Studbook.ID <- as.factor(sample(c(53:104), size=539, replace = T))
Lay.Year <- as.factor(sample(c(1988:2019), size = 539, replace = T))
Location <- as.factor(sample(c(1:12), size = 539, replace = T))

data_dev_sc <- tibble(
    Fertile,
    Dam.age.years,
    Sire.age.years,
    Dam.Inbreeding,
    Sire.Inbreeding,
    Pair.Inbreeding,
    Sire.Studbook.ID,
    Dam.Studbook.ID,
    Lay.Year,
    Location
)

# Set egg_dev priors
priors_dev <- c(
    set_prior("normal(0,5)", class = "b", coef = "Dam.age.years"),
    set_prior("normal(-5,5)", class = "b", coef = "Sire.age.years"),
    set_prior("normal(0,10)", class = "b", coef = "Sire.Inbreeding"),
    set_prior("normal(0,10)", class = "b", coef = "Dam.Inbreeding"),
    set_prior("normal(-40,20)", class = "b", coef = "Pair.Inbreeding"),
    set_prior("normal(0,5)", class = "b", coef = "IDam.age.yearsE2"),
    set_prior("normal(0.5,5)", class = "b", coef = "ISire.age.yearsE2")
)

model_dev <- brm(Fertile ~ 
                   Pair.Inbreeding + 
                   I(Dam.age.years^2) + I(Sire.age.years^2) +
                   Sire.Inbreeding*Sire.age.years + Dam.Inbreeding*Dam.age.years +
                   (1|Sire.Studbook.ID) + (1|Dam.Studbook.ID)+
                   (1|Lay.Year) + (1|Location),
                 data = data_dev_sc,
                 family = "bernoulli",
                 prior = priors_dev,
                 iter = 30000,
                 control=list(adapt_delta=0.99),
                 cores = 8,
                 sample_prior = TRUE,
                 seed = 13579)
ref_dev <- brms:::get_refmodel.brmsfit(model_dev, seed = 13579)

search_terms <- c(
    "1",
    "Pair.Inbreeding",
    "Dam.age.years",
    "Sire.age.years",
    "Dam.Inbreeding",
    "Sire.Inbreeding",
    "Dam.age.years+Dam.Inbreeding+Dam.age.years:Dam.Inbreeding",
    "Sire.age.years+Sire.Inbreeding+Sire.age.years:Sire.Inbreeding",
    paste0(
        "Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
        "Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
        "(1|Sire.Studbook.ID)"
    ),
    paste0(
        "Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
        "Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
        "(1|Dam.Studbook.ID)"
    ),
    paste0(
        "Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
        "Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
        "(1|Lay.Year)"
    ),
    paste0(
        "Sire.age.years+I(Sire.age.years^2)+Dam.age.years+I(Dam.age.years^2)+Dam.age.years:Dam.Inbreeding+",
        "Dam.Inbreeding+Sire.Inbreeding+Pair.Inbreeding+Sire.age.years:Sire.Inbreeding+",
        "(1|Location)"
    )
)

cvvs_dev <- cv_varsel(
    ref_dev,
    ### Only for the sake of speed (not recommended in general):
    validate_search = TRUE,
    ### prioritise fixed effects first
    search_terms = search_terms,
    nclusters_pred = 20,
    ### nterms_max = 12,
    seed = 13579
)

The nAGQ error is from package lme4 which is used by projpred for performing the projection (namely by fitting an lme4 model to ā€œtheā€ fit of the reference model). It is caused by projpred encountering an lme4 convergence error and then trying to avoid that automatically by increasing argument nAGQ of lme4::glmer(). You could now try to play around with the lme4 tuning parameters (see lme4::glmer()) until the error disappears (without compromising the convergence of the estimation algorithm used by lme4). For that, I guess you would have to adapt projpred:::divmin() to replace the automatic nAGQ adjustment in these lines by something which works in your case. The adapted projpred:::divmin() function then needs to be passed to argument div_minimizer of init_refmodel() (via brms:::get_refmodel.brmsfit() in your case). However, instead of taking this tedious way directly, I think you should try the latent projection (https://mc-stan.org/projpred/articles/latent.html) first. The latent projection might be able to avoid the lme4 convergence error underlying the nAGQ error and in general, it is also able to avoid this lme4 issue. However, the latent projection is only an approximate solution.

Concerning your search_terms question:

Is there a better way to do this?

Unfortunately, there is no better way at the moment. For the future, we plan some enhancements (see also Fixing certain terms to be included in all submodels in cv_varsel in a high dimensional setting Ā· Issue #346 Ā· stan-dev/projpred Ā· GitHub), but we haven’t made it yet to implement these.

Some minor remarks:

  1. In cv_varsel(), you are using refit_prj=FALSE and nclusters_pred = 20. The nclusters_pred = 20 part is not necessary and can be omitted because you are using the default of nclusters = 20 for the search part. Apart from that, for final results, your nclusters / refit_prj / nclusters_pred setting seems quite rough. See also my answer at Projpred: Fixing Group Effects in Search Terms and Tips for Speed? - #2 by fweber144.

  2. In the brm() call, you have the predictor terms I(Dam.age.years^2) and I(Sire.age.years^2). These should now be supported correctly by projpred (recently, there have been some fixes for such terms), but in general (also outside of projpred), such special terms are always a bit more dangerous than pre-calculating the corresponding predictor data, appending it as new columns to the dataset, and then using these pre-calculated predictor columns as ā€œusualā€ predictor terms in the model formula.

1 Like

Thanks for this! Using the latent projection enabled cv_varsel to complete. However, when following the guide you kindly provided, I am hitting a new error when trying to plot the outcome of cv_varsel, as follows:

plot(cvvs_dev, stats = "mlpd", deltas = TRUE)

gives the error:

`latent_ilink` returned only `NA`s, so all performance statistics will also be `NA` as long as `resp_oscale = TRUE`.
`latent_ilink` returned only `NA`s, so all performance statistics will also be `NA` as long as `resp_oscale = TRUE`.
Error in out[["foldwise"]][, seq_len(nterms_max), drop = FALSE] : 
  subscript out of bounds

Thanks again for your continued support.

1 Like

Thank you for reporting this. Could you provide a reprex on projpred’s issue tracker, also mentioning which projpred version you used for creating cvvs_dev as well as for that plot() call?