Errors/warnings with loo::cv_varsel() on large brms fit in R

I am encountering some interesting errors and warnings when running loo::cv_varsel() on a large brms fit. To start, my brms call looks like:

> URAM.brm1 <- brm(URAM ~ MOTOR.avg.sc + NOHIKER.avg.sc + # rec. disturbance
+                    CR_CLOSURE.sc + PROJ_AGE_1.sc + NDVI.sc +  # forest structure       
+                    ag_DIST.sc + # proximity to resources
+                    ODHE + # prey species
+                    linear_DIST.sc +  # linear features
+                    SLOPE.sc +  # land structure
+                    Height.sc + Dist_T.sc + # camera set vars
+                    #season + # seasonal variation
+                    (1|stat), # station as a random effect 
+                  family=bernoulli,
+                  control=list(max_treedepth=15),
+                  iter=100000,
+                  chains=4,
+                  warmup=50000,
+                  cores=4,
+                  data=dat)

…where URAM is a binary (Bernoulli) response, all explanatory variables are numeric, scaled, and centered, and there is one random effect (stat). Once this has finished running, it emits the first warning:

Compiling Stan program...
Start sampling
Warning message:
In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
  'C:/PROGRA~1/Git/usr/mingw_/bin/g++' not found

but it appears to have converged, and there are no tell-tale signs of fit issues from diagnostic plots.
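For what it's worth, here is a quick way to sanity-check the toolchain warning (this assumes the pkgbuild package is installed):

# If this returns TRUE, a working C++ toolchain was found and the
# g++ warning above is likely cosmetic (assumes pkgbuild is installed).
pkgbuild::has_build_tools(debug = TRUE)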

> URAM.brm1
 Family: bernoulli 
  Links: mu = logit 
Formula: URAM ~ MOTOR.avg.sc + NOHIKER.avg.sc + CR_CLOSURE.sc + PROJ_AGE_1.sc + NDVI.sc + ag_DIST.sc + ODHE + linear_DIST.sc + SLOPE.sc + Height.sc + Dist_T.sc + (1 | stat) 
   Data: dat (Number of observations: 3533) 
Samples: 4 chains, each with iter = 1e+05; warmup = 50000; thin = 1;
         total post-warmup samples = 2e+05

Group-Level Effects: 
~stat (Number of levels: 58) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.08      0.19     0.76     1.51 1.00    65113   103002

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -3.46      0.20    -3.88    -3.10 1.00   117911   125988
MOTOR.avg.sc      -0.26      0.14    -0.58    -0.03 1.00   238673   131665
NOHIKER.avg.sc    -0.07      0.11    -0.34     0.09 1.00   187252   121781
CR_CLOSURE.sc      0.03      0.19    -0.36     0.40 1.00   112977   126476
PROJ_AGE_1.sc      0.09      0.21    -0.33     0.51 1.00   114679   128449
NDVI.sc            0.17      0.09    -0.00     0.36 1.00   306417   143086
ag_DIST.sc        -0.61      0.25    -1.13    -0.12 1.00   108087   116678
ODHE               0.52      0.20     0.11     0.91 1.00   268955   152079
linear_DIST.sc    -0.31      0.25    -0.83     0.16 1.00   140013   130325
SLOPE.sc          -0.38      0.20    -0.79     0.01 1.00   115506   133349
Height.sc          0.20      0.16    -0.12     0.51 1.00   124956   133975
Dist_T.sc         -0.20      0.19    -0.58     0.15 1.00   118030   125127

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

You’ll notice that some variables (e.g. NDVI.sc and SLOPE.sc) are nearly "significant" by a 95% CI test (i.e., an interval that excludes zero), so the goal is to use cv_varsel() to discern whether any uninformative parameters can be removed from the model (which might in turn render the more informative parameters significant). I call cv_varsel() as such:

> URAM.1_cv <- cv_varsel(URAM.brm1, cv_method='LOO', cores=4)

The model appears to work initially…

1] "Computing LOOs..."
  |                                                                           |   0%

but then begins to print the same warning repeatedly…

boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular

…this is printed maybe 50-100 times before showing…

  |=                                                                          |   1%

…then immediately begins printing another 50-100 instances of…

boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular

before ultimately culminating with:

Error in pwrssUpdate(pp, resp, tol = tolPwrss, GQmat = GQmat, compDev = compDev,  : 
  pwrssUpdate did not converge in (maxit) iterations
In addition: Warning messages:
1: In cv_varsel.refmodel(refmodel, ...) :
  K provided, but cv_method is LOO.
2: Quick-TRANSfer stage steps exceeded maximum (= 10000000) 
3: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

I am unsure whether my issue lies with the initial brms fit or with the cv_varsel() call, so I would greatly appreciate some input if anyone else has experienced similar issues. Perhaps the Bayesian model is "overfit": in lme4, the repeated "boundary (singular) fit" message usually accompanies overfit models, but I am fairly new to brms, so I am unsure whether the same interpretation applies here.
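For reference, my understanding (an assumption on my part) is that projpred fits the projected submodels with lme4 when the formula contains (1 | stat), so the message can be reproduced outside of projpred with a minimal sketch like:

# Minimal check of whether a plain lme4 fit with this random intercept is
# singular; this assumes the singular-fit messages originate from lme4,
# which projpred uses for submodels containing group-level terms.
library(lme4)
m.check <- glmer(URAM ~ MOTOR.avg.sc + (1 | stat),
                 family = binomial, data = dat)
isSingular(m.check)  # TRUE if the random-intercept variance collapsed to zero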

tagging @avehtari & @andrewgelman given their expertise in loo & cv_varsel()

I think there’s an error here, as cv_varsel() is in the projpred package, not loo.

Which projpred version are you using? As you have the term (1 | stat), you need a quite recent version, and even then I’m not certain whether hierarchical components in brms are supported yet. @AlejandroCatalina?

Yes, these are supported and work right away.

But, as @xprockox is saying, the model seems to be a bit problematic. I would first run varsel on this model, because it’s faster and gives you a first intuition. If that works (I’m not sure it will, because the pwrssUpdate error may already happen when fitting the projections to the full data), you can proceed to diagnose cv_varsel in more detail. My guess is that the projection becomes ill-defined when adding the random intercept. If that’s the case, we can try a couple of workarounds to avoid entering the random intercept too early.

Ah, my mistake re: loo vs. projpred. I’m getting my packages mixed up.

To clarify, I am using projpred v2.0.2 with R 4.0.3, and am not actually calling cv_varsel() with the loo:: prefix; I’m simply calling cv_varsel(fit), so my misunderstanding did not interfere with the call.
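For completeness, this is how I checked the versions:

packageVersion("projpred")  # 2.0.2
R.version.string            # R version 4.0.3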

I have attempted to use varsel() instead and received similar singular fit warnings:

> URAM.varsel <- varsel(URAM.brm1, cores=4)
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
[1] "10% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "20% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "30% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "40% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "50% of terms selected."
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
[1] "60% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "70% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "80% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "90% of terms selected."
boundary (singular) fit: see ?isSingular
[1] "100% of terms selected."
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
boundary (singular) fit: see ?isSingular
Warning message:
In varsel.refmodel(refmodel, ...) :
  The performance of the projected model seems to be misleading, we recommend checking the reference model as well as running `varsel` with a larger `ndraws_pred` or `cv_varsel` for a more robust estimation.

However, the object (URAM.varsel) was actually created this time, which I suppose is an improvement. Nevertheless, I fear the results may still not be all that useful given there may be issues with model fit.
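Per the warning text, one retry I have not yet run would be a larger ndraws_pred; a sketch with an arbitrary value:

# Untested retry suggested by the warning; 800 is an arbitrary
# illustrative value (ndraws_pred is a varsel() argument in projpred 2.x).
URAM.varsel.b <- varsel(URAM.brm1, ndraws_pred = 800)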
Results here:

> URAM.varsel
   size     elpd elpd.se
2     0  -837.28   38.98
3     1  -796.16   37.92
4     2  -785.96   37.74
5     3  -777.62   37.34
6     4  -772.41   37.03
7     5  -769.55   37.05
8     6  -767.33   37.02
9     7  -765.61   36.89
10    8  -764.78   36.92
11    9  -764.06   36.92
12   10  -763.62   36.91
13   11  -763.60   36.90
14   12 -8109.82  523.23

Given that the last term selected (row 14 in the table, i.e. model size 12) is the random effect, it seems that parameter might be the root of a lot of my issues. The random effect was included to account for nonindependence of sampling weeks within sites (each row of the dataframe represents the data collected at a specific site during a specific week). But perhaps it is not necessary to include it?

Yeah, that’s what I guessed. It seems the random intercept is not properly identifiable in this model. How does the model perform without the random intercept?
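For example, a quick PSIS-LOO comparison would show whether the random intercept buys any predictive performance; a sketch using standard brms tools (I’m assuming here that the refit without the intercept is named URAM.brm1.1):

# Compare out-of-sample predictive performance of the two fits
# (the name URAM.brm1.1 for the fit without (1 | stat) is an assumption).
URAM.brm1   <- add_criterion(URAM.brm1, "loo")
URAM.brm1.1 <- add_criterion(URAM.brm1.1, "loo")
loo_compare(URAM.brm1, URAM.brm1.1, criterion = "loo")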

It appears that the two models are comparable. The first model’s output (with the random intercept):

> URAM.brm1
 Family: bernoulli 
  Links: mu = logit 
Formula: URAM ~ MOTOR.avg.sc + NOHIKER.avg.sc + CR_CLOSURE.sc + PROJ_AGE_1.sc + NDVI.sc + ag_DIST.sc + ODHE + linear_DIST.sc + SLOPE.sc + Height.sc + Dist_T.sc + (1 | stat) 
   Data: dat (Number of observations: 3533) 
Samples: 4 chains, each with iter = 1e+05; warmup = 50000; thin = 1;
         total post-warmup samples = 2e+05

Group-Level Effects: 
~stat (Number of levels: 58) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.08      0.19     0.76     1.51 1.00    65113   103002

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -3.46      0.20    -3.88    -3.10 1.00   117911   125988
MOTOR.avg.sc      -0.26      0.14    -0.58    -0.03 1.00   238673   131665
NOHIKER.avg.sc    -0.07      0.11    -0.34     0.09 1.00   187252   121781
CR_CLOSURE.sc      0.03      0.19    -0.36     0.40 1.00   112977   126476
PROJ_AGE_1.sc      0.09      0.21    -0.33     0.51 1.00   114679   128449
NDVI.sc            0.17      0.09    -0.00     0.36 1.00   306417   143086
ag_DIST.sc        -0.61      0.25    -1.13    -0.12 1.00   108087   116678
ODHE               0.52      0.20     0.11     0.91 1.00   268955   152079
linear_DIST.sc    -0.31      0.25    -0.83     0.16 1.00   140013   130325
SLOPE.sc          -0.38      0.20    -0.79     0.01 1.00   115506   133349
Height.sc          0.20      0.16    -0.12     0.51 1.00   124956   133975
Dist_T.sc         -0.20      0.19    -0.58     0.15 1.00   118030   125127

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

vs. the same model without the random intercept:

> URAM.brm1.1
 Family: bernoulli 
  Links: mu = logit 
Formula: URAM ~ MOTOR.avg.sc + NOHIKER.avg.sc + CR_CLOSURE.sc + PROJ_AGE_1.sc + NDVI.sc + ag_DIST.sc + ODHE + linear_DIST.sc + SLOPE.sc + Height.sc + Dist_T.sc 
   Data: dat (Number of observations: 3533) 
Samples: 4 chains, each with iter = 1e+05; warmup = 50000; thin = 1;
         total post-warmup samples = 2e+05

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         -3.20      0.11    -3.41    -3.00 1.00   161784   139136
MOTOR.avg.sc      -0.22      0.13    -0.50    -0.02 1.00   193543   116810
NOHIKER.avg.sc    -0.15      0.12    -0.41     0.04 1.00   196284   125506
CR_CLOSURE.sc      0.08      0.08    -0.07     0.24 1.00   210210   155471
PROJ_AGE_1.sc      0.13      0.10    -0.06     0.31 1.00   188402   159673
NDVI.sc            0.15      0.09    -0.01     0.32 1.00   232269   141160
ag_DIST.sc        -0.51      0.12    -0.74    -0.28 1.00   169817   146003
ODHE               0.86      0.17     0.52     1.20 1.00   219632   153498
linear_DIST.sc    -0.46      0.20    -0.88    -0.11 1.00   182250   131272
SLOPE.sc          -0.37      0.09    -0.55    -0.19 1.00   215924   149326
Height.sc          0.23      0.07     0.09     0.37 1.00   227267   155837
Dist_T.sc         -0.01      0.07    -0.16     0.13 1.00   219525   152007

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

This is okay, but looking at the areas plot, the random effect appears to be one of the most "significant" parameters in the model by the 95% CI test. So instead of excluding (1|stat) entirely, I’m looking into the varsel() and cv_varsel() functions, and it appears there’s an optional argument, search_terms. Could I create a list of all parameters except (1|stat) and call:

cv_varsel(URAM.brm1, search_terms=params)

Or will this bias the cross-validation approach somehow?

If this effectively works around the issue of cv_varsel() and varsel() not running, while maintaining the integrity of the model, I think it is the best way to move forward.

Additionally, once I fit the model without the random intercept, I ran cv_varsel() and these are the results:

> URAM1.1_cv <- cv_varsel(URAM.brm1.1, cv_method='LOO', cores=4)
[1] "Computing LOOs..."
  |===========================================================================| 100%
[1] "Performing the selection using all the data.."
[1] "Done."
Warning messages:
1: In cv_varsel.refmodel(refmodel, ...) :
  K provided, but cv_method is LOO.
2: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

> URAM1.1_cv$vind
NULL

> URAM1.1_cv
   size    elpd elpd.se
2     0 -838.23   39.04
3     1 -798.70   38.11
4     2 -789.38   37.96
5     3 -781.92   37.62
6     4 -777.61   37.35
7     5 -792.18   38.58
8     6 -777.97   37.72
9     7 -775.31   37.66
10    8 -777.67   37.88
11    9 -777.38   37.88
12   10 -776.66   37.96
13   11 -777.54   38.00

> varsel_plot(URAM.1_cv, stats = c('elpd', 'rmse'))
Error in varsel_plot(URAM.1_cv, stats = c("elpd", "rmse")) : 
  could not find function "varsel_plot"


> nv <- suggest_size(URAM1.1_cv, alpha=0.1)
> nv
[1] 2

I’m assuming the warning messages are not fatal, as they seem to imply that some of the parameters are not informative (ultimately what I’m trying to diagnose anyway). However, it is interesting that URAM1.1_cv$vind is NULL and that R can’t find the varsel_plot() function, given that I am following along with the steps outlined by @avehtari here: Bayesian variable selection for candy ranking data, and have all required packages loaded.

I can omit the plot, as the elpd values are still reported, and the suggested size still prints. However, without URAM1.1_cv$vind, I am unsure if there is a specific ordering of the parameters. It appeared in the Candy example that parameters are ranked and df$vind reports that ranking. Is my understanding of this correct, and is there another way to see the order in which parameters are added?

Thanks so much for all your help so far. This CV approach to variable selection is considered very promising in my lab group, but nobody has attempted it yet, so I’m very grateful to have some expert input on where my errors are.

Update:

I attempted to run cv_varsel() with search_terms=params, and am receiving:

> params<-c("HIKER.avg.sc", "MOTOR.avg.sc", "NOHIKER.avg.sc",
+           "CR_CLOSURE.sc", "PROJ_AGE_1.sc", "NDVI.sc", 
+           "water_DIST.sc", "ag_DIST.sc", 
+           "linear_DIST.sc",
+           "ELEVATION.sc", "SLOPE.sc", 
+           "Height.sc", "Dist_T.sc")
> URAM_cv <- cv_varsel(URAM.brm0, cv_method='LOO', 
+                      search_terms=params,
+                      cores=16)
Error: cannot allocate vector of size 10.5 Gb

Seems there’s a massive amount of data to be dealt with; even a 16-core, 64 GB RAM virtual machine isn’t capable of processing it. I also tried:

> memory.limit()
[1] 65535
> memory.limit(size=70000)

which didn’t seem to help.

We updated projpred some time ago and some of the API changed with it. For instance, we no longer have the $vind slot and instead use $solution_terms. You might benefit from taking a look at the updated vignettes for the package.
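Concretely:

# projpred 2.x equivalents of the old API:
solution_terms(URAM1.1_cv)                   # replaces $vind: terms in selection order
plot(URAM1.1_cv, stats = c('elpd', 'rmse'))  # replaces varsel_plot()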

Best,
Alejandro

That’s surprising. I guess it has to do with cross-validating lots of models and the memory that requires, which is quite heavy. We have some memory improvements to make over time, but we are not actively working on that at the moment.
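In the meantime, you could try reducing the number of projected draws/clusters and subsampling the LOO folds; a sketch (the values are illustrative, not recommendations):

# Memory-reducing knobs in projpred 2.x; the values here are illustrative.
URAM_cv <- cv_varsel(URAM.brm0, cv_method = 'LOO',
                     search_terms = params,
                     nclusters = 20,       # cluster posterior draws in the search
                     nclusters_pred = 20,  # ...and in the performance evaluation
                     nloo = 500)           # subsample the LOO folds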

No worries! I really appreciate all of your help troubleshooting, but it seems this might just not be the right fit for my work. I’ll keep it handy for another time, and thanks again for talking me through some of this!

I think I am running into a similar issue with my data. I have presence/absence data on lesions in three different regions of the body, and I am fitting a nested Bernoulli model with a fixed effect of region and a random intercept for subject, to assess the effect of region on lesion presence by extracting estimated marginal means and making pairwise comparisons. I was curious whether any between-subject covariates would be helpful as predictors, so I ran a full reference model and planned to use projpred for model selection.

When running the full model, it converges without issue:

case_proportion_total_level_complete <- read.csv("Case Proportion Total Level Complete.csv", header = TRUE)

# convert ID and the categorical covariates to factors
case_proportion_total_level_complete$ID             <- as.factor(case_proportion_total_level_complete$ID)
case_proportion_total_level_complete$Sex            <- as.factor(case_proportion_total_level_complete$Sex)
case_proportion_total_level_complete$Subtype        <- as.factor(case_proportion_total_level_complete$Subtype)
case_proportion_total_level_complete$Level          <- as.factor(case_proportion_total_level_complete$Level)
case_proportion_total_level_complete$LesionPresence <- as.factor(case_proportion_total_level_complete$LesionPresence)

model.case.proportion.total.level.complete <- brm(
  formula = LesionPresence ~ Level + Sex + AgeDeath + DD + PMI + (1 | ID),
  data = case_proportion_total_level_complete,
  family = "bernoulli",
  warmup = 2000, iter = 4000, seed = 34
)

Next, when running cv_varsel() I continually see the "boundary (singular) fit: see ?isSingular" message, and it takes forever to run.

I then tried running varsel(), which does run, though it throws the same singular-fit message and also states:

Warning message:
In varsel.refmodel(refmodel, ...) :
  The performance of the projected model seems to be misleading, we recommend checking the reference model as well as running `varsel` with a larger `ndraws_pred` or `cv_varsel` for a more robust estimation.

Looking at solution_terms, I see the following:

[1] "(1 | ID)" "Level" "Sex" "DD" "PMI" "AgeDeath"

I then plotted elpd, rmse, auc, and pctcorr (the plot itself is not reproduced here).
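The call was roughly as follows (vs stands in for my varsel object):

# Roughly how the plots were produced; vs is a stand-in name for the
# varsel object, and the stats names follow projpred 2.x.
plot(vs, stats = c("elpd", "rmse", "auc", "pctcorr"))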

I am a bit naive to all of this and was wondering how I should interpret these results. I was thinking I should just go forward with a simpler model that includes the random effect and Level, as I am interested in within-subject comparisons and the between-subject covariates seemed to make the model worse?