Projpred with brms object

I have a glmm bernoulli brmsfit object. Calling projpred::varsel or cv_varsel gives an error message and no results.

The model code is:

bprior ← c(prior(student_t(3, 0, 2.5), class = ‘b’))

fit1.brmsfit ←
brm(Hypoglycemic60 ~
AgeClass +
Sex +
BMIClass +
ASA2Class +
# CaseYear +
Emergency +
AdmissionType +
AnesthesiaIntervalClass +
BypassDuration +
ElixCount +
IntraopVasoplegia +

IntraopNutrition +

IntraopHypoglycemic +
IntraopHypotensive +
IntraopInotropeVasopressor +
IntraopBetaBlocker +
EBLClass +
PRBCClass +
(1 | MPOGInstitutionID)

,
data = tmp.df,
family = ‘bernoulli’,
prior = bprior,
chains = 4,
seed = 9876,
iter = 2000,
control = list(
max_treedepth = 40)
)

fit1.brmsfit
Family: bernoulli
Links: mu = logit
Formula: Hypoglycemic60 ~ AgeClass + Sex + BMIClass + ASA2Class + Emergency + AdmissionType + AnesthesiaIntervalClass + BypassDuration + ElixCount + IntraopVasoplegia + IntraopHypoglycemic + IntraopHypotensive + IntraopInotropeVasopressor + IntraopBetaBlocker + EBLClass + PRBCClass + (1 | MPOGInstitutionID)
Data: tmp.df (Number of observations: 72610)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Group-Level Effects:
~MPOGInstitutionID (Number of levels: 29)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 0.63 0.14 0.40 0.95 1.00 1682 2464

The model fitting is OK (Rhat, ESS).

There are a few slightly high k estimates.

Computed from 4000 by 72610 log-likelihood matrix

     Estimate    SE

elpd_loo -1420.7 80.8
p_loo 44.1 3.2
looic 2841.4 161.6
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 72602 100.0% 1328
(0.5, 0.7] (ok) 8 0.0% 3147
(0.7, 1] (bad) 0 0.0%
(1, Inf) (very bad) 0 0.0%

All Pareto k estimates are ok (k < 0.7).
See help(‘pareto-k-diagnostic’) for details.

The data frame tmp.df has no missing values.

The versions are brms 2.16.3, rstan 2.21.3 and projpred 2.0.2.

fit1.vsel ← varsel(fit1.brmsfit)
Error in model.frame.default(data = data, weights = weights, drop.unused.levels = TRUE, :
object is not a matrix

fit1.vsel ← cv_varsel(fit1.brmsfit)
Warning in cv_varsel.refmodel(refmodel, …) :
K provided, but cv_method is LOO.
Warning: Some Pareto k diagnostic values are slightly high. See help(‘pareto-k-diagnostic’) for details.

[1] “Computing LOOs…”
| | 0%Error in model.frame.default(data = data, weights = weights, drop.unused.levels = TRUE, :
object is not a matrix

I have searched and haven’t found an explanation for this error message.

Nathan

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] projpred_2.0.2 mice_3.14.0 rstan_2.21.3 StanHeaders_2.21.0-7 loo_2.4.1 Hmisc_4.6-0 Formula_1.2-4 lattice_0.20-44
[9] skimr_2.1.3 readit_1.0.0 brms_2.16.3 Rcpp_1.0.7 stevemisc_1.3.0 broom.mixed_0.2.7 broom_0.7.10 sjlabelled_1.1.8
[17] sjPlot_2.8.10 lme4_1.1-27.1 Matrix_1.3-4 Epi_2.44 cobalt_4.3.1 MatchIt_4.3.2 logistf_1.24 sjmisc_2.8.9
[25] collapse_1.6.5 car_3.0-12 carData_3.0-4 coin_1.4-2 survival_3.2-13 arsenal_3.6.3 tableone_0.13.0 tables_0.9.6
[33] lubridate_1.8.0 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.1.1 tidyr_1.1.4 tibble_3.1.6
[41] ggplot2_3.3.5 tidyverse_1.3.1 magrittr_2.0.1 tinytex_0.36 rmarkdown_2.11 knitr_1.37

ping @AlejandroCatalina @paul.buerkner

could you try it with the lastest projpred developmental version to be installed via remotes::install_github("stan-dev/projpred", ref = "develop")

1 Like

Thanks for pointing me to the version on github.

projpred started without error message.

However, no projpred result can be obtained after various attempts.

My glmm model has about 15 covariates with 30 predictors. The sample size is 72610, but the number of events is 226. I used a t prior.

All the model estimation properties are ok.

Running cv_varsel(model) crashes R on the linux server (134 GB RAM).

I assume that this is memory management problem.

I was able to obtain a reference model using get_refmodel. The object size is 13.8 GB.

Running cv_varsel on the reference model also crashed the R session.

I must assume that my data set with a sparse outcome is not suitable for projective prediction.

Perhaps this is an example of don’t try models unless there are 10 or more events for each predictor.

1 Like

Hmm, interesting. I don’t really know what exactly happens. @AlejandroCatalina do you have an intuition?

1 Like

Well I can assume projpred is trying to project too many posterior draws, can you try with cv_varsel(model, ndraws=10, ndraws_pred=10) or other equally small numbers?

1 Like

Just skimmed quickly through this and as far as I understand @nlpacestan’s code, there are 72610 observations here. Calling cv_varsel() with default arguments (as in @nlpacestan’s code) would then result in (PSIS-)LOO CV with a validated search, i.e., 72610 searches are run. That’s pretty much and could cause a memory issue. K-fold CV could be a remedy, or (PSIS-)LOO CV without a validated search (but that has the downside of, well, an unvalidated search).

1 Like

Note that for multilevel reference models fit in brms, the K-fold CV in projpred can (but doesn’t have to necessarily) lead to an error which I’m about to submit as an issue on GitHub soon.

EDIT: That issue may now be found here.

I followed this suggestion.

tmp.cvsel ← cv_varsel(fit1.brmsfit, ndraws=10, ndraws_pred=10)

Warning: Some Pareto k diagnostic values are slightly high. See help(‘pareto-k-diagnostic’) for details.

[1] “Repeating forward search for 72610 LOO folds…”

After 36 hours, the walltime on the server expired.

There was no object returned.

???

Nathan

Yeah it might be better in this case to avoid full LOO, you can start by running the same arguments with varsel only instead of cv_varsel.

Best,
Alex

I successfully ran using vsel(object, ndraws = 10, ndraws_pred = 10).

Family: binomial
Link function: logit

Formula: Hypoglycemic60 ~ AgeClass + Sex + BMIClass + ASA2Class + Emergency +
AdmissionType + AnesthesiaIntervalClass + BypassDuration +
ElixCount + IntraopVasoplegia + IntraopHypoglycemic + IntraopHypotensive +
IntraopInotropeVasopressor + IntraopBetaBlocker + EBLClass +
PRBCClass + (1 | MPOGInstitutionID)
Observations: 72610
Search method: forward, maximum number of terms 17
Number of clusters used for selection: 10
Number of clusters used for prediction: 10
Suggested Projection Size: 17

Selection Summary:
size solution_terms elpd se diff diff.se
0 -1530.2 86.5 -153.6 18.4
1 AgeClass -1499.6 84.9 -123.0 15.4
2 ElixCount -1477.3 83.7 -100.8 13.5
3 BMIClass -1447.6 82.0 -71.1 10.4
4 BypassDuration -1436.3 81.5 -59.7 9.2
5 Sex -1427.7 81.0 -51.1 8.5
6 EBLClass -1425.3 80.9 -48.7 8.1
7 IntraopHypoglycemic -1423.0 80.8 -46.4 7.7
8 ASA2Class -1420.4 80.7 -43.8 7.6
9 IntraopHypotensive -1418.7 80.6 -42.1 7.6
10 IntraopVasoplegia -1417.6 80.5 -41.0 7.4
11 Emergency -1417.0 80.5 -40.4 7.3
12 AdmissionType -1416.5 80.5 -39.9 7.3
13 PRBCClass -1416.1 80.5 -39.5 7.4
14 AnesthesiaIntervalClass -1415.7 80.4 -39.1 7.3
15 IntraopBetaBlocker -1415.5 80.4 -38.9 7.4
16 IntraopInotropeVasopressor -1415.5 80.4 -38.9 7.4
17 (1 | MPOGInstitutionID) -8145.9 541.0 -6769.3 464.9

solution_terms(tmp.vsel)
[1] “AgeClass” “ElixCount” “BMIClass” “BypassDuration”
[5] “Sex” “EBLClass” “IntraopHypoglycemic” “ASA2Class”
[9] “IntraopHypotensive” “IntraopVasoplegia” “Emergency” “AdmissionType”
[13] “PRBCClass” “AnesthesiaIntervalClass” “IntraopBetaBlocker” “IntraopInotropeVasopressor”
[17] “(1 | MPOGInstitutionID)”
suggest_size(tmp.vsel)
[1] 17

There are 17 terms in the model. The model fit met the usual criteria. No terms can be removed from the model by projpred.

The plot of the vsel object shows elpd increasing with each term until the 17th term, the group intercept parameters. Then the elpd drops by about 6800. The grouping factor has 29 levels; there are 72610 observations, but only 226 events. Is this drop in the elpd with the 17th term expected?

I had attempted frequentist glmer models first, but couldn’t get stable estimations including boundary problems.

I was hoping that by using some informative priors (used student t) I could find some relationships. But it appears that the signal is too weak .

I appreciate the help from the experts.

Nathan

ndraws_pred = 10 seems quite low to me. I would recommend starting with the defaults for ndraws and ndraws_pred. But you need the development version of projpred (GitHub branch develop) to use the updated defaults. Apart from that, I would recommend the development version in any case.

As far as I know, the fact that a frequentist GLMM has boundary issues hints at a hierarchical standard deviation close to zero. That could perhaps explain the dropping ELPD when the group-level term comes in: If the actual hierarchical standard deviation is zero, its (marginal) posterior in the Bayesian model will still have all of its probability mass on the positive real line (because of the nature of MCMC sampling).

P.S.: Could you perhaps use code formatting for code and code output? That’s one backtick at each side of inline code and three backticks above and below multi-line code.