Favor better LOO-CV results or Pareto k diagnostics for model selection?

Dear all, I have two competing models based on sampling healthcare facilities.

The models are binomial and try to model vaccine effectiveness for some disease.

For each healthcare facility (~400 in total) I have the number of cases and the denominators.
I’m evaluating two competing models: one (mod1) using a random intercept for the inter-facility variation, and one (mod2) using a beta-binomial likelihood to account for the facility-related overdispersion.

Apparently model 1 performs better:

brms::loo_compare(brms::loo(mod1), brms::loo(mod2))
                   elpd_diff se_diff
mod1               0.0       0.0  
mod2               -89.5     63.2  

But the first model has a lot of observations with high Pareto k:

> brms::loo(mod1)

Computed from 15000 by 746 log-likelihood matrix.

         Estimate    SE
elpd_loo  -1808.8  71.7
p_loo       498.0  36.0
looic      3617.7 143.4
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 2.3]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     384   51.5%   130     
   (0.7, 1]   (bad)      306   41.0%   <NA>    
   (1, Inf)   (very bad)  56    7.5%   <NA>    

while the second model had almost none:

> brms::loo(mod2)

Computed from 15000 by 746 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1898.4 42.6
p_loo        34.0  2.1
looic      3796.7 85.2
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.7]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     745   99.9%   3525    
   (0.7, 1]   (bad)        1    0.1%   <NA>    
   (1, Inf)   (very bad)   0    0.0%   <NA>    

Could it be that the first model is showing too much sensitivity to individual observations, i.e. the better performance is due to overfitting?

The results are similar:

#  model 1
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 ((1-plogis(Vaccin... > 0     0.64       0.1     0.46     0.79        Inf         1    *

#  model 2
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 ((1-plogis(Vaccin... > 0     0.75      0.13     0.52     0.92      14999         1    *

with the second model showing a higher vaccine effectiveness estimate but also a higher Est.Error and a wider credible interval.

What reasoning should drive me in the choice between the two models?

I think the interpretation that the first model is more surprised by (or sensitive to) many of the data points is fitting. You should have gotten a warning suggesting either moment_match = TRUE or even reloo = TRUE, I think.
In either case, the high number of high Pareto k values makes the PSIS approximation unreliable, and the comparison is downstream of that.
I would recompute loo for that model, using moment matching if that is enough, or otherwise reloo, which I suppose will sadly take a substantial amount of time. Afterwards you can re-compare the models with trustworthy elpd estimates.
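A sketch of what that recomputation could look like (assuming mod1 was fit with save_pars = save_pars(all = TRUE), which moment matching requires; reloo refits the model once per problematic observation):

```r
# First try moment matching, which is cheap:
loo1 <- brms::loo(mod1, moment_match = TRUE)
# If many Pareto k values are still above 0.7, fall back to exact refits:
loo1 <- brms::loo(mod1, reloo = TRUE)
# Then redo the comparison with the corrected estimate:
brms::loo_compare(loo1, brms::loo(mod2))
```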

For PSIS-LOO to be reliable for models with random coefficients, it is necessary that there be a large-ish (or at least moderate) number of data points in every group. This is intuitive in the limit where a group has only one observation: removing that one observation will substantially change the posterior, in a way that is difficult for PSIS to correct.

In your case, the upshot is that based on the results you have obtained so far, you cannot infer which model should be expected to have better leave-one-out predictive performance. If I understand correctly and you have just a single observation per group, moment_match is probably not going to be sufficient to save you, and reloo might be computationally burdensome. Your best bet might be to switch to k-fold CV, e.g. with k = 10.
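For instance (a sketch; I'm assuming the facility/outbreak identifier used in the models is OutbreakID, and folds = "grouped" holds out entire groups per fold, which sidesteps the one-observation-per-group problem):

```r
# 10-fold CV with whole facilities/outbreaks held out in each fold:
kf1 <- brms::kfold(mod1, K = 10, folds = "grouped", group = "OutbreakID")
kf2 <- brms::kfold(mod2, K = 10, folds = "grouped", group = "OutbreakID")
brms::loo_compare(kf1, kf2)
```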

Thank you both for your answers.

I tried a k-fold comparison with loo_compare(kfold(mod1), kfold(mod2), criterion = "kfold"), but after a long while it failed due to a factor-level mismatch: “New factor levels are not allowed”.
Any suggestion on how to properly run it in the presence of categorical predictors?

Here’s a description of the data:

'data.frame':	746 obs. of  10 variables:
 $ Num                     : int  63 71 1 71 8 1 4 1 2 0 ...
 $ Denom                   : int  88 156 1 178 110 1 118 49 24 1 ...
 $ StudyLin                : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HospSize                : num  0.477 1.05 1.187 1.187 0.7 ...
 $ HospType                : chr  "MIX" "GNH" "GNH" "GNH" ...
 $ TestingStrategy         : chr  "ALLLTCF" "ALLLTCF" "ALLLTCF" "ALLLTCF" ...
 $ ReportingCountry        : chr  "ES" "ES" "ES" "ES" ...
 $ Vaccination             : chr  "FullVax" "FullVax" "PartialVax" "FullVax" ...
 $ Study                   : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ OutbreakID              : chr  "01-1747099.2021-07-21" "01-1755332.2021-07-19" "01-1811803.2021-08-01" "01-1811803.2021-08-01" ...

From the theory point of view: so LOO cannot generally be trusted for comparison if observation-level random intercepts are used to account for overdispersion?
Btw, technically the model has more than one row per random effect since, for each facility, we have numerators and denominators by vaccine status, with the number of statuses being variable between facilities (but usually it’s two in most cases).

You need to separate LOO itself from how the computation is made. LOO can be trusted if the computation can be trusted. In your case, it’s the importance sampling computation (PSIS) that fails when using draws from the joint posterior. See also CV-FAQ 3, “What are the parts of cross-validation”.

PSIS-LOO can be used for varying effect models by integrating out those varying effects, which is relatively easy if there are only one or two parameters per group. This is illustrated in the Roaches case study with a Poisson model with a varying intercept. That case study is very close to what you have. brms doesn’t yet have integrated LOO implemented, and you would need to do some extra coding, but we’re working on making it easy with brms, too.
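To give an idea of the integration step, here is a simplified sketch (not the full integrated-LOO computation: in the case study the marginalization is done for every posterior draw; gauss.quad.prob is from the statmod package, and eta is the observation's linear predictor without the group intercept):

```r
# Marginal log-likelihood of one binomial observation, integrating the
# group intercept u ~ N(0, sd_u) out of a logit likelihood via
# Gauss-Hermite quadrature:
int_loglik <- function(y, n, eta, sd_u, Q = 30) {
  gq <- statmod::gauss.quad.prob(Q, dist = "normal", sigma = sd_u)
  log(sum(gq$weights * dbinom(y, n, plogis(eta + gq$nodes))))
}
```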

For kfold you need to add the argument allow_new_levels=TRUE. As running K-fold CV can take some time, it might be better to store the results before using loo_compare(). For example, as

mod1 <- add_criterion(mod1, criterion="kfold", allow_new_levels=TRUE)
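and then, continuing the sketch (mod2 treated the same way, so the stored criteria can be compared):

```r
# Store the kfold criterion in each fitted model object, then compare:
mod2 <- add_criterion(mod2, criterion = "kfold", allow_new_levels = TRUE)
loo_compare(mod1, mod2, criterion = "kfold")
```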

Sorry for the late reply.

I tried allow_new_levels=TRUE with kfold, but I still get the error: New factor levels are not allowed.

Also, I tried another model with random effects only for the categorical variables; it no longer has high Pareto k observations, but it fits much worse.

I think it’s useful to start from the beginning.
The study evaluates the protective effect of vaccination by number of doses during COVID-19 outbreaks in long-term care facilities, across two different study periods (a proxy for different virus variants).

Here’s the initial model (a very well-calibrated fit, but with a high number of influential observations: 41.2% bad, 8.2% very bad):
Num | trials(Denom) ~ 0 + Vaccination:Study + StudyLin * (LTCFSize + LTCFType + TestingStrategyResidents + ReportingCountry) + (1 | OutbreakID)
family: binomial
It’s a simple model with a different partially pooled intercept for each vaccination status by study period, plus the effect of the study period itself and its interaction with various facility-level characteristics.

The following is a calibration plot of this model against the observed rates (regularized as qbeta(.5, cases + 1, non-cases + 1), i.e. under a uniform prior, to smooth the observed zeros).
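In code, the regularization described above could look like this (a sketch; data stands for the data frame described earlier, with Num cases out of Denom residents). The posterior median of a Beta(cases + 1, non-cases + 1) is the observed proportion shrunk under a uniform Beta(1, 1) prior, so facilities with zero cases get a small positive rate:

```r
# Regularized "observed" rates used for the calibration plot:
obs_rate <- with(data, qbeta(0.5, Num + 1, Denom - Num + 1))
```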

Then I used a model with the same formula but a beta-binomial distribution, which fitted the data worse but had far fewer influential observations (4.9% bad).

Finally, I fitted another model with the beta-binomial response but without a random intercept for the outbreaks, which is the worst performing but has no high Pareto k observations.

Interestingly, if I simulate new outbreaks, the difference in fit between the models almost disappears (differences in uncertainty remain):

The average Bayes R2 for the three models on the training data is respectively 97%, 69%, and 47%. If I simulate new outbreak IDs I get roughly the same R2, about 47%, for all of them.
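For reference, a sketch of how such R2 values can be obtained with brms (new_outbreaks is a hypothetical data frame with unseen OutbreakID values; sample_new_levels = "gaussian" draws intercepts for the new outbreaks from the fitted population distribution):

```r
# In-sample R2, including the fitted outbreak intercepts:
brms::bayes_R2(mod1)
# R2 when predicting simulated new outbreaks:
brms::bayes_R2(mod1, newdata = new_outbreaks,
               allow_new_levels = TRUE, sample_new_levels = "gaussian")
```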

So the extra parameters per facility outbreak provide most of the “better” fit, which makes sense.
But my estimand of interest is the effect of vaccination in the two studies, so I wonder if the clear overfit due to the random intercept increases or decreases the bias of its estimates.

Sorry, there is not enough information to solve this error. Could you share minimal data and a model that give this error?

Are you using cross-validation predictions? If you are using just the posterior predictions then this calibration plot is not useful.

If you did not use cross-validation, then you don’t know whether it’s actually worse or whether the binomial model has overfitted, which is likely given the many highly influential observations.

If these are not cross-validated then they are not useful for model comparison.

Dear Aki,

I created a mock model with simulated data: Transfer - Dropbox
I tested it using:

mod_sim_cv <- add_criterion(mod_sim, criterion="kfold", allow_new_levels=TRUE, group = "OutbreakID", folds = "grouped", K = 5)

and indeed I get the error:

Error: New factor levels are not allowed.
Levels allowed: 'GNH', 'MD', 'MIX', 'O', 'RH', 'RSH'
Levels found: 'GNH', 'MD', 'MIX', 'RSH', 'UNK'

Ping @paul.buerkner . Looking at the code in kfold.R it looks like

  ll_args$allow_new_levels <- TRUE

but why does it not work here?


The dropbox link timed out. Can you share a minimal reprex for me to run?

Dear Paul,

Thank you for looking into this.

Here’s a new link: Transfer - Dropbox

The code to generate the model is very simple, but the main reproducibility ingredient is the data, which is contained in the model object. Here’s the code that produces the simulated data from the original data (which I cannot disclose, since it’s collected by the EU countries):

data_sim <- data |>
    dplyr::mutate(
        pNum = Num / Denom,
        pDenom = Denom / round(exp(LTCFSize + 4)),
        Size = rpois(n(), exp(LTCFSize + 4)) + 1,
        LTCFSize = log(Size) - 4,
        Denom = rbinom(n(), prob = pDenom, size = Size),
        Num = rbinom(n(), prob = pNum, size = Denom),
        Vaccination, Study,
        StudyLin = as.numeric(Study),
        LTCFType, TestingStrategyResidents,
        across(c(ReportingCountry, OutbreakID), ~ sapply(.x, rlang::hash)),
        .keep = "none"
    ) |>
    dplyr::select(-pNum, -pDenom, -Size)

priors <- c(
    brms::set_prior("student_t(3, 0, .5)", class = "sd"),
    brms::set_prior("student_t(3, 0, 1.5)", class = "b", lb = -6, ub = 4.5))

formula <- Num | trials(Denom) ~ 0 + Vaccination:Study + StudyLin * (LTCFSize + LTCFType + TestingStrategyResidents + ReportingCountry) + (1 | OutbreakID)

mod_ref_sim <- brms::brm(
    formula = formula,
    family = stats::binomial(),
    data = data_sim,
    prior = priors,
    cores = future::availableCores(), chains = 8, backend = "cmdstanr",
    iter = 2000, seed = 9283749, save_pars = brms::save_pars(all = TRUE)
)

mod_ref_sim_cv <- add_criterion(mod_ref_sim, criterion="kfold", allow_new_levels=TRUE, group = "OutbreakID", folds = "grouped", K = 5)
# Error appears here

Once you download the model from the link you just need to run the last line to reproduce.


I would prefer not to download a 3 GB model just for this error. Can you come up with a minimal reprex based on a small simulated dataset, so that I can just execute a short script myself to see the error without downloading anything?

Also note that the brmsfit object contains the data, so through Dropbox you are actually still sharing (some parts of) the data.