Favor better LOO-CV results or Pareto k diagnostics for model selection?

Dear all, I have two competing models fitted to data sampled from healthcare facilities.

The models are binomial and aim to estimate vaccine effectiveness for a disease.

For each facility (~400 in total) I have the number of cases and the denominators.
One model (mod1) uses a random intercept for the inter-facility variation, while the other (mod2) uses a beta-binomial likelihood to account for the facility-related overdispersion.
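
Roughly, the two specifications look like this in brms (the variable names and formulas below are simplified placeholders rather than my exact code):

# mod1: binomial likelihood with a facility-level random intercept (placeholder names)
mod1 <- brms::brm(cases | trials(total) ~ vaccination + (1 | facility),
                  family = binomial(), data = dat)
# mod2: beta-binomial likelihood to absorb the facility-level overdispersion
# (beta_binomial() is built into recent brms versions; older ones need a custom family)
mod2 <- brms::brm(cases | trials(total) ~ vaccination,
                  family = beta_binomial(), data = dat)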

Apparently, model 1 performs better:

brms::loo_compare(brms::loo(mod1), brms::loo(mod2))
                   elpd_diff se_diff
mod1               0.0       0.0  
mod2               -89.5     63.2  

But the first model has a lot of observations with high Pareto k:

> brms::loo(mod1)

Computed from 15000 by 746 log-likelihood matrix.

         Estimate    SE
elpd_loo  -1808.8  71.7
p_loo       498.0  36.0
looic      3617.7 143.4
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 2.3]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     384   51.5%   130     
   (0.7, 1]   (bad)      306   41.0%   <NA>    
   (1, Inf)   (very bad)  56    7.5%   <NA>    

while the second model has almost none:

> brms::loo(mod2)

Computed from 15000 by 746 log-likelihood matrix.

         Estimate   SE
elpd_loo  -1898.4 42.6
p_loo        34.0  2.1
looic      3796.7 85.2
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.7]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     745   99.9%   3525    
   (0.7, 1]   (bad)        1    0.1%   <NA>    
   (1, Inf)   (very bad)   0    0.0%   <NA>    

Could it be that the first model is showing too much sensitivity to individual observations, i.e. the better apparent performance is due to overfitting?

The results are similar:

#  model 1
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 ((1-plogis(Vaccin... > 0     0.64       0.1     0.46     0.79        Inf         1    *

#  model 2
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 ((1-plogis(Vaccin... > 0     0.75      0.13     0.52     0.92      14999         1    *

with the second model showing a higher vaccine effectiveness estimate but also a higher Est.Error and a wider credible interval.

What reasoning should drive me in the choice between the two models?

I think the interpretation that the first model is more surprised by (or sensitive to) many of the data points is fitting. You should get a warning suggesting either moment_match = TRUE or even reloo = TRUE, I think.
In either case, the high number of large Pareto k values makes the PSIS approximation less reliable, and the comparison is downstream of that.
I would recompute LOO for that model using moment matching, if that is enough, or with reloo, which sadly will take a substantial amount of time, I suppose. Afterwards you can re-compare the models with a trustworthy elpd estimate.
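
For example, something along these lines (a sketch; note that moment matching may require the model to have been fitted with save_pars = save_pars(all = TRUE)):

# Try moment matching first; it is much cheaper than refitting.
loo1 <- brms::loo(mod1, moment_match = TRUE)
# If many Pareto k values stay above 0.7, fall back to exact refits instead
# (slow: one refit per problematic observation).
# loo1 <- brms::loo(mod1, reloo = TRUE)
# Then redo the comparison with the more reliable estimate.
brms::loo_compare(loo1, brms::loo(mod2))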

For PSIS-LOO to be reliable for models with random coefficients, it is necessary that there be a large-ish (or at least moderate) number of data points in every group. This can be made intuitive in the limit where a group has only one observation: removing that one observation will obviously have a substantial influence on the posterior, an influence that importance sampling (PSIS) cannot be expected to correct for reliably.

In your case, the upshot is that, based on the results you have obtained so far, you cannot infer which model should be expected to have better leave-one-out predictive performance. If I understand correctly and you have just a single observation per group, moment_match is probably not going to be sufficient to save you, and reloo might be computationally burdensome. Your best bet might be to switch to K-fold CV, e.g. with K = 10.
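
Something along these lines (a sketch; by default brms::kfold assigns observations to random folds):

kf1 <- brms::kfold(mod1, K = 10)
kf2 <- brms::kfold(mod2, K = 10)
brms::loo_compare(kf1, kf2)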

Thanks to both of you for your answers.

I tried a K-fold comparison with loo_compare(kfold(mod1), kfold(mod2), criterion = "kfold"), but after a long while it failed due to a factor level mismatch: “New factor levels are not allowed”.
Any suggestion on how to properly run it in the presence of categorical predictors?

Here’s a description of the data:

'data.frame':	746 obs. of  10 variables:
 $ Num                     : int  63 71 1 71 8 1 4 1 2 0 ...
 $ Denom                   : int  88 156 1 178 110 1 118 49 24 1 ...
 $ StudyLin                : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HospSize                : num  0.477 1.05 1.187 1.187 0.7 ...
 $ HospType                : chr  "MIX" "GNH" "GNH" "GNH" ...
 $ TestingStrategy         : chr  "ALLLTCF" "ALLLTCF" "ALLLTCF" "ALLLTCF" ...
 $ ReportingCountry        : chr  "ES" "ES" "ES" "ES" ...
 $ Vaccination             : chr  "FullVax" "FullVax" "PartialVax" "FullVax" ...
 $ Study                   : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ OutbreakID              : chr  "01-1747099.2021-07-21" "01-1755332.2021-07-19" "01-1811803.2021-08-01" "01-1811803.2021-08-01" ...

From the theory point of view, then, LOO cannot generally be trusted for comparison if observation-level random intercepts are used to account for overdispersion?
Btw, technically the model has more than one row per random effect since, for each facility, we have numerators and denominators by vaccination status, with the number of statuses varying between facilities (but it's two in most cases).

You need to separate LOO itself from how the computation is made. LOO can be trusted if the computation can be trusted. In your case, it's the importance sampling computation (PSIS) that fails when using draws from the joint posterior. See also CV-FAQ 3: What are the parts of cross-validation?

PSIS-LOO can be used for varying-effect models by integrating out those varying effects, which is relatively easy if there are only one or two parameters per group. This is illustrated in the Roaches case study with a Poisson model with a varying intercept; that case study is very close to what you have. brms doesn't yet have integrated LOO implemented, so you would need to do some extra coding, but we're working on making it easy with brms, too.
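
To give an idea of the extra coding, a rough sketch for a binomial model with a Gaussian facility intercept could look like the following; eta_fixed and sd_facility are placeholders for quantities you would extract from the posterior draws, not brms functions:

# Per-observation log-likelihood with the facility intercept integrated out
# by one-dimensional quadrature; the resulting S x N matrix can be passed to loo::loo().
# eta_fixed: S x N matrix of fixed-effect linear predictors (placeholder)
# sd_facility: length-S vector of group-level SDs (placeholder)
integrated_log_lik <- function(eta_fixed, sd_facility, Num, Denom) {
  S <- nrow(eta_fixed)
  N <- ncol(eta_fixed)
  ll <- matrix(NA_real_, S, N)
  for (i in seq_len(N)) {
    for (s in seq_len(S)) {
      f <- function(u) {
        dbinom(Num[i], Denom[i], plogis(eta_fixed[s, i] + u)) * dnorm(u, 0, sd_facility[s])
      }
      ll[s, i] <- log(integrate(f, lower = -Inf, upper = Inf)$value)
    }
  }
  ll
}

Running loo::loo() on that matrix (with r_eff from loo::relative_eff()) then gives PSIS-LOO with the intercepts integrated out, which should bring the Pareto k values down.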

For kfold you need to add the argument allow_new_levels = TRUE. As running K-fold CV can take some time, it might be better to store the results before using loo_compare(), for example as

mod1 <- add_criterion(mod1, criterion = "kfold", allow_new_levels = TRUE)
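
and then the same for the second model, after which the stored criteria can be compared, e.g.

mod2 <- add_criterion(mod2, criterion = "kfold", allow_new_levels = TRUE)
# compare the stored K-fold criteria of the two fits
loo_compare(mod1, mod2, criterion = "kfold")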