Loo_compare vs averaging/weighting via stacking or pseudo-BMA weighting

I ran 8 models: 4 with only fixed effects using stan_glm and 4 mixed models using stan_lmer. I'm trying to compare them to decide which of the 8 I should proceed with.

All the models seem to have converged very well, judging from several diagnostic checks (ESS, ACF, trace plots, ESS ratio). From the print(loo) output, all models have khat < 0.5 and a Monte Carlo SE of elpd_loo of 0.0.
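
A minimal sketch of how such checks might be done for one of the fitted models (p_1_ec_g is the object name used later in the post; the bayesplot calls and chosen parameters are my own illustration, not code from the original post):

library(bayesplot)

# Rhat and effective sample sizes for all parameters
summary(p_1_ec_g)

# Trace and autocorrelation plots for a couple of parameters
mcmc_trace(as.array(p_1_ec_g), pars = c("(Intercept)", "sigma"))
mcmc_acf(as.array(p_1_ec_g), pars = c("(Intercept)", "sigma"))

# ESS-to-total-draws ratio (rstanarm's method for bayesplot's neff_ratio)
neff_ratio(p_1_ec_g)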

The dataset contains N = 503 observations (students). For example, one of the stan_glm models:

stan_glm(ec ~ g, data = p_1_36ec_trj, prior = normal(), chains = 4, iter = 10000, warmup = 1000)

And one of the stan_lmer models, with a group-specific intercept and slope:

stan_lmer(ec ~ g + (g | chr), data = p_1_36ec_trj, iter = 10000, warmup = 1000, adapt_delta = 0.99)

The remaining three models of each type are specified analogously.

When using loo for all 8 models, e.g.

p_1_ec_g_loo <- loo(p_1_ec_g, save_psis = TRUE)
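
(The khat and Monte Carlo SE numbers mentioned above can be read straight off each loo object; for example, using standard loo functions rather than code from the original post:

print(p_1_ec_g_loo)                  # elpd_loo with SE, Monte Carlo SE, and the Pareto k table
pareto_k_table(p_1_ec_g_loo)         # counts of observations per Pareto k range
max(pareto_k_values(p_1_ec_g_loo))   # largest khat; < 0.5 means PSIS-LOO is reliable here
)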

and then using loo_compare, I get this (I added a model name column for comparison with the stacking and weighting below):

(name)    elpd_diff    se_diff
model_5         0.0        0.0
model_7        -0.8        2.0
model_8        -3.3        2.1
model_1        -5.2        3.4
model_3        -8.2        3.9
model_4        -8.2        3.9
model_6        -14.7       5.5
model_2        -15.8       5.7
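
For reference, a comparison table like this comes from a single loo_compare call over the loo objects; a sketch using the object names that appear further down in the thread (the original post does not show the exact call):

loo_compare(p_1_ec_g_loo, p_1_ec_vo_loo, p_1_ec_g_vo_loo, p_1_ec_g_vo_ia_loo,
            p_1_ec_g_mm_chr_loo, p_1_ec_vo_mm_chr_loo,
            p_1_ec_g_vo_mm_chr_loo, p_1_ec_g_vo_ia_mm_chr_loo)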

Now, based on the thread here, am I right to assume that the differences between these models are far too small to prefer one over the other? According to @avehtari, the elpd_diff should exceed roughly 5 times the se_diff before favouring a model, which is not the case here: even the largest difference (15.8) is smaller than 5 × 5.7 = 28.5.

Furthermore, I used

lpd_point_all <- cbind(
  p_1_ec_g_loo$pointwise[,"elpd_loo"],
  p_1_ec_vo_loo$pointwise[,"elpd_loo"],
  p_1_ec_g_vo_loo$pointwise[,"elpd_loo"],
  p_1_ec_g_vo_ia_loo$pointwise[,"elpd_loo"],
  p_1_ec_g_mm_chr_loo$pointwise[,"elpd_loo"],
  p_1_ec_vo_mm_chr_loo$pointwise[,"elpd_loo"],
  p_1_ec_g_vo_mm_chr_loo$pointwise[,"elpd_loo"],
  p_1_ec_g_vo_ia_mm_chr_loo$pointwise[,"elpd_loo"])

followed by

pbma_wts_all <- pseudobma_weights(lpd_point_all, BB=FALSE)
pbma_BB_wts_all <- pseudobma_weights(lpd_point_all)
stacking_wts_all <- stacking_weights(lpd_point_all)

round(cbind(pbma_wts_all, pbma_BB_wts_all, stacking_wts_all), 3)

which results in:

(rank) (name)       pbma_wts_all   pbma_BB_wts_all   stacking_wts_all
model1 (model_5)           0.004             0.055              0.145
model2 (model_7)           0.000             0.002              0.000
model3 (model_8)           0.001             0.028              0.001
model4 (model_1)           0.000             0.006              0.000
model5 (model_3)           0.666             0.533              0.498
model6 (model_4)           0.000             0.001              0.000
model7 (model_6)           0.304             0.344              0.356
model8 (model_2)           0.025             0.032              0.000

Which of the two is 'more precise': the weighting or the comparison of elpd_diff and se_diff?

And, do the models differ enough to objectively decide for only one of them?

That information has been updated in a paper:

  • Tuomas Sivula, Måns Magnusson, and Aki Vehtari (2020). Uncertainty in Bayesian leave-one-out cross-validation based model comparison. arXiv preprint arXiv:2008.10296

which is also listed as a reference in CV-FAQ 15, "How to interpret Standard error (SE) of elpd difference (elpd_diff)".

You may also benefit from the discussion in another thread.

You can ignore this; it's just for those who want to reproduce the experiments in the Bayesian stacking paper.

If you made this a named list, you could name your models, and the names would show up in the weight outputs.
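
As an illustration (the model-to-object mapping here is taken from the corrected numbering given further down in the thread), loo_model_weights() from the loo package accepts a named list of loo objects and carries the names through to the printed weights:

loo_model_weights(
  list(model_1 = p_1_ec_g_loo,           model_2 = p_1_ec_vo_loo,
       model_3 = p_1_ec_g_vo_loo,        model_4 = p_1_ec_g_vo_ia_loo,
       model_5 = p_1_ec_g_mm_chr_loo,    model_6 = p_1_ec_vo_mm_chr_loo,
       model_7 = p_1_ec_g_vo_mm_chr_loo, model_8 = p_1_ec_g_vo_ia_mm_chr_loo),
  method = "stacking")  # or method = "pseudobma"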

When there are many models, the weights are easier to interpret. If you have two nested models, there is a monotonic mapping between the weights and probabilities (more about this coming soonish).

Models 3 and 6 are best, but you can get better predictions by averaging predictions from models 3, 5, and 6.

When models are similar, LOO-BB weights dilute the weights between models having similar predictive performance. I would guess that models 1, 2, 4, 7, and 8 are somehow similar to each other or to models 3, 5, or 6. When models are similar, stacking picks the best among the very similar predictions, but it averages over different predictive distributions if none of them is the true data-generating distribution. So in this case models 3, 5, and 6 are making different kinds of predictions, and it can be useful to check how they differ; one way to combine their predictions is sketched below. You may also consider Bayesian hierarchical stacking.
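
To actually average the predictions, one option is to mix posterior predictive draws from the highly weighted models in proportion to their stacking weights. A minimal sketch, where fit_a, fit_b, fit_c are hypothetical names for the three models with non-negligible weight and the weight values are the stacking weights reported above (for new data, posterior_predict() also takes a newdata argument):

library(rstanarm)

w    <- c(0.498, 0.145, 0.356)      # stacking weights of the three retained models
fits <- list(fit_a, fit_b, fit_c)   # the corresponding fitted stanreg objects

# Posterior predictive draws from each model (draws x observations matrices)
yrep_list <- lapply(fits, posterior_predict)

S <- 4000                           # size of the mixed predictive sample
# For each mixed draw, first pick a model with probability equal to its weight,
# then take one random predictive draw from that model
model_idx <- sample(seq_along(fits), size = S, replace = TRUE, prob = w)
yrep_mix  <- t(vapply(model_idx, function(m) {
  yrep_list[[m]][sample(nrow(yrep_list[[m]]), 1), ]
}, numeric(ncol(yrep_list[[1]]))))
# yrep_mix is an S x N matrix of draws from the stacking-weighted mixture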

Not without additional information about the models. If they are nested, choose the most complex one.


The models are as follows, where 1–4 are non-hierarchical models and 5–8 have group-specific slopes and intercepts.

ec = ec
g = gender (factor with 2 levels)
vo = level of secondary education (factor with 3 levels)
chr = cohort (factor with 4 levels)

The models try to predict students' progress based on gender and their prior educational level. The models with group-specific terms try to account for differences between cohorts.

model_1 <- stan_glm(ec ~ g, data = p_1_36ec_trj, chains = 4, iter = 10000, warmup = 1000)
model_2 <- stan_glm(ec ~ vopl, data = p_1_36ec_trj, chains = 4, iter = 10000, warmup = 1000)
model_3 <- stan_glm(ec ~ g + vopl, data = p_1_36ec_trj, chains = 4, iter = 10000, warmup = 1000)
model_4 <- stan_glm(ec ~ g*vopl, data = p_1_36ec_trj, chains = 4, iter = 10000, warmup = 1000)

model_5 <- stan_lmer(ec ~ g + (g|chr), data = p_1_36ec_trj, iter = 10000, warmup = 1000, adapt_delta = 0.99)
model_6 <- stan_lmer(ec ~ vopl + (vopl|chr), data = p_1_36ec_trj, iter = 10000, warmup = 1000, adapt_delta = 0.99)
model_7 <- stan_lmer(ec ~ vopl + (vopl|chr), data = p_1_36ec_trj, iter = 10000, warmup = 1000, adapt_delta = 0.99)
model_8 <- stan_lmer(ec ~ g + vopl + g:vopl + (g -1 | chr) + (vopl -1 | chr) + (g:vopl -1 | chr), data = p_1_36ec_trj, chains = 4, iter = 10000, warmup = 1000, adapt_delta = 0.999)

I see now that I made a mistake in the numbering of the models in my previous post. So here are the models displayed correctly:

lpd_point_all <- cbind(
  model1_ec_g = p_1_ec_g_loo$pointwise[,"elpd_loo"],
  model2_ec_vo = p_1_ec_vo_loo$pointwise[,"elpd_loo"],
  model3_ec_g_vo = p_1_ec_g_vo_loo$pointwise[,"elpd_loo"],
  model4_ec_g_vo_ia =  p_1_ec_g_vo_ia_loo$pointwise[,"elpd_loo"],
  model5_ec_g_mm =  p_1_ec_g_mm_chr_loo$pointwise[,"elpd_loo"],
  model6_ec_vo_mm = p_1_ec_vo_mm_chr_loo$pointwise[,"elpd_loo"],
  model7_ec_g_vo_mm = p_1_ec_g_vo_mm_chr_loo$pointwise[,"elpd_loo"],
  model8_ec_g_vo_ia_mm = p_1_ec_g_vo_ia_mm_chr_loo$pointwise[,"elpd_loo"])

And their weights:

       pbma_BB_wts_all stacking_wts_all
model1           0.062            0.145
model2           0.001            0.000
model3           0.028            0.001
model4           0.005            0.000
model5           0.527            0.498
model6           0.001            0.000
model7           0.344            0.356
model8           0.033            0.000

And when comparing only models 1, 5, and 7 (the rows below refer to these three models in that order):

       pbma_BB_wts_best stacking_wts_best
model1            0.073             0.083
model2            0.553             0.584
model3            0.375             0.333