More monologue recapitulating new (hopefully non-erroneous) insights:
Only when the summed variables are uncorrelated. Things become more complicated when they’re not.
This crucial distinction hasn’t been emphasized enough in the materials I’ve seen: there are, in fact, two layers of uncertainty in elpd_diff, and se_diff only describes one of them. Matters are made worse for the uninitiated by the fact that the models under comparison may show perfectly good convergence, with no warnings or red flags whatsoever, so the second layer of uncertainty in elpd_diff (its variability from one random seed to the next) can go completely unnoticed.
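To make the first layer concrete, here is a minimal sketch of what loo_compare() reports as se_diff, assuming two already-computed loo objects here called fit1_loo and fit2_loo (hypothetical names): it is the scaled standard deviation of the pointwise elpd differences across observations, and it knows nothing about how noisy those pointwise values are as Monte Carlo estimates. That second layer is what the "Monte Carlo SE of elpd_loo" line in the loo printout refers to.
# Minimal sketch of se_diff, assuming existing loo objects fit1_loo and
# fit2_loo (hypothetical names). Only the spread across the n observations
# enters here -- the first layer of uncertainty.
diff_i    <- fit2_loo$pointwise[, "elpd_loo"] - fit1_loo$pointwise[, "elpd_loo"]
elpd_diff <- sum(diff_i)
se_diff   <- sqrt(length(diff_i)) * sd(diff_i)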
In teaching myself how independent these two uncertainty layers are, I found the following mock-data simulation to be helpful: suppose we’re predicting the height of working-age adults in two similar countries, and we are especially interested in whether the two populations differ. Suppose that they actually do — by a tiny amount — and our sample size is not sufficient to identify the difference reliably. Here’s the mock data:
library(brms)   # loo() methods for the fitted models come with brms

N <- 400 # set an insufficient sample size
set.seed(340) # for reproducibility

# Create "true" populations:
finland <- data.frame(sexF = rep(c(0, 1), times = c(2765962, 2868214)))
denmark <- data.frame(sexF = rep(c(0, 1), times = c(2953238, 2988150)))

# Generate y from the "true" data-generating process:
finland$height <- with(finland, rnorm(nrow(finland), mean = 180 - sexF*15, sd = 6))
denmark$height <- with(denmark, rnorm(nrow(denmark), mean = 181 - sexF*15, sd = 6))

# Collect an inadequate sample from the "true" populations
# ("0Finland" sorts before "Denmark", keeping Finland as the reference level):
Sample <- rbind(cbind(country = "0Finland", finland)[sample(nrow(finland), N/2), ],
                cbind(country = "Denmark",  denmark)[sample(nrow(denmark), N/2), ])
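As a quick back-of-envelope check of why this N deserves the label “insufficient” (this step is my addition, not part of the original workflow): with sd = 6 and N/2 = 200 observations per country, the standard error of a difference in two sample means is comparable to the true 1 cm country effect.
# Back-of-envelope check (my addition): the SE of a difference in two means
# at this sample size is ~0.6 cm, against a true country effect of 1 cm.
6 * sqrt(1/(N/2) + 1/(N/2))   # ~0.6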
We use brms to fit two simple linear models, one excluding the “weak but real” country predictor and one including it. We deliberately use a number of post-warmup draws that is sufficient for convergence but not sufficient to drive the Monte Carlo SE of elpd_loo (MCSE_ELPDLOO) to zero:
m1_500.1 <- brm(height ~ sexF, data = Sample, seed = 1, chains = 1,
                warmup = 1000, iter = 1000 + 500, refresh = 0, silent = 2,
                prior = prior("", class = Intercept) + prior("", class = b))
(m2_500.1 <- update(m1_500.1, ~ . + country, newdata = Sample, seed = 1))
Family: gaussian
Links: mu = identity; sigma = identity
Formula: height ~ sexF + country
Data: Sample (Number of observations: 400)
Draws: 1 chains, each with iter = 1500; warmup = 1000; thin = 1;
total post-warmup draws = 500
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 179.19 0.50 178.22 180.16 1.00 442 331
sexF -14.37 0.56 -15.39 -13.30 1.00 531 409
countryDenmark 0.86 0.56 -0.26 1.94 1.00 527 397
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 5.94 0.20 5.56 6.34 1.00 574 438
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
When we compute elpd_loo for each model, note that its Monte Carlo SE is reported as small but NOT zero:
m1_500.1.elpd <- loo(m1_500.1); (m2_500.1.elpd <- loo(m2_500.1))
Computed from 500 by 400 log-likelihood matrix
Estimate SE
elpd_loo -1281.6 13.6
p_loo 3.6 0.3
looic 2563.2 27.2
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
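If you want that MCSE as a number rather than reading it off the printout, the loo package exposes it, if I recall the accessors correctly, both as a summary via mcse_loo() and pointwise in the $pointwise matrix:
# Pulling the Monte Carlo SE out programmatically (accessor names as I
# recall them from the loo package):
loo::mcse_loo(m1_500.1.elpd)                       # overall MCSE of elpd_loo
head(m1_500.1.elpd$pointwise[, "mcse_elpd_loo"])   # per-observation MCSEs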
So, what can loo_compare() tell us about the relative predictive utility of these models?
as.data.frame(loo_compare(m1_500.1.elpd, m2_500.1.elpd)[,1:4])
elpd_diff se_diff elpd_loo se_elpd_loo
m2_500.1 0.0000000 0.000000 -1281.582 13.58275
m1_500.1 -0.4358444 1.483557 -1282.018 13.52415
The difference reported by PSIS-LOO is negligible, but the ranking is correct. How reliably is the correct ranking reproduced in new pairwise comparisons obtained using different seeds? (I added the NA rows to separate the individual pairs for readability)
for (newseed in 2:10) {
  modnames <- paste0(c("m1_500.", "m2_500."), newseed)
  assign(modnames[1], update(m1_500.1, seed = newseed))
  tmp.a <- loo(get(modnames[1])); attributes(tmp.a)$model_name <- modnames[1]
  assign(modnames[2], update(m2_500.1, seed = newseed))
  tmp.b <- loo(get(modnames[2])); attributes(tmp.b)$model_name <- modnames[2]
  assign(paste0(modnames[1], ".elpd"), tmp.a)
  assign(paste0(modnames[2], ".elpd"), tmp.b)
}
(comparisons.nsim500 <- do.call(rbind, lapply(1:10, function(x) rbind(as.matrix(loo_compare(get(paste0("m1_500.",x,".elpd")), get(paste0("m2_500.",x,".elpd"))))[,1:4], " " = NA))))
elpd_diff se_diff elpd_loo se_elpd_loo
m2_500.1 0.00000000 0.000000 -1281.582 13.58275
m1_500.1 -0.43584439 1.483557 -1282.018 13.52415
NA NA NA NA
m1_500.2 0.00000000 0.000000 -1281.860 13.67334
m2_500.2 -0.01745390 1.398704 -1281.877 13.51334
NA NA NA NA
m1_500.3 0.00000000 0.000000 -1281.899 13.62972
m2_500.3 -0.10765674 1.364659 -1282.007 13.62807
NA NA NA NA
m1_500.4 0.00000000 0.000000 -1281.963 13.58248
m2_500.4 -0.05271535 1.398228 -1282.016 13.60424
NA NA NA NA
m2_500.5 0.00000000 0.000000 -1281.810 13.62903
m1_500.5 -0.11476526 1.415954 -1281.925 13.52973
NA NA NA NA
m2_500.6 0.00000000 0.000000 -1282.065 13.52476
m1_500.6 -0.07838442 1.461803 -1282.143 13.71377
NA NA NA NA
m2_500.7 0.00000000 0.000000 -1281.787 13.58247
m1_500.7 -0.05540648 1.400752 -1281.842 13.63035
NA NA NA NA
m2_500.8 0.00000000 0.000000 -1281.947 13.54562
m1_500.8 -0.30879311 1.442163 -1282.255 13.63053
NA NA NA NA
m1_500.9 0.00000000 0.000000 -1281.862 13.57227
m2_500.9 -0.01564121 1.380227 -1281.878 13.60491
NA NA NA NA
m1_500.10 0.00000000 0.000000 -1282.043 13.58460
m2_500.10 -0.01107914 1.398408 -1282.054 13.57277
NA NA NA NA
Wow, it looks like the correctness of the initial comparison was pure luck! The new comparisons rank the models wrong in 5 out of 9 cases, and the reported differences are even smaller than before. This is fully concordant with the notion that such small differences are indistinguishable from noise. Notice, however, that the differences do tend to be larger whenever the correct model is winning. So while the comparisons are hopelessly noisy, a tiny bit of signal can nonetheless be discerned, at least in an artificial setting where we know what “signal” to look for.
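These tallies are easy to check directly from the comparison matrix above; the little summary below (the helper objects runner_up and m2_won are mine) counts how often the correct model ends up on top and compares the winning margins:
# Quick summary of the ten small-fit comparisons: how often is the correct
# model (m2) on top, and how large is the margin in each case?
diffs     <- comparisons.nsim500[, "elpd_diff"]
runner_up <- comparisons.nsim500[!is.na(diffs) & diffs < 0, , drop = FALSE]
m2_won    <- grepl("^m1_", rownames(runner_up))      # m1 in second place => m2 won
table(m2_won)                                        # TRUE for 5 of the 10 pairs
tapply(abs(runner_up[, "elpd_diff"]), m2_won, mean)  # margins larger when m2 wins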
But given our knowledge that m2 is, indeed, the correct model, is it possible to reproduce the correct ranking with any reliability without collecting more data? Let’s see what happens when we refit the two models to the same inadequate sample 10 more times, now with 400 times more posterior draws.
for (Seed in 1:10) {
  modnames <- paste0(c("m1_200k.", "m2_200k."), Seed)
  assign(modnames[1], update(m1_500.1, seed = Seed,
                             warmup = 1000, iter = 1000 + 200000))
  tmp.a <- loo(get(modnames[1])); attributes(tmp.a)$model_name <- modnames[1]
  assign(modnames[2], update(m2_500.1, seed = Seed,
                             warmup = 1000, iter = 1000 + 200000))
  tmp.b <- loo(get(modnames[2])); attributes(tmp.b)$model_name <- modnames[2]
  assign(paste0(modnames[1], ".elpd"), tmp.a)
  assign(paste0(modnames[2], ".elpd"), tmp.b)
  rm(list = modnames)  # drop the huge brmsfit objects, keep only the loo objects
}
m1_200k.2.elpd
Computed from 200000 by 400 log-likelihood matrix
Estimate SE
elpd_loo -1281.9 13.6
p_loo 2.9 0.3
looic 2563.9 27.2
------
Monte Carlo SE of elpd_loo is 0.0.
<snip>
At the very least, MCSE_ELPDLOO is now estimated to be zero. How do the pairwise comparisons between these new, gigantic fits look?
(comparisons.nsim200k <- do.call(rbind, lapply(1:10, function(x) rbind(as.matrix(loo_compare(get(paste0("m1_200k.",x,".elpd")), get(paste0("m2_200k.",x,".elpd"))))[,1:4], " " = NA))))
elpd_diff se_diff elpd_loo se_elpd_loo
m2_200k.1 0.00000000 0.000000 -1281.895 13.55837
m1_200k.1 -0.04682957 1.447243 -1281.942 13.60188
NA NA NA NA
m2_200k.2 0.00000000 0.000000 -1281.913 13.56032
m1_200k.2 -0.02713931 1.441532 -1281.941 13.60715
NA NA NA NA
m2_200k.3 0.00000000 0.000000 -1281.906 13.55734
m1_200k.3 -0.03195478 1.442552 -1281.938 13.60323
NA NA NA NA
m2_200k.4 0.00000000 0.000000 -1281.905 13.56055
m1_200k.4 -0.03989458 1.440435 -1281.944 13.60271
NA NA NA NA
m2_200k.5 0.00000000 0.000000 -1281.905 13.55446
m1_200k.5 -0.04431128 1.440652 -1281.949 13.60623
NA NA NA NA
m2_200k.6 0.00000000 0.000000 -1281.920 13.55854
m1_200k.6 -0.01312898 1.435604 -1281.933 13.60100
NA NA NA NA
m2_200k.7 0.00000000 0.000000 -1281.894 13.55909
m1_200k.7 -0.04446383 1.444024 -1281.939 13.60896
NA NA NA NA
m2_200k.8 0.00000000 0.000000 -1281.913 13.55801
m1_200k.8 -0.03009314 1.443213 -1281.943 13.60354
NA NA NA NA
m2_200k.9 0.00000000 0.000000 -1281.909 13.56150
m1_200k.9 -0.03680435 1.442428 -1281.945 13.60413
NA NA NA NA
m2_200k.10 0.00000000 0.000000 -1281.911 13.56087
m1_200k.10 -0.03440431 1.446022 -1281.945 13.60277
NA NA NA NA
The elpd_diffs are still reported to be negligible; in the aggregate, they are even smaller than in the first set of comparisons. The average of the reported se_diffs doesn’t improve either: in fact it gets slightly worse, the new average being 1.442 as opposed to 1.414 earlier (EDIT: earlier drafts had ridiculous miscalculations here). So, as Aki says, increasing the number of posterior draws won’t help reduce se_diff.
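For the record, those averages come from the runner-up rows of the two comparison matrices; the zero rows belong to the winning models and the NA separator rows carry no information, so both are filtered out:
# Reproducing the quoted averages of se_diff (zero rows = winners,
# NA rows = separators; both excluded):
sd500  <- comparisons.nsim500[, "se_diff"]
sd200k <- comparisons.nsim200k[, "se_diff"]
mean(sd500[sd500 > 0], na.rm = TRUE)    # ~1.414
mean(sd200k[sd200k > 0], na.rm = TRUE)  # ~1.442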
Yet something remarkable was indeed gained: despite the minuscule magnitude of the differences, the models are now ranked correctly every single time, and this happens even though the se_diffs actually increased. What a revelation! Even to the mathematically challenged, this illustrates that it really is better to present a single comparison between two huge fits than multiple comparisons between repeated small fits.
It sure would be nice to have a formula for estimating how much elpd_diff is bound to vary by seed as a result of the respective Monte Carlo SEs of the two loo objects. With complex multilevel models, obtaining posterior draws in the hundreds of thousands is extremely time-consuming, and the resulting matrices of parameter values or pointwise log-likelihoods are so huge that running out of RAM becomes a frequent problem.
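For what it’s worth, here is the crude back-of-envelope guess I currently fall back on: it just combines the two models’ MCSEs in quadrature, which, per the point at the very top of this post, is only defensible if their Monte Carlo errors are roughly uncorrelated, so treat it as a rough ballpark rather than an established formula:
# Crude guess (my own assumption, NOT an established result): if the MC
# errors of the two loo estimates were uncorrelated, the seed-to-seed SD
# of elpd_diff would be roughly their quadrature sum.
mcse1 <- loo::mcse_loo(m1_500.1.elpd)
mcse2 <- loo::mcse_loo(m2_500.1.elpd)
sqrt(mcse1^2 + mcse2^2)   # compare with the spread seen across seeds above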