Though the thread is solved, I’m trying to clarify my thoughts some more, and it seems better to continue here rather than starting a new thread.
I’d like to formulate the “models reduce variance” idea as succinctly as possible without becoming vague.
Could we truthfully state the following: LOO-CV has high variance because we are comparing N “point responses” y_i to N “point predictions” p_{\text{M}_k}(y_i|y_{-i}) and having to estimate the variance of the resulting \widehat{\text{elpd}}_{\text{LOO},i} scores empirically, whereas with a trustworthy reference model we compare N distributions with a known shape (and hence variance) to N other distributions with a known shape (and hence variance), thereby not having to estimate a separate variance empirically from the “point” quantities?
Also: For Bernoulli models, is \text{log}~p_{\text{M}_k}(y_i|y_{-i}) the same thing as
dbinom(D$y[i,], 1, prob = brms::fitted(update(Mk, newdata = D[-i,]), newdata = D[i,])[,1], log = TRUE)
??
I tested both a manual LOO script and kfold(Mk, folds = "loo") on fake Bernoulli data and the results were almost but not exactly identical, so I’m wondering whether it’s because of simulation variability or because I’m computing the wrong thing.
BIG EDIT: Maybe it’s as simple as this: When there’s no distributional assumption, ANY variability in \text{log}~p_{\text{M}_k}(y_i|y_{-i}) across observations will increase the “uncertainty” of \widehat{\text{elpd}}_{\text{LOO}}. The reason being: without a distributional assumption, \widehat{\text{elpd}}_{\text{LOO}} doesn’t know what degree of variability is “expected” and what degree of variability is “unexpected”, so it simply adds all the pointwise variances together to form an aggregate measure. OTOH if the analyst happens to know what the conditional distribution of y is, then (s)he knows better than loo() whether se_elpd_loo or se_diff truly is “large”, or merely what you’d normally expect from this distribution.
(everyone’s input is welcome, not just avehtari’s)