I was wondering whether it is possible to derive or estimate credible intervals for the LOO/WAIC ELPD difference between models.
Symmetric frequentist 1.96*SE confidence intervals don’t quite seem appropriate when comparing Bayesian models!
Given a Gaussian model of the differences, a weakly informative prior, and n > 20, the sufficient statistics for the Bayesian posterior of the expected difference are practically the same as the frequentist standard error. Even if the distribution of the differences is not Gaussian, in most cases with large n the distribution of the expected difference is close to Gaussian (CLT). Thus you can interpret the SE given by the loo package as a measure of Bayesian posterior uncertainty.
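To make the Gaussian approximation concrete, here is a minimal sketch in Python (not the loo package itself). It assumes you already have pointwise elpd_i values for two models; the arrays `elpd_a` and `elpd_b` below are simulated placeholders, and the function name `elpd_diff_normal` is made up for illustration. The SE is computed as sqrt(n) times the sample SD of the pointwise differences, which to my understanding is what the loo package reports.

```python
import numpy as np

def elpd_diff_normal(elpd_a, elpd_b, z=1.96):
    """Gaussian approximation for the sum of pointwise elpd differences.

    Returns (diff, se, (lower, upper)), where the interval is diff +/- z * se.
    """
    d = np.asarray(elpd_a, dtype=float) - np.asarray(elpd_b, dtype=float)
    n = d.size
    diff = d.sum()                      # elpd_diff: sum of pointwise differences
    se = np.sqrt(n) * d.std(ddof=1)     # SE of the sum: sqrt(n) * sample SD
    return diff, se, (diff - z * se, diff + z * se)

# Simulated pointwise elpd values (placeholders, not real results).
rng = np.random.default_rng(1)
elpd_a = rng.normal(-1.0, 0.5, size=100)
elpd_b = elpd_a - rng.normal(0.1, 0.3, size=100)

diff, se, interval = elpd_diff_normal(elpd_a, elpd_b)
print(f"elpd_diff = {diff:.2f}, SE = {se:.2f}, "
      f"approx. 95% interval = ({interval[0]:.2f}, {interval[1]:.2f})")
```

Under the assumptions above (Gaussian model, weakly informative prior, n > 20), the resulting interval can be read as an approximate posterior interval for the expected difference.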
It would be possible to use other models, too, and I’ve also used a non-parametric Dirichlet distribution model (aka Bayesian bootstrap), but in many cases the results do not differ much compared to other sources of error. Whatever the interpretation is, there is a complication: the pointwise elpd_i’s are not independent, and it’s difficult to model that dependency, so the uncertainty estimates are not perfectly calibrated. See also my comments in the thread Interpreting elpd_diff - loo package
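For reference, the Bayesian bootstrap mentioned above can be sketched as follows. This is an illustrative Python version, not code from the loo package: the function name `bayesian_bootstrap_sum` and the simulated differences `d` are assumptions. Each draw takes Dirichlet(1, ..., 1) weights over the n observations and computes n times the weighted mean of the pointwise differences. Note that it still treats the pointwise differences as exchangeable, so it does not address the dependency issue.

```python
import numpy as np

def bayesian_bootstrap_sum(d, draws=4000, seed=0):
    """Posterior draws for the sum of pointwise differences via Bayesian bootstrap."""
    d = np.asarray(d, dtype=float)
    rng = np.random.default_rng(seed)
    # One row of Dirichlet(1, ..., 1) weights per draw; shape (draws, n).
    w = rng.dirichlet(np.ones(d.size), size=draws)
    # n * weighted mean puts each draw on the scale of elpd_diff (a sum).
    return d.size * (w @ d)

# Simulated, skewed pointwise elpd differences (placeholders, not real results).
rng = np.random.default_rng(2)
d = rng.gamma(2.0, 0.3, size=50) - 0.5

samples = bayesian_bootstrap_sum(d)
lower, upper = np.quantile(samples, [0.025, 0.975])
print(f"elpd_diff = {d.sum():.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

Unlike the symmetric Gaussian interval, the quantile interval here can be asymmetric when the pointwise differences are skewed.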
It seems possible to improve on the simplest Gaussian model of the differences, but with small to moderate n, the problem of accurately modeling the tails of the difference distribution remains (and with large enough n, the Gaussian model works well enough).
Note that we could get a well-calibrated estimate of the distribution of the differences if we knew the true model (which is also the assumption made in frequentist hypothesis testing). We use cross-validation specifically in those cases where we don’t know the true model, suspect that the models we have might be quite far from the true model, and value robustness under model misspecification.