Interpreting elpd_diff - loo package

The question is straightforward, but unfortunately the answer is not.

Short answer: a difference of 1 SE is definitely too small. If n is not small, there is no serious observation model misspecification (i.e., you have done model checking), and there are no khats > 0.7, then as a rule of thumb I would say a difference of 5 SE starts to be on the safe side.

A bit longer answer: First, we need all PSIS khats < 0.7 so that the Monte Carlo error does not dominate (in a forthcoming version of the loo package, we'll also provide an estimate of this Monte Carlo error). Second, we know that the SE estimate for a single model is optimistic when n is small and when the model is misspecified (Grandvalet and Bengio, 2004). Grandvalet and Bengio (2004) show theoretically that the true SE is less than 2 times the estimate. There is no similar result for model comparison, but we could assume it would be similar (we are researching this). The problem is further complicated because the uncertainty in the comparison is not necessarily well described by a normal distribution with some SE; especially for small n it would be better to take skewness and kurtosis into account, but that is not easy. We are researching ways to improve the SE estimate and the calibration of loo estimates. While you wait for new research results (and a better reference to cite), I would suggest using 5 x SE, where I picked 5 as 2 x 2.5: 2.5 corresponds roughly to a 99% interval, and 2 is the upper limit on the error given by Grandvalet and Bengio (2004).

Instead of the difference and SE, you could also compute Bayesian stacking weights ([1704.02030] Using stacking to average Bayesian predictive distributions, soon available in the loo package); if the weight of a model is 0, it is worse than the models with positive weight.
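The stacking objective is to choose simplex weights w maximizing sum_i log sum_k w_k exp(lpd_ik), where lpd_ik is the pointwise log predictive density (e.g. pointwise elpd_loo) of model k for observation i. Since this is the log likelihood of a mixture with fixed components, a simple EM-style update gives a rough numerical sketch (illustrative only; the loo package's own solver is different):

```python
import numpy as np

def stacking_weights(lpd, n_iter=1000):
    """Stacking weights from an (n_obs, n_models) matrix of pointwise
    log predictive densities, via EM-style mixture-weight updates.
    A sketch of the objective in the stacking paper, not loo's solver."""
    lpd = np.asarray(lpd, dtype=float)
    p = np.exp(lpd - lpd.max(axis=1, keepdims=True))  # stabilized densities
    n, k = p.shape
    w = np.full(k, 1.0 / k)                # start from uniform weights
    for _ in range(n_iter):
        r = p * w                          # per-point responsibilities
        r /= r.sum(axis=1, keepdims=True)  # normalize over models
        w = r.mean(axis=0)                 # EM update for mixture weights
    return w
```

With this in hand, a model whose weight ends up at (numerically) zero contributes nothing to the stacked predictive distribution, which is the sense in which it is worse than the models with positive weight.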