Loo comparison in reference to standard error




I saw this in the loo vignette on comparing models: “The difference in ELPD is larger than twice the estimated standard error”

Is the “more than 2 SE” guideline based on the approximate 95% interval of the normal distribution, or is there another reason for this guideline that I am missing? Are there other recommended guidelines when comparing models with loo and waic?

Thank you for the great work


Good question. Yes, when well behaved, the elpd_diff is approximately normal. The standard error of the difference is also only approximate, though, so we should probably be recommending more than 2SE to be safe. There’s no “correct” number of standard errors to look for. I think I’ve even seen @avehtari recommend as many as 5SE somewhere!
Aki, is that your general recommendation now?

In one of the vignettes we say

The difference in ELPD is much larger than twice the estimated standard error again indicating that the negative-binomial model is expected to have better predictive performance than the Poisson model.

but it’s not so clear from that text that we are emphasizing the “much larger than”.

We should go through our documentation and make sure there’s more clarity on this issue.
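
To make the quantities in this thread concrete, here is a minimal sketch of how a difference in ELPD and its standard error can be computed from per-observation values. It assumes you already have pointwise elpd contributions for both models on the same observations; the function name and the numbers are hypothetical and this is not the loo package's own code. The SE is the usual estimate for a sum of n terms: sqrt(n) times the sample standard deviation of the pointwise differences.

```python
import math
import statistics


def elpd_diff_and_se(pointwise_a, pointwise_b):
    """Return (elpd_diff, se_diff) for model B relative to model A.

    Both arguments are hypothetical per-observation elpd contributions,
    one value per data point, in the same order for both models.
    """
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("both models must be evaluated on the same observations")
    diffs = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    n = len(diffs)
    elpd_diff = sum(diffs)
    # SE of a sum of n terms: sqrt(n) times the sample standard deviation.
    se_diff = math.sqrt(n) * statistics.stdev(diffs)
    return elpd_diff, se_diff


# Made-up pointwise values, for illustration only.
model_a = [-1.2, -0.8, -1.5, -0.9, -1.1]
model_b = [-1.0, -0.9, -1.1, -0.7, -1.0]
diff, se = elpd_diff_and_se(model_a, model_b)
print(f"elpd_diff = {diff:.2f}, se_diff = {se:.2f}, ratio = {diff / se:.2f}")
```

With only five observations the normal approximation would of course be poor; the point is only to show which numbers go into the ratio.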


The problem is that the current SE estimate is optimistic, and the theory says that in the worst case the true SE can be as much as twice the estimate, so based on that, a threshold of 4SE using the current estimate is safe. We are working on a better SE estimate, but it will take some time to test that it really works.
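
The arithmetic behind the 4SE figure can be sketched as follows: if the true SE may be up to twice the estimate, then requiring 2 worst-case standard errors is the same as requiring 4 estimated ones. The function name, the `se_inflation` parameter, and the example numbers are assumptions made for illustration.

```python
def exceeds_threshold(elpd_diff, se_estimate, multiplier=2.0, se_inflation=2.0):
    """True if |elpd_diff| exceeds `multiplier` worst-case standard errors.

    `se_inflation` is the assumed worst-case ratio of the true SE to the
    estimated SE (2.0 here, following the worst case described above).
    """
    worst_case_se = se_inflation * se_estimate
    return abs(elpd_diff) > multiplier * worst_case_se


# Hypothetical numbers: a difference of 10 with an estimated SE of 3.
print(exceeds_threshold(10.0, 3.0, se_inflation=1.0))  # plain 2SE rule: 10 > 6
print(exceeds_threshold(10.0, 3.0))                    # effective 4SE rule: 10 > 12
```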


Yeah. I think we can be more explicit about that in the vignettes and maybe other places in the doc. I think all we have right now is in the doc for compare. I’ll open an issue for this.



This is very useful to know. I assume the same happens with the SE for the WAIC comparison, right?

I was planning to do a simulation study of model comparison in Bayesian SEM, using the elpd_diff/SE ratio as the criterion. Would it still be useful to work on something like this with the optimistic SE, or is it better to wait for a better SE?

Thank you



It would be useful for understanding the performance with the current SE estimate.
How would you use that ratio?


Sounds good. I am planning to look at the rate of selecting the correct model at different levels of SE, for example at 1SE, 2SE, 3SE, 4SE, and 5SE. Originally I was planning to stop at 3SE, but given this information I might go up to 5SE. I would do this for both the LOO and WAIC comparisons.

The idea is that if the ratio for the model comparison passes the level, we say that the change is large enough to consider it a “better” model.

I also plan to look at the approximate log Bayes factor as a comparison method.

Another idea was to look at the ROC curve to find the ratio with the best sensitivity and specificity for model selection.

I am planning to compare this to the standard practice under maximum likelihood, which is the likelihood ratio test.

You said that it will take you some time to develop and test a better SE. Do you have some kind of timeline for this, just so I can take it into consideration for this type of project?
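
A toy version of the simulation described above could look like the sketch below. It does not fit any models: it draws pointwise elpd differences as independent normals, which is a strong simplifying assumption made only for illustration, and then records how often the elpd_diff/SE ratio passes each level. All names and parameter values are hypothetical.

```python
import math
import random
import statistics


def selection_rates(mean_diff, sd_diff, n_obs, levels=(1, 2, 3, 4, 5),
                    n_reps=500, seed=1):
    """Fraction of simulated data sets whose elpd_diff / SE exceeds each level.

    Pointwise differences are simulated as independent normal draws; real
    pointwise elpd differences can be skewed and contain outliers.
    """
    if n_obs < 2 or sd_diff <= 0:
        raise ValueError("need n_obs >= 2 and sd_diff > 0")
    rng = random.Random(seed)
    counts = {k: 0 for k in levels}
    for _ in range(n_reps):
        diffs = [rng.gauss(mean_diff, sd_diff) for _ in range(n_obs)]
        elpd_diff = sum(diffs)
        se_diff = math.sqrt(n_obs) * statistics.stdev(diffs)
        ratio = elpd_diff / se_diff
        for k in levels:
            if ratio > k:
                counts[k] += 1
    return {k: counts[k] / n_reps for k in levels}


rates = selection_rates(mean_diff=0.05, sd_diff=0.5, n_obs=200)
for level, rate in rates.items():
    print(f"{level}SE: better model flagged in {rate:.1%} of replications")
```

Raising the level can only lower the rate at which the better model is flagged, so the output makes the trade-off between false alarms and missed differences visible for a given effect size and n.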

Thank you


I’m not a fan of trying to select the correct model, as in real life the real model often includes effects which are too small to be well identified, and then it’s not about selecting the true model, but about selecting something you can estimate well enough to be useful.

LOO and WAIC have high variance, which is a problem if you try to detect small differences, which happens if you are comparing models which are quite similar to each other. See http://link.springer.com/article/10.1007/s11222-016-9649-y and https://github.com/avehtari/modelselection_tutorial. Using a higher multiplier for the SE just makes it more likely that you get stuck with your baseline.

The Bayes factor can be prior sensitive, and it is not the approach with the smallest variance; see http://link.springer.com/article/10.1007/s11222-016-9649-y

As for using ROC to find the best ratio: there’s a problem that the bias of the SE depends on n and “outliers”, and thus there can’t be any “best ratio”.

Summary: LOO (and WAIC, but since WAIC is more difficult to diagnose, I don’t recommend it) is ok for model comparison when there is a small number of comparisons. LOO (and WAIC, but…) can reliably detect only relatively big differences in the predictive performance.

As for the timeline for the better SE estimate: after summer. This will make the SE better calibrated, but it doesn’t solve the problem of relatively high SE.


Here’s some additional clarification (hopefully). Assume first that the SE estimates were calibrated. Choosing the stopping rule based on 1, 2, 3, or 4 SEs changes the balance between bias and variance. With a stricter stopping rule you tend to choose smaller models (assuming you favor smaller models), which do not include all the relevant variables and thus may have less than optimal predictive performance. With a less strict stopping rule you tend to choose larger models, which may include irrelevant variables, and due to the high variance of CV (and WAIC) the selection process overfits: although you think you are finding better models, the predictive performance on an independent test set gets worse.

“Bayesian predictive methods for model selection” shows examples of search paths in variable selection, and the stopping rule based on 1, 2, 3, or 4 SEs would then modify where to stop. Which level from 1 to 4 gives you the highest proportion of true models selected depends on the shape and the variance of the performance curve. If that curve has a sharp elbow, then a strict rule favoring small models works well, but if the elbow is not sharp and at the same time the variance is not too big, then a less strict rule would likely work better. This all means that you have to be very careful in designing the experiment, so that the bias in the selection doesn’t favor a specific setting which is not realistic compared to the real problems you are interested in.

Then assume the SE estimates are not calibrated and the bias depends on the data and the model. This means that the experimental results are even less transferable to other problems, because with a strong bias you should use a larger multiplier. We can improve the calibration of the SEs, which helps, but it doesn’t remove the other problems above.

However, I recommend running the experiments as you had planned, as that is a great way to learn more about this problem.
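
One possible form of the stopping rule discussed above can be sketched as follows: walk along a search path of models of increasing size and stop at the smallest model whose shortfall in elpd relative to a reference model is within a chosen number of SEs. This is only an illustration of how the multiplier moves the stopping point; it is not the procedure from the cited paper, and the path values are made up.

```python
def pick_model_size(path, multiplier):
    """Return the size of the smallest model within `multiplier` SEs of the reference.

    `path` is a list of (n_variables, shortfall, se_diff) tuples sorted by
    model size, where `shortfall` is the reference elpd minus the
    candidate's elpd (so larger means worse). If no candidate qualifies,
    the largest model on the path is returned.
    """
    for n_variables, shortfall, se_diff in path:
        if shortfall <= multiplier * se_diff:
            return n_variables
    return path[-1][0]


# Hypothetical search path; the last entry is the reference model itself.
path = [(0, 40.0, 8.0), (1, 20.0, 6.0), (2, 9.0, 4.0), (3, 3.0, 2.5), (4, 0.0, 0.0)]
for k in (1, 2, 3, 4):
    print(f"{k}SE rule stops at {pick_model_size(path, k)} variables")
```

On this made-up path the stricter rules (larger multipliers) stop at smaller models, which is the bias side of the trade-off; whether that helps or hurts depends on how sharp the elbow of the curve is.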



This is very helpful. I agree that in applied research there is never a “correct” model, only useful models. My intention with this simulation study is to start giving some guidelines on how to quantify the magnitude of change in LOO based on the SE. I will look to include the characteristics you mention as conditions in the simulations.

I am aware of the prior sensitivity of the Bayes factor; honestly, I don’t like it for model comparison. But it is still what I see mentioned and recommended most often. By adding it I expect to show that LOO is a better method.

Appreciate the references, these will be very helpful.

Thank you very much