Model checking & comparison using loo vs loo_compare

“Better” is a very relative term, but I’m assuming from the general context of this question that you want to identify the model with the “best” overall fit to the data while balancing fit against parsimony (i.e., guarding against overfitting). The very short answer is that, yes, this information is sufficient to make a decision; however, the results you provide here indicate that the comparison shouldn’t be trusted.

Before getting to why the comparison isn’t currently valid, we can start with what the results would imply if this weren’t a concern. The first thing to note is that the output of loo() gives you three different estimates, each with a standard error: elpd_loo, the expected log pointwise predictive density from the approximate leave-one-out cross-validation; p_loo, the effective number of parameters; and looic = -2*elpd_loo, the LOO information criterion. Since most people come to these analyses familiar with information criteria like AIC and BIC, the LOOIC value is usually the easiest to look at because it is interpreted in largely the same way (namely: smaller = better). Going just off of the results from the loo() call, it would seem that model2 produces the “better” fit since 244.3 < 247.3.
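If it helps to see where those numbers come from, here is a minimal sketch. The data and formulas are stand-ins (mtcars rather than your actual data, since you didn’t post your models), but the workflow is the same:

```r
# Minimal sketch with stand-in models -- mtcars and these formulas are
# placeholders, not the actual data or models from the question.
library(rstanarm)

model1 <- stan_glm(mpg ~ wt, data = mtcars, refresh = 0)       # simpler model
model2 <- stan_glm(mpg ~ wt + hp, data = mtcars, refresh = 0)  # more complex model

loo1 <- loo(model1)
loo2 <- loo(model2)

print(loo1)  # reports elpd_loo, p_loo, and looic, each with a standard error
print(loo2)
```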

Now, one of the perks of Bayesian analysis is that you get a lot of valuable information from the posterior. The simple conclusion above answers the equally simple question “Is one model’s LOOIC smaller than the other’s?” A more interesting and likely more valuable question is “Is one model’s LOOIC notably smaller than the other’s?” If you’d like to use standard null hypothesis testing language here, this can be framed as a question of statistical significance. To answer it, we need not just the difference between the two models’ LOOICs but also the standard error of that difference. Fortunately, the loo_compare() function handles this for you, though it’s really not too challenging to compute when you have the posterior draws.

One thing to notice about the results of the loo_compare() call is that it reports the difference in elpd, not in LOOIC. This isn’t a big deal because LOOIC = -2*elpd. The “best” fitting model always appears in the top row of loo_compare() output and always has a difference of 0.0 (since its elpd cannot differ from the best-fitting model’s elpd: they are one and the same). In your example, model2 fits best, and the elpd difference between model1 and model2 is -1.50 with a standard error of 0.6 (which matches the LOOIC difference of 247.3 - 244.3 = 3.0, since LOOIC = -2*elpd). We can use the standard error to judge whether this difference of -1.50 is large enough to call the difference in model fits notable from a statistical perspective. Adopting the conventional though very arbitrary p < 0.05 criterion, multiplying 0.6 by 1.96 (the z value corresponding to a two-tailed p < 0.05) gives +/- 1.176 as the cutoff for a “significant” difference here. Since |-1.50| > 1.176, the results of loo_compare() would suggest that model2 produces a “statistically significantly” better fit.
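Continuing the stand-in example above, the comparison and the rough cutoff calculation look something like this (the 1.96 multiplier is just the arbitrary convention described above):

```r
# Compare the two loo objects; the best-fitting model appears in the top row
comp <- loo_compare(loo1, loo2)
print(comp, simplify = FALSE)

# Rough "significance" check on the elpd difference (second row = the worse model)
elpd_diff <- comp[2, "elpd_diff"]
se_diff   <- comp[2, "se_diff"]
abs(elpd_diff) > 1.96 * se_diff  # TRUE would suggest a notable difference in fit
```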

Now, as for why your results shouldn’t be trusted: the Pareto k statistics from loo(model2) indicate that the LOO approximation itself is unreliable. It looks like you only have n = 42 observations, so actually doing an exact LOO, where the model is refit n times on n-1 observations, might not be too computationally demanding; in many cases, though, a true LOO cross-validation is impractical. What the loo package does to circumvent this is rely on importance sampling so that the true LOO result can be approximated from a single model fit (versus something like a K-fold scheme, where the data are segmented without respect to the importance of individual observations, leaving behind information that could make the approximation more accurate). When the importance sampling is valid, the loo() results are good approximations of the true LOO values; when it is not, the results cannot be trusted. The loo package’s implementation, Pareto smoothed importance sampling (PSIS), fits a generalized Pareto distribution to the tail of the importance weights, and the estimated shape parameter k flags observations where the approximation breaks down. In your case, there are 12 observations with elevated Pareto k statistics, so the results from a simple loo() call should not be treated as valid.
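You can inspect exactly which observations are problematic directly from the loo object. Something like the following (still using the stand-in fits from above; the 0.7 threshold is the commonly used rule of thumb):

```r
# Diagnostics for the model with questionable importance sampling
plot(loo2)                                      # Pareto k value for each observation
print(pareto_k_table(loo2))                     # counts of observations per k range
bad_obs <- pareto_k_ids(loo2, threshold = 0.7)  # indices of the high-k observations
print(bad_obs)
```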

There are usually a few options to troubleshoot a bad LOO result; a hedged sketch of the common ones follows below. I believe that both rstanarm and brms print informative warnings about how to proceed when a call to loo() gives an untrustworthy result, and providing additional details about which packages you’re using and/or your code would help others on the forum pin down the issue. Just from looking at these results, it seems like the additional complexity of model2 introduces flexibility that the relatively small sample size can’t support, leading to overfitting (maybe the model adds interactions?).
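As a sketch of what those remedies typically look like (argument names here are from rstanarm/loo and may differ slightly depending on your package and version):

```r
# Option 1 (rstanarm): refit the model exactly for just the problematic observations
loo2_refit <- loo(model2, k_threshold = 0.7)

# Option 2: skip the importance-sampling approximation entirely with K-fold CV
kf1 <- kfold(model1, K = 10)
kf2 <- kfold(model2, K = 10)
loo_compare(kf1, kf2)
```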

I tried to keep this overview fairly simple, but there is a lot more you can learn from the results of loo(). @avehtari is quite active on this forum and is behind much of the implementation of LOO within the Stan environment, so it’s worth familiarizing yourself with his publications on the subject as well as searching for his responses to similar questions here. In particular, I’d encourage you to take a look at this FAQ and this thread, where a lot of the additional information that can be obtained from LOO results is described, along with common causes of error and troubleshooting advice.

Hopefully this is helpful!