“Better” is a very relative term, but I’m assuming from the general context of this question that you want to identify a model with the “best” overall fit to the data while balancing parsimony against overfitting. The very short answer is that, yes, this information is sufficient to make a decision; however, the results you provide here indicate that the comparison shouldn’t be trusted.

Before getting to why the comparison isn’t currently valid, we can start with what the results would imply if this weren’t a concern. The first thing to note is that the output of `loo()` gives you three different estimates plus standard errors for each: `elpd_loo` = the expected log pointwise predictive density from the approximate leave-one-out cross-validation, `p_loo` = the effective number of parameters, and `looic` = -2 * `elpd_loo`, which puts the result on the familiar deviance scale of an information criterion. Since most people come to these analyses familiar with information criteria like AIC and BIC, the LOOIC value is usually the easiest to look at since it is interpreted in largely the same way (namely: smaller = better). Just going off of the results from the `loo()` call, it would seem that `model2` produces the “better” fit since 244.3 < 247.3.
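Since LOOIC is just -2 times `elpd_loo`, converting between the two scales is simple arithmetic. Here is a minimal base-R sketch using the two LOOIC values from your output (the `elpd_loo` values below are implied by that relationship, not copied from your printout):

```r
# LOOIC values reported by loo() for the two models
looic <- c(model1 = 247.3, model2 = 244.3)

# LOOIC = -2 * elpd_loo, so elpd_loo = -looic / 2
elpd_loo <- -looic / 2
print(elpd_loo)  # model1 = -123.65, model2 = -122.15

# Smaller LOOIC (equivalently, larger elpd_loo) indicates better estimated fit
names(which.min(looic))  # "model2"
```
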

Now, one of the perks of Bayesian analysis is that you get a lot of valuable information from the posterior. The simple conclusion above answers the equally simple question “Is one model’s LOOIC smaller than the other’s?” A more interesting and likely more valuable question is “Is one model’s LOOIC notably smaller than the other’s?” If you’d like to adopt standard null hypothesis testing language here, then this can be considered a question of statistical significance. To answer it, we need to know not just the difference between the two models’ LOOICs but also the standard error of that difference. Fortunately, the `loo_compare()` function handles this for you, though it’s really not too challenging to compute when you have the posterior draws. One thing to notice about the results of the `loo_compare()` call is that it reports the difference in `elpd`, not `looic`. This isn’t a big deal because LOOIC = -2 * elpd. The “best” fitting model will appear in the top row whenever you use `loo_compare()` and will always show a difference of 0.0 (since its elpd is being compared to itself). Using your example, `model2` fits best, and the difference between `model2` and `model1` is -1.50 with a standard error of 0.6. We can use the standard error to judge whether this difference of -1.50 is sufficiently large to conclude that the difference in model fits is indeed notable from a statistical perspective. Adopting the conventional though arbitrary threshold of p < 0.05, multiplying 0.6 by 1.96 (the z value corresponding to the two-tailed p < 0.05 standard) gives ±1.176 as the cutoff for a “significant” difference here. Since |-1.50| > 1.176, the results of `loo_compare()` would suggest that `model2` produces a “statistically significantly” better fit.
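The back-of-the-envelope significance check described above is one line of arithmetic once you have the elpd difference and its standard error from `loo_compare()`. A minimal sketch using the numbers from your output:

```r
# Values from the loo_compare() output: difference in elpd and its SE
elpd_diff <- -1.50
se_diff   <- 0.6

# z statistic for the difference and the two-tailed cutoff at p < 0.05
z_stat <- elpd_diff / se_diff
cutoff <- 1.96 * se_diff

abs(elpd_diff) > cutoff   # TRUE: the difference clears the +/- 1.176 cutoff

# Equivalently, a two-tailed p value under a normal approximation
p_value <- 2 * pnorm(-abs(z_stat))
round(p_value, 3)         # 0.012
```

Keep in mind that this normal approximation is rough; the standard error of the elpd difference itself becomes less reliable when the number of observations is small or the difference is tiny.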

Now, as for why your results shouldn’t be trusted: the Pareto k statistics from `loo(model2)` indicate that the LOO approximation itself is unreliable. It seems like you only have n = 42 observations, so actually doing a true LOO, where the model is refit n times using n-1 observations each time, might not be too computationally challenging; in many cases, though, true LOO cross-validation is impractical. What the loo package does to circumvent this is Pareto smoothed importance sampling (PSIS): it reweights the existing posterior draws to approximate each leave-one-out posterior without any refitting, yielding a good approximation of the true LOO cross-validation result (versus something like a K-fold scheme, where the data are segmented without regard to the influence of individual observations, leaving behind information that could make the approximation more accurate). When the importance sampling is valid, the `loo()` results are good approximations of the true LOO values, but when it is not, the results cannot be trusted. By fitting a generalized Pareto distribution to the largest importance weights for each observation, the loo package can check whether this is an issue: the estimated shape parameter k flags problematic observations (conventionally, k > 0.7). In your case, there are 12 observations with elevated Pareto k statistics, and the results from a simple `loo()` call should not be treated as valid.
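If you want to see which observations are driving the problem, the loo package provides helpers such as `pareto_k_values()` and `pareto_k_table()` to extract the diagnostics from a `loo()` object. As a self-contained illustration (the k values below are made up for the example, not taken from your model), classifying a vector of k values against the conventional 0.7 threshold looks like this:

```r
# Hypothetical Pareto k values for 10 observations (illustrative only);
# in practice you would pull these from a loo object via loo::pareto_k_values()
k <- c(0.12, 0.45, 0.68, 0.83, 0.31, 1.10, 0.55, 0.74, 0.29, 0.66)

# Diagnostic bins along the lines of those reported by the loo package
bins <- cut(k,
            breaks = c(-Inf, 0.7, 1, Inf),
            labels = c("good (k <= 0.7)", "bad (0.7 < k <= 1)", "very bad (k > 1)"))
table(bins)

# Indices of observations whose leave-one-out estimate is unreliable
which(k > 0.7)   # observations 4, 6, and 8
```

(Newer versions of the loo package use a sample-size-dependent threshold that can be slightly below 0.7, so treat 0.7 as a rule of thumb rather than a hard constant.)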

There are usually a few options for troubleshooting a bad LOO result. I believe that both `rstanarm` and `brms` provide informative warnings about how to proceed when a call to `loo()` gives you an untrustworthy result (e.g., moment matching, refitting for the flagged observations, or falling back to K-fold cross-validation). Providing additional details about the packages you’re using and/or your code would help others on the forum clarify what the issue might be. Just from looking at these results, it seems like the additional complexity of `model2` produces overfitting from added flexibility that the relatively small sample size can’t overcome; maybe the model adds interactions?
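With only 42 observations, one of the more robust options is to skip the importance-sampling approximation entirely and do the brute-force refitting (in `brms`, `loo(fit, reloo = TRUE)` automates the refits for the flagged observations). The loop below is a toy sketch of exact LOO using simulated data and a plain `lm()` fit as a stand-in for a real Bayesian model, just to show the shape of the computation:

```r
set.seed(1)
n <- 42
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

# Exact leave-one-out: refit the model n times, each time holding out one row,
# and score the held-out observation under the refit model
loo_logdens <- numeric(n)
for (i in seq_len(n)) {
  fit   <- lm(y ~ x, data = dat[-i, ])
  mu    <- predict(fit, newdata = dat[i, , drop = FALSE])
  sigma <- summary(fit)$sigma
  loo_logdens[i] <- dnorm(dat$y[i], mean = mu, sd = sigma, log = TRUE)
}

elpd_loo_exact <- sum(loo_logdens)   # analogue of elpd_loo
looic_exact    <- -2 * elpd_loo_exact
```

For a real Stan model, the inner step would rerun the sampler on the n-1 remaining observations and average the held-out predictive density over the posterior draws, which is exactly the bookkeeping that `reloo = TRUE` handles for you.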

I tried to keep this overview really simple, but there is a lot more that you can learn from the results of `loo()`. @avehtari is quite active on this forum and is behind much of the implementation of LOO within the Stan environment, so you should certainly familiarize yourself with his publications on the subject as well as search for his responses to similar questions on this forum. In particular, I’d encourage you to take a look at this FAQ and this thread, where a lot of the additional information that can be obtained from LOO results is described, along with common causes of errors and troubleshooting advice.

Hopefully this is helpful!