Model checking & comparison using loo vs loo_compare

hello, I have fitted two models. One model nests another model. If I were doing Frequentist, I could compare them using likelihood ratio test.

The loo output from model1 is:

Computed from 4000 by 42 log-likelihood matrix

         Estimate   SE
elpd_loo   -123.7  9.0
p_loo         3.3  0.5
looic       247.3 17.9
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

The loo output from model2 is:

Computed from 4000 by 42 log-likelihood matrix

         Estimate   SE
elpd_loo   -122.2  9.2
p_loo        17.8  1.4
looic       244.3 18.5
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      5    11.9%   148       
 (0.5, 0.7]   (ok)       25    59.5%   92        
   (0.7, 1]   (bad)      12    28.6%   51        
   (1, Inf)   (very bad)  0     0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

The loo_compare output is:

       elpd_diff se_diff
model2  0.0       0.0   
model1 -1.5       0.6

Is there enough information to determine which model is better and how? Thanks.

“Better” is a very relative term, but I’m assuming from the general context of this question that you want to identify a model with the “best” overall fit to the data while balancing parsimony and overfitting. The very short answer is that, yes, this information is sufficient to make a decision; however, the results you provide here indicate that the comparison shouldn’t be trusted.

Before getting to why the comparison isn’t currently valid, we can start with what the results would imply if this weren’t a concern. The first thing to note is that output of loo() gives you three different estimates plus standard errors for each: elpd_loo = expected log-likelihood of the approximated leave-one-out cross-validation results, p_loo = number of effective parameters, and then looic = -2*elpd_loo or the information criterion based on the estimated log-likelihood function. Since most people come to these analyses familiar with information criteria like AIC and BIC, the LOOIC value is usually easiest to look at since it is interpreted in largely the same way (namely: smaller = better). Just going off of the results from the loo() call, it would seem that model2 produces the “better” fit since 244.3 < 247.3.

Now, one of the perks of Bayesian analysis is that you get a lot of valuable information within the posterior. The simple conclusion above answers the equally simple question “Is one model’s LOOIC smaller than the other?” A more interesting and likely valuable question would be “Is one model’s LOOIC notably smaller than the other?” If you’d like to take the standard null hypothesis testing language here, then this can be considered a question of statistical significance. To answer this question, we need to know not just the difference between the two models’ LOOICs but also the standard error of that difference. Fortunately, the loo_compare() function handles this for you, though it’s really not too challenging when you have the posterior draws. One thing to notice about the results of the loo_compare() call is that it gives you the difference in the elpd, not the looic. This isn’t a big deal because LOOIC = -2*elpd. The “best” fitting model will appear in the top row whenever you use loo_compare() and will always have a difference of 0.0 (since there is no difference in it’s elpd from the best fitting model’s elpd as they are one and the same). Using your example, model2 fits the best and the difference between model2 and model1 is -1.50 with a standard error of this difference = 0.6. We can think of the standard error as a way of determining whether this difference of -1.50 is sufficiently large as to conclude that the difference in model fits is indeed notable from a statistical perspective. Adopting the standard though very arbitrary standard of p < 0.05, multiplying 0.6 by 1.96 (the z value corresponding to the two-tailed p < 0.05 standard) gives +/- 1.176 as the cutoff for a “significant” difference here. Since |1.50| > |1.176|, the results of loo_compare() would suggest that model2 produces “statistically significant” better fit.

Now, as for why your results shouldn’t be trusted: the Pareto K statistics from loo(model2) indicate that the LOO approximation is not to be trusted. It seems like you only have n = 42 observations, so actually doing a real LOO where the model is fit iteratively using n-1 observations might not be too computationally challenging; however, in many cases, a true LOO cross-validation is impractical. What the LOO package does to circumvent this is to rely on importance sampling of cases in order to reduce the number of observations that need to be considered while still obtaining a good approximation of the true LOO cross-validation result (versus something like a K-folds paradigm where the data are segmented but without respect to the importance of individual observations, leaving information behind that could make the approximation more accurate). When the importance sampling procedure is valid, the loo() results are good approximations of the true LOO values, but when the importance sampling is not valid, then the results cannot be trusted. Utilizing a Pareto distribution, the LOO package’s implementation of the importance sampling is able to check whether this may or may not be an issue. In your case, there are 12 observations with elevated Pareto K statistics, and the results from a simple loo() call should not be treated as valid.

There are usually a few options to troubleshoot a bad LOO result. I believe that both rstanarm and brms provide informative warnings about how to proceed when a call to loo() gives you an untrustworthy result. Providing additional details about what packages you’re using and/or your code can help others on the forum clarify what might be the issue. Just from looking at these results, it seems like the additional complexity of model2 produces overfitting from additional flexibility that the relatively small sample size can’t overcome – maybe the model adds interactions?

I tried to keep this overview really simple, but there is a lot that you can learn from the results of loo(). @avehtari is quite active on this forum and is the one behind much of the implementation of LOO within the Stan environment, so you should certainly familiarize yourself with his publications on the subject as well as search for his responses to similar questions on this forum. In particular, I’d encourage you to take a look at this FAQ and this thread where a lot of the additional information that can be obtained from LOO results are described in addition to common causes of error and troubleshooting advice.

Hopefully this is helpful!


Hi. Thank you and I appreciate your detailed answer. It certainly helps to understand many things.

I use cmdstanr.

I am still trying to understand your concerned about the loo(model2) output. I get that it may be suggesting that model2 do not fit the data well or perhaps too well and leading to over-fitting. Are you basing on this to conclude that one should ignore the loo_compare output and choose model1 as the better fitting model? Thanks.

Without knowing more about the model and results, I can’t say which model is preferred via LOO. The reason that the loo_compare() result isn’t informative is because only one of the models has a trustworthy LOO result. It might be that model2 is indeed the better model, but you can’t say conclusively until you know that you can trust the comparisons. In this case, the Pareto K statistics indicate that the loo(model2) call is problematic, so using loo_compare() with he object from the loo(model2) call precludes interpreting the result.

As for why I think the problem might be to do with overfitting is the change in the number of effective parameters. According to the LOO results, model1 has 3.3 effective parameters while model2 has 17.8, which is a pretty sizable jump. Because I don’t know anything about the model that you’re fitting, I can only guess as to what the actual number of parameters are; however, I think it is a reasonable guess that the number of parameters (p) is relatively large in the context of sample size (i.e., p > N/5). When Pareto K values are elevated and p_loo < p (where p is larger relative to n), it is usually a sign that the model is overly flexible. The FAQ I linked to in the previous comment includes reference to the LOO glossary where interpretations of p_loo when Pareto K is elevated (among many other useful subjects), so in case you didn’t follow the link to that glossary, here is a direct link: LOO package glossary — loo-glossary • loo.

Without knowing any additional details about your model or your data, the only thing I could suggest at this point is to see whether you can reliable results from loo(model2). Since you’re using cmdstanr, you should be able to do the following to get moment matched LOO results, which can sometimes remove the issue of problematic observations:

loo_model2 <- model2$loo(moment_match = TRUE)

This assumes that you have a parameter named log_lik estimated as part of the model, so obviously change this as needed. Additional details about the use of LOO with cmdstanr fit objects are here: Leave-one-out cross-validation (LOO-CV) — fit-method-loo • cmdstanr


I am running a survival model. The simpler model is one without heterogeneity (aka, random effects). The more complex model has heterogeneity. But since there is no closed-form solution to integrate out the heterogeneity distribution, I am creating individual-level parameters. I don’t fully understand how much this coding approach is altering the number of parameters. If I could integrate out the heterogeneity distribution, then I would be just adding an additional parameter.

This assumes that you have a parameter named log_lik estimated as part of the model, so obviously change this as needed. Additional details about the use of LOO with cmdstanr fit objects are here: Leave-one-out cross-validation (LOO-CV) — fit-method-loo • cmdstanr

So I already have log_lik in my stan program. But I still got an error using moment matching for loo. It is tripping up at the individual-level parameter.

I appreciate your help. This is just a toy example with a toy dataset for me to learn. Your comments help so much already.

@wgoette provided an excellent answer, and I’ll just add a few comments

  • Even without any high khats, as the number of observations is small (n << 100) and the elpd_diff is small (<4), it is likely that se_diff is underestimated and thus the normal approximation based “significance” result can not be trusted. See details in [2008.10296v3] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison. The differences smaller than 4 (in elpd scale) are usually negligible, so there is not any practical difference between the estimated model performances.
  • When we get khat warnings, the provided elpd_loo estimate is likely to be optimistic. As the difference is small, it is very likely that if the cross-validation computation would be fixed, that the difference would still be small or the more complex model would be worse. Thus, we can be quite certain, that the more complex model is not better (but also probably not worse).
  • “the number of effective parameters” should be “the effective number of parameters” (the first one implies that only some of the parameters are effective, while the latter one implies that all parameters together have some effectiveness)
  • Moment matching is unlikely to help if there are parameters that get likelihood information only from one observation (e.g. individual level parameter and only on observation per individual). This is discussed in the moment matching paper Implicitly adaptive importance sampling | Statistics and Computing.
  • Integrated LOO (see. e.g. tutorial by @andrewGhazi - Integrated LOO walkthrough) might help, but as I mentioned above, it is likely that in this case the extra work for improving the LOO computation would not change the conclusions. However, the situation can be different for your real data (ie if the more complex model seems to have much higher elpd_loo and many high khats, integrated LOO might help).
  • Can you be more specific about your survival model? For some survival models, there are corresponding models with overdispersion parameter.
1 Like

Thanks for the suggestions.

I think the biggest challenge is that I am using a toy example and it has a small N, and both of you are saying that a small N is not suitable for loo. I need to find a different dataset with a larger N and then try this again.

I did not say that. I said that the diff_se can be optimistic with small N, which is quite different. Any inference is challenging with small N.

1 Like

It may be useful to compare this with the very wide uncertainty intervals one gets for variable importance when N is small (here importance = relative explained variation). A lot of research involving more than one predictor when N is small is published, and very little of it is reliable.

A machine learning guru at Stanford once said at a conference that machine learning works for any sample size. What rubbish.

1 Like