Quick examples of loo() interpretation

As a newbie, I find loo() comparisons confusing. Could you please comment briefly on these examples?

Example 1: Is m2 better since elpd_diff is more than 3 times se_diff? Or is the difference insignificant because the numbers are so small?

Model comparisons:
                   elpd_diff se_diff
m1                     0.0     0.0  
m2                    -0.5     0.1  

Example 2: Is m2 worse since elpd_diff is more than 3 times se_diff?

Model comparisons:
                   elpd_diff se_diff
m1                    0.0       0.0  
m2                   -15.5     5.1  

Example 3: Are the models equal since elpd_diff is not 3-5 times larger than se_diff? And what about such a large se_diff?

Model comparisons:
                   elpd_diff se_diff
m1                    0.0       0.0  
m2                   -0.3     182.4

Hi di4oi4,

In Example 1, m1 is actually the better model. As you say, it outperforms m2 by more than 3 times the se_diff, if that is your criterion for calling a model “better”. That the numbers are small does not matter, because the absolute values of elpd are not meaningful in themselves, and similarly we cannot judge whether elpd_diff is “small” or “large” merely by looking at it. This is why we need to consider se_diff, just as you suggested.

EDIT: For people checking up on this in the future: @avehtari pointed out (below) that this difference can in fact be considered insignificant because the numbers are small, and that the absolute numbers can be interpreted. I am leaving the point above so as not to break the flow of the conversation, but what I said was not 100% correct.

Example 2: Yes, this is correct.

Example 3: Yes, your interpretation is correct again. Did you maybe fit different models / data here? An elpd_diff this small together with an se_diff this large could indicate that the two models fit the data equally well and that the absolute values of elpd are large, so their standard deviations are large too, and hence so is se_diff. Again, this is not meaningful by itself, as the absolute value of elpd does not tell us much without comparing it to something.

If you haven’t done so already, consider having a look at the loo glossary for more information about elpd.

Thank you so much! This is very useful information!

A comment about the last example (3): I got such a high se_diff because the model specifications differ:

m1 = y ~ predictor + country
m2 = y ~ predictor + (1 | country)

The predictions of the two models were nearly the same, as were the pp_checks and Rhats. Only the hierarchical model had slightly higher CIs in its predictions.

Sorry, I was not very clear in my wording above: I meant “different models / data” compared to examples 1 and 2. For different data or response variables, the values of elpd can change drastically, which would explain why se_diff (and elpd?) might be so much bigger here than in example 1 or 2.
Either way, I think your interpretations are correct :)

I am adding a link to the CV-FAQ, which has more on the interpretation.

Example 1: The difference is insignificant due to the small numbers.

Example 2: Yes.

Example 3: Model misspecification is likely. Do posterior predictive checking.

Thanks for correcting this; I added an EDIT note to my answer above. Just for my own understanding, because I find this surprising: when would you consider elpd_diff values small or large? As I mentioned above, my understanding was that the absolute value does not matter.

See answers 11 and 15 in CV-FAQ. TL;DR the absolute difference has an interpretation.
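
To make the TL;DR concrete, here is a minimal sketch of a two-step reading of a loo_compare row, applied to the three examples from the first post. The cutoff of 4 for a "small" absolute difference is my reading of the rule of thumb in the CV-FAQ, and the multiple of 3 is the criterion used earlier in this thread; both are assumptions to adjust, and the function and parameter names are hypothetical, not part of the loo package.

```python
def interpret_elpd_diff(elpd_diff, se_diff, small_diff=4.0, se_multiple=3.0):
    """Classify one model-comparison row with a two-step rule of thumb.

    small_diff and se_multiple are assumed thresholds, not fixed constants.
    """
    magnitude = abs(elpd_diff)
    # Step 1: look at the absolute difference first.
    if magnitude < small_diff:
        return "small difference"
    # Step 2: only for larger differences, compare against se_diff.
    if magnitude > se_multiple * se_diff:
        return "clear difference"
    return "difference not clearly distinguishable from zero"


# (elpd_diff, se_diff) for m2 in the three examples from the question.
examples = {
    "Example 1": (-0.5, 0.1),
    "Example 2": (-15.5, 5.1),
    "Example 3": (-0.3, 182.4),
}

results = {name: interpret_elpd_diff(*row) for name, row in examples.items()}
for name, verdict in results.items():
    print(name, "->", verdict)
```

Note that this rule says nothing about why se_diff is so large in Example 3; that still calls for posterior predictive checks, as suggested above.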

Thank you for the reply!

I did pp_checks. All except the last look similar.
Column 1: m2 = y ~ predictor + (1 | country)
Column 2: m1 = y ~ predictor + country

Why is y different in the last row?
The third row looks a bit suspicious, although not completely infeasible.
Are you using just a Gaussian model? What is y? Counts with some 0’s, too?

I don’t know why y is different in the last row, since I always used the same models for making these pp_check plots.

This is the full m2 model, family = hurdle_lognormal. Received treatment hours is the y variable, and there’s a lot of zero inflation.

fit = brm(bf(received_treatment_hours ~ predictor1 + … + predictor9 + (1 | region), hu ~ predictor1 + … + predictor9 + (1 | region)), data = data, family = hurdle_lognormal(), cores = 3, chains = 3)

Interestingly, the predictions I am interested in are consistent between the models and also similar to those from the split analysis (a lognormal model and a binomial model). What is also interesting is that the hierarchical structure did not give high se_diff values when comparing the lognormal/binomial models in the split analysis, but they became significant with the hurdle models.

Do you mean “they (se_diff) became high with hurdle models”? Are you certain that the observations are in the same order for both models? That one plot raises doubt, and different orders could explain the high se_diff.
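
To illustrate why observation order matters, here is a small simulation sketch. It assumes that elpd_diff is the sum of the pointwise elpd differences and that se_diff is sqrt(n) times their standard deviation, which is how I understand the loo package to compute them. The pointwise values are simulated, not taken from the models in this thread, and all names are hypothetical.

```python
import math
import random
import statistics


def elpd_diff_and_se(pointwise_a, pointwise_b):
    """Return (elpd_diff, se_diff) from two lists of pointwise elpd values."""
    diffs = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    n = len(diffs)
    elpd_diff = sum(diffs)
    # Standard error of the sum: sqrt(n) times the sample sd of the differences.
    se_diff = math.sqrt(n) * statistics.stdev(diffs)
    return elpd_diff, se_diff


random.seed(1)
n = 500
# Simulated pointwise elpd values: the second model is almost identical
# to the first for every observation.
model_1 = [random.gauss(-3.0, 2.0) for _ in range(n)]
model_2 = [value + random.gauss(0.0, 0.05) for value in model_1]

aligned = elpd_diff_and_se(model_1, model_2)

# Same values for the second model, but with observations in a different order.
model_2_shuffled = model_2[:]
random.shuffle(model_2_shuffled)
misaligned = elpd_diff_and_se(model_1, model_2_shuffled)

print("aligned:    elpd_diff = %.2f, se_diff = %.2f" % aligned)
print("misaligned: elpd_diff = %.2f, se_diff = %.2f" % misaligned)
```

Shuffling leaves the sum, and therefore elpd_diff, unchanged, but the pointwise differences no longer pair the same observation, so their spread and se_diff grow sharply. That is the same pattern as in Example 3: a tiny elpd_diff next to a very large se_diff.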