Interpreting output of multiple comparisons using loo

If I use the loo function compare() to compare two models, I get an estimate of how much better the second model is (elpd_diff) along with the standard error of that estimate.

So if I run and get:


elpd_diff        se 
     31.3       8.1 

I would say that the second model seems to be roughly 4 SEs better.
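For context (a hedged sketch, not the actual loo source): elpd_diff is the sum of the pointwise differences in log predictive density between the two models, and its SE is the standard deviation of those pointwise differences scaled by sqrt(n). A minimal Python illustration with simulated pointwise values (all numbers below are made up):

```python
import math
import random

random.seed(1)
n = 1000

# hypothetical pointwise elpd contributions for two models
elpd1 = [random.gauss(-5.2, 1.0) for _ in range(n)]
elpd2 = [e + random.gauss(0.03, 0.3) for e in elpd1]  # model 2 slightly better

# pointwise differences, model 2 minus model 1
diff = [b - a for a, b in zip(elpd1, elpd2)]

elpd_diff = sum(diff)
mean_diff = elpd_diff / n
var_diff = sum((d - mean_diff) ** 2 for d in diff) / (n - 1)
se_diff = math.sqrt(n * var_diff)  # SE of the total difference

print(f"elpd_diff = {elpd_diff:.1f}, se = {se_diff:.1f}")
```

A difference several SEs away from zero, as in the output above the sketch, is then read as reasonably clear evidence for the second model.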

But exactly how do I interpret a comparison of multiple models at the same time - specifically:


         elpd_diff elpd_loo se_elpd_loo p_loo se_p_loo   looic se_looic
loom2Va        0.0  -5146.2        41.5  32.1      0.4 10292.4     83.0
loom2Vai      -4.0  -5150.2        41.7  39.1      0.5 10300.4     83.3
loom1Vb      -35.8  -5182.0        40.8  31.0      0.4 10364.1     81.5
loom1V      -171.6  -5317.8        39.4  25.1      0.3 10635.6     78.8
loom0V      -202.9  -5349.1        38.7  23.9      0.3 10698.3     77.3

All of a sudden I am getting a lot more information and I am not so sure how to read it. What can I conclude in this case?

This may be a question answered elsewhere, but I have not found a simple answer so far so your input is valued.


Unfortunately this output is missing diff_se, which would give you the same information as when comparing two models. The differences are computed relative to the model with the highest expected log predictive density (elpd_loo). You can still use this to see the ordering, and then compare two models at a time to check the diff_se's. Before seeing the other diff_se's, my guess is that there is uncertainty about the difference between loom2Va and loom2Vai, but that these two models have clearly better predictive performance than the others.
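To illustrate the point about the differences being taken against the best model, here is a sketch in Python using the elpd_loo totals from the table above. Note that the SE of each difference cannot be recovered from these totals alone; it depends on the pointwise values, which is why the pairwise compare() calls are needed.

```python
# elpd_loo totals copied from the table in the question
elpd_loo = {
    "loom2Va": -5146.2,
    "loom2Vai": -5150.2,
    "loom1Vb": -5182.0,
    "loom1V": -5317.8,
    "loom0V": -5349.1,
}

# the reference is the model with the highest elpd_loo
best = max(elpd_loo, key=elpd_loo.get)

# elpd_diff column: each model's elpd_loo minus the reference's
elpd_diff = {m: round(v - elpd_loo[best], 1) for m, v in elpd_loo.items()}
# reproduces the elpd_diff column: 0.0, -4.0, -35.8, -171.6, -202.9
```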

Changing this output to be clearer is on our (with @jonah) todo list.


That makes sense - thank you for your response, avehtari.

I have just analysed them against each other, and it seems your intuition was good: the two bigger models are better than the rest but similar to each other. May I ask how you would prefer this to be reported?

One way could be to plot all the models against the null model (loom0V) with 2 SEs as error bars:

In this case, however, it seems that there is barely any difference between 3 of the models.

When I compare them pairwise, one can see that the two complex models are at least 2 SEs better than the best single-predictor model, but this may look more confusing.

Do you have any preference or would you choose a completely different way of reporting this?

Edit: While writing this, I realised one could also do as in the table in my original post and compare all models against the best model. I think I would prefer that myself:


Yes, this is what I recommend, too. And presenting the result as a plot like you did is much better than as a table. It’s easy to quickly see the differences!

You may further consider whether you have some application-specific measure that would give a more interpretable calibration of the model differences.
