Information criteria: suitable for different response distributions?

I am uncertain whether it is appropriate to compare models with different response distributions (likelihoods) using an information criterion like LOO-IC (or the estimated ELPD instead, but I guess that makes no difference). In this example, (PSIS-)LOO-CV was used to compare a Poisson to a negative binomial model. In contrast, McElreath (2016, section 9.2.4, page 288) states:

Really, all you have to remember is to only compare models that all use the same type of likelihood.

These two sources seem like a contradiction to me. Is the point that in the “roaches” example, a Poisson and a negative binomial model were compared and that the Poisson distribution is a special case of the negative binomial distribution (with dispersion 0)?

References:
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan (1st ed.). CRC Press.

Since what is badly named LOO-IC is just -2*elpd, the main difference is whether I’m frowning or smiling when answering the question. See the explanation of why I don’t like multiplying by -2 in this thread: https://twitter.com/avehtari/status/1227900322082906112

This is ok.

That is an unfortunate mistake by McElreath, and he now knows better. (Hopefully in the future all books will be in git repos, so that when we find a mistake like this we can open an issue or pull request to fix it.)

My earlier answer is in the thread Can WAIC/LOOIC be used to compare models with different likelihoods?

Please tell me if that answer is not clear. I should add that answer to the loo documentation and to some books, so it helps if it can be clarified. I know some ways to clarify it with more explanation, but is it useful in its short version?
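To make the point concrete, here is a minimal sketch (not the roaches example, and not how the loo package computes things) that compares two models with different likelihoods on the same count data. To avoid MCMC it assumes conjugate priors, so the leave-one-out predictive mass is available in closed form and exact LOO can be used instead of PSIS-LOO. The data, the priors, and all function names are made up for illustration.

```python
import math

# Hypothetical count data, made up purely for illustration.
y = [2, 3, 1, 4, 2, 3, 5, 2, 3, 4]

def log_beta(a, b):
    """Log of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def poisson_gamma_logpred(y_new, y_obs, shape=1.0, rate=1.0):
    """Log posterior predictive mass for a Poisson likelihood with an
    assumed Gamma(shape, rate) prior on the rate (a negative binomial)."""
    r = shape + sum(y_obs)
    p = (rate + len(y_obs)) / (rate + len(y_obs) + 1.0)
    return (math.lgamma(r + y_new) - math.lgamma(r) - math.lgamma(y_new + 1)
            + r * math.log(p) + y_new * math.log(1.0 - p))

def geometric_beta_logpred(y_new, y_obs, a=1.0, b=1.0):
    """Log posterior predictive mass for a geometric likelihood
    theta * (1 - theta)^y with an assumed Beta(a, b) prior on theta."""
    a_post = a + len(y_obs)
    b_post = b + sum(y_obs)
    return log_beta(a_post + 1.0, b_post + y_new) - log_beta(a_post, b_post)

def exact_loo_elpd(logpred, data):
    """Sum of log predictive masses, each observation predicted from the rest."""
    return sum(logpred(data[i], data[:i] + data[i + 1:])
               for i in range(len(data)))

elpd_poisson = exact_loo_elpd(poisson_gamma_logpred, y)
elpd_geometric = exact_loo_elpd(geometric_beta_logpred, y)

print(f"elpd_loo Poisson-Gamma:  {elpd_poisson:.2f}")
print(f"elpd_loo geometric-Beta: {elpd_geometric:.2f}")
print(f"difference:              {elpd_poisson - elpd_geometric:.2f}")
```

Both predictive distributions are probability mass functions over the same outcomes (non-negative integer counts), so the two sums are on the same scale even though the likelihoods differ. A real comparison should also look at the uncertainty of the difference, which this sketch leaves out.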

Thanks and yes, that answer is clear. Another possible justification that just came to my mind is based on information theory: Doesn’t maximizing the ELPD basically correspond to applying a “minimum entropy” principle? I guess this statement is not 100% true, as $\text{ELPD} = \int \log p(\tilde{y} \mid y) \cdot p(\tilde{y}) \, \text{d}\tilde{y}$, and for having a minimum entropy principle we would need $\int \log p(\tilde{y} \mid y) \cdot p(\tilde{y} \mid y) \, \text{d}\tilde{y}$, but I thought it might give an intuition. And to my knowledge, the maximum entropy principle does indeed allow considering different distributions (keeping in mind the caveat about discrete and continuous distributions you mentioned in the older thread). So shouldn’t this also apply to this “kind-of” minimum entropy principle?

Concerning the confusion about the term “information criterion”: I also find the ELPD more intuitive than multiplying it by -2, but I thought that for most readers, the term “information criterion” is more common than “ELPD”. Btw., in BDA3 on page 169 (bottom), the text also suggests that an information criterion is obtained by multiplying the ELPD by -2. So perhaps that part needs to be updated.

Thanks for the feedback.

No, as you also write yourself. Minimum entropy would make a point estimate with zero uncertainty and not care about the discrepancy. Maximizing elpd corresponds to minimizing the KL-divergence from the true data-generating distribution to the predictive distribution.
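A small numerical sketch of that difference, under made-up assumptions: the "true" distribution is taken to be Poisson(3), the candidate predictive distributions are Poisson with a few different rates, and the support is truncated at 60 (the mass beyond that is negligible here). All names are hypothetical. It uses the identity elpd(q) = -H(p_true) - KL(p_true || q), so ranking by elpd is the same as ranking by KL-divergence, while the "minimum entropy" score just prefers the most concentrated candidate.

```python
import math

SUPPORT = range(0, 60)  # truncated support, assumed wide enough for these rates

def poisson_logpmf(k, lam):
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def pmf(lam):
    return [math.exp(poisson_logpmf(k, lam)) for k in SUPPORT]

true_p = pmf(3.0)  # assumed "true" data distribution

def elpd(q):
    """Expected log predictive mass of q under the true distribution."""
    return sum(p * math.log(qk) for p, qk in zip(true_p, q))

def neg_entropy(q):
    """Expected log mass of q under q itself (the 'minimum entropy' score)."""
    return sum(qk * math.log(qk) for qk in q)

def kl_from_truth(q):
    """KL(true_p || q)."""
    return sum(p * (math.log(p) - math.log(qk)) for p, qk in zip(true_p, q))

candidates = {lam: pmf(lam) for lam in (0.5, 1.0, 3.0, 6.0)}

for lam, q in candidates.items():
    print(f"rate {lam}: elpd={elpd(q):.3f}  KL={kl_from_truth(q):.3f}  "
          f"neg. entropy={neg_entropy(q):.3f}")

best_by_elpd = max(candidates, key=lambda lam: elpd(candidates[lam]))
best_by_neg_entropy = max(candidates, key=lambda lam: neg_entropy(candidates[lam]))
print("best by elpd:", best_by_elpd)
print("best by negative entropy:", best_by_neg_entropy)
```

The elpd picks the candidate that matches the true distribution, whereas the negative entropy picks the narrowest candidate regardless of how far it is from the truth.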

I regret that part almost every day. If you read the Twitter thread I linked, you can see that the errata and future printings say to read instead: Vehtari, Gelman, and Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413-1432, https://doi.org/10.1007/s11222-016-9696-4. Preprint: http://arxiv.org/abs/1507.04544

Ah okay, I think I got my error in reasoning. Thank you.
