Dear Stan community, I have a question that I have been asking myself for a long time.
To put it in context:
I needed to compare several models for a dataset in which each patient has a response measured at several time points. For instance, one model was a random-effects Poisson, another a random-effects negative binomial, another a multivariate Poisson, among others. Many of these models assumed different distributions for the response variable. All were longitudinal models, but, for example, one of them used the previous time points to predict only the final endpoint.
As I understand it, LOO-CV estimates the expected log predictive density, which asymptotically amounts to ranking models by their Kullback-Leibler divergence from the true data-generating process, so in theory all models fitted to the same data should be comparable with this criterion.
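To spell out what I mean (standard notation; $y_{-i}$ is the data with the $i$-th observation left out), the LOO estimate targets

$$
\widehat{\mathrm{elpd}}_{\mathrm{loo}} \;=\; \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),
$$

and, as far as I know, maximizing the expected log predictive density is the same as minimizing the KL divergence to the true data-generating distribution up to an additive constant (the entropy of the truth), which is identical for every model fitted to the same data.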
However, in Statistical Rethinking, McElreath states: “… it is tempting to use information criteria to compare models with different likelihood functions… Unfortunately, WAIC (or any other information criterion) cannot sort it out. The problem is that deviance is part normalizing constant. The constant affects the absolute magnitude of the deviance, but it doesn’t affect fit to data. Since information criteria are all based on deviance, their magnitude also depends on these constants. That is fine, so long as all of the models you compare use the same outcome distribution type… In that case, the constants subtract out when you compare models by their differences. But if the two models have different outcome distributions, the constants don’t subtract out, and you can be misled by a difference in AIC/DIC/WAIC.”
My question is whether McElreath is correct and it is therefore not possible to use PSIS-LOO to compare the performance of models with different likelihoods.
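To make the objects being compared concrete for the case I mention above (Poisson vs. negative binomial), here is a toy sketch of the pointwise log-likelihoods involved (plug-in parameter values chosen for illustration and scipy parameterizations, not my fitted Stan models):

```python
# Toy sketch: pointwise log-likelihoods of a Poisson and a negative-binomial
# model evaluated on the same count data. Both are normalized PMFs on the
# non-negative integers, and both contain the same -log(y!) term, which
# cancels when you take pointwise differences. Parameter values below are
# illustrative plug-ins, not posterior draws from my actual models.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.negative_binomial(n=3, p=0.4, size=200)  # toy overdispersed counts

mu = y.mean()  # stand-in for a Poisson / negative-binomial mean parameter
phi = 3.0      # assumed negative-binomial dispersion

logp_pois = stats.poisson.logpmf(y, mu)
logp_nb = stats.nbinom.logpmf(y, n=phi, p=phi / (phi + mu))

print("total log-lik, Poisson:", logp_pois.sum())
print("total log-lik, NegBin :", logp_nb.sum())
print("first pointwise differences:", (logp_nb - logp_pois)[:5])
```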
However, I found this on the web:
“Yes, you can compare different likelihoods with ICs such as AIC or WAIC, with exceptions. These exceptions are probably the thought underlying the quoted paragraph, but I admit that the text is sufficiently vague to create confusion.
Generally, different likelihoods are comparable (note, by the way, that the use of deviance in the text is a bit confusing because deviance is often defined as the difference to a saturated model, but here it only means log L). However, there are a number of exceptions. Some common situations are:
- Changes in the number of data points
- Changing the scale of the response variable (e.g., doing a log transformation on y) [see the sketch after this list]
- Changing the codomain of the probability distribution, e.g., comparing continuous with discrete distributions.”
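If I understand the second and third points correctly, the scale issue can be seen in a small sketch like this (toy data and assumed parameter values, nothing to do with my actual models): a Normal model on log(y) and a lognormal model on y describe the same data, but their pointwise log-densities differ by the Jacobian term log(y), so the totals differ by sum(log y) and a naive comparison across scales can flip a WAIC/LOO ranking.

```python
# Toy sketch of the "changing the scale of the response" exception.
# Model A: Normal likelihood on the transformed response z = log(y).
# Model B: lognormal likelihood on the original response y.
# Their pointwise log-densities differ exactly by the Jacobian term log(y).
# Parameter values are illustrative assumptions only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.lognormal(mean=1.0, sigma=0.5, size=100)  # toy positive responses

mu, sigma = 1.0, 0.5  # assumed plug-in parameters

loglik_log_scale = stats.norm.logpdf(np.log(y), loc=mu, scale=sigma)   # Model A
loglik_raw_scale = stats.lognorm.logpdf(y, s=sigma, scale=np.exp(mu))  # Model B

# Same model expressed on two scales: they differ by the Jacobian log(y).
print(np.allclose(loglik_log_scale - np.log(y), loglik_raw_scale))  # True

# Without the Jacobian adjustment, the totals are shifted by sum(log y),
# which has nothing to do with predictive fit.
print("difference of totals:", loglik_log_scale.sum() - loglik_raw_scale.sum())
print("sum(log y)          :", np.log(y).sum())
```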
Thank you in advance for your help.