Comparing different likelihoods with loo-cv

Dear Stan community, I have a question that I have been asking myself for a long time.

To put it in context:
I needed to compare different models for a dataset with a response measured for each patient at several time points. For instance, one model was a random-effects Poisson, another a random-effects negative binomial, another a multivariate Poisson, among others. Many of these models assumed different distributions for the response variable. All were longitudinal models, but, for example, one of them used only the previous data points to predict just the final endpoint.

LOO-CV estimates out-of-sample predictive accuracy and is related to the Kullback-Leibler divergence, so theoretically all models fitted to the same data are comparable with this criterion.

However, reading McElreath’s Statistical Rethinking book, he states: “… it is tempting to use information criteria to compare models with different likelihood functions… Unfortunately, WAIC (or any other information criterion) cannot sort it out. The problem is that deviance is part normalizing constant. The constant affects the absolute magnitude of the deviance, but it doesn’t affect fit to data. Since information criteria are all based on deviance, their magnitude also depends on these constants. That is fine, so long as all of the models you compare use the same outcome distribution type… In that case, the constants subtract out when you compare models by their differences. But if the two models have different outcome distributions, the constants don’t subtract out, and you can be misled by a difference in AIC/DIC/WAIC.”

My question is whether McElreath is correct and it is therefore not possible to use PSIS-LOO to compare the performance of models with different likelihoods.

However, I found this on the web:

“Yes, you can compare different likelihoods with ICs such as AIC or WAIC, with exceptions. These exceptions are probably the thought underlying the quoted paragraph, but I admit that the text is sufficiently vague to create confusion.

Generally, different likelihoods are comparable (note, by the way, that the use of deviance in the text is a bit confusing because deviance is often defined as the difference to a saturated model, but here it only means log L). However, there are a number of exceptions. Some common situations are:

  • Changes in the number of data points
  • Changing the scale of the response variable (e.g., doing a log transformation on y)
  • Changing the codomain of the probability distribution, e.g., comparing continuous with discrete distributions.”
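
A minimal sketch of the usual remedy for the second exception, assuming hypothetical pointwise log-likelihood matrices ll_y (model for y) and ll_logy (model for log(y)), each with posterior draws in rows and observations in columns:

```r
library(loo)

# Change of variables: if z = log(y), then log p(y) = log p(log y) - log(y),
# so subtract log(y_i) from each column of the log-scale model's log-likelihood
# to put both models on the scale of y before comparing.
ll_logy_adj <- sweep(ll_logy, 2, log(y), FUN = "-")

loo_compare(loo(ll_y), loo(ll_logy_adj))
```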

Thank you in advance for your help.

I think McElreath is partially correct. It’s not always possible, but it is doable in some cases. @avehtari is the expert on this topic. Here are a few of his replies to similar questions:

And I think in this case study he compares Poisson and negative binomial models, but it’s been a while since I read it:

https://users.aalto.fi/~ave/modelselection/roaches.html#6_Zero-inflated_negative-binomial_model
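
A minimal sketch of that kind of comparison with rstanarm and loo, assuming a hypothetical data frame d with a count outcome y and a single predictor x (not the actual roaches data):

```r
library(rstanarm)
library(loo)

# Two models for the same count outcome, differing only in the
# response distribution (Poisson vs. negative binomial).
fit_pois <- stan_glm(y ~ x, family = poisson(), data = d)
fit_nb   <- stan_glm(y ~ x, family = neg_binomial_2(), data = d)

# PSIS-LOO for each model, then compare on the elpd scale.
loo_pois <- loo(fit_pois)
loo_nb   <- loo(fit_nb)
loo_compare(loo_pois, loo_nb)
```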

1 Like

First, the question should be about comparing data models, which describe distributions in observation space, and not about likelihoods, which are unnormalized distributions in parameter space. Different data models may have different parameters and different numbers of parameters, but if they model the same data, then they have a common space for comparison.

Not correct, in the sense that WAIC (and LOO) can be used to compare different data models; but, depending on the definition of deviance, it is possible to throw relevant information away and make the comparison difficult. Are you using the first or second edition of Statistical Rethinking? (If the second edition, then I’m certain Richard will fix this in the third edition.)

For a better answer, see the CV-FAQ entry “Can cross-validation be used to compare different observation models / response distributions / likelihoods?”. The answer includes links to case studies that illustrate different non-trivial cases as well.

3 Likes

[Edit: Oops—didn’t see Aki and Jonah had already responded, and thankfully I was consistent with what Aki said—he’s the expert in these matters (which also agrees with what Jonah said!).]

The idea behind cross-validation is that you’re trying to estimate expected log predictive density (ELPD). This is based on p(\widetilde{y} \mid y), where y is the “training” data and \widetilde{y} is “new” data. In a parametric model with parameters \theta, this predictive density is

\displaystyle \mathbb{E}\!\left[ \strut p(\widetilde{y} \mid \theta) \mid y\right] = \int_{\Theta} p(\widetilde{y} \mid \theta) \cdot p(\theta \mid y) \, \textrm{d}\theta \approx \frac{1}{M} \sum_{m=1}^M p\!\left(\strut \widetilde{y} \mid \theta^{(m)}\right),

where \theta^{(m)} \sim p(\theta \mid y).
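
As a small illustration of that Monte Carlo approximation, assuming hypothetical posterior draws of a Poisson rate and a single new observation:

```r
# Made-up stand-ins: draws theta^(m) from p(theta | y) for a Poisson rate,
# and one new observation y_tilde.
lambda_draws <- rgamma(4000, shape = 2, rate = 0.5)
y_tilde <- 3

# (1/M) * sum_m p(y_tilde | theta^(m)), logged because ELPD is on the log scale.
log_pred_density <- log(mean(dpois(y_tilde, lambda = lambda_draws)))
```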

Leave-one-out cross-validation evaluates p(y_n \mid y_1, \ldots, y_{n-1}, y_{n+1}, \ldots, y_N) for each n and combines these to get an estimate. The loo package in R does this by using importance sampling rather than refitting the model for each n.
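
In code, that looks roughly like the following, assuming a Stan fit object fit whose generated quantities block stores the pointwise log-likelihood as log_lik (a common convention, not something fixed by the package):

```r
library(loo)

# Pointwise log-likelihood draws, kept per chain so relative
# effective sample sizes can be computed for the PSIS weights.
log_lik <- extract_log_lik(fit, parameter_name = "log_lik",
                           merge_chains = FALSE)
r_eff <- relative_eff(exp(log_lik))

# PSIS-LOO: importance sampling instead of refitting for each n.
loo_result <- loo(log_lik, r_eff = r_eff)
print(loo_result)
```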

LOO makes sense whenever the data are exchangeable. Notice that it’s not saying anything about the parametric form that’s getting marginalized out. And the exchangeability is conditional on covariates, so regressions fit this form. A time-series model would not fit this form, and in those cases you need to do something like leave-future-out to make sense predictively.
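
A minimal sketch of exact leave-future-out for that case, refitting at each step; the formula, the data frame d (assumed ordered by time), and the starting point L are all placeholders:

```r
library(rstanarm)

L <- 20                      # minimum number of time points used for fitting
N <- nrow(d)                 # d is assumed to be ordered by time
log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))

elpd_lfo <- numeric(0)
for (t in L:(N - 1)) {
  fit_t <- stan_glm(y ~ x, family = poisson(), data = d[1:t, ], refresh = 0)
  # Log predictive density for the next, unseen time point.
  ll_next <- log_lik(fit_t, newdata = d[t + 1, , drop = FALSE])
  elpd_lfo <- c(elpd_lfo, log_mean_exp(ll_next))
}
sum(elpd_lfo)
```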

Aki’s FAQ is very useful, especially in terms of how the data is divided (scikit-learn in Python has a similar cross-validation organization): Cross-validation FAQ • loo

2 Likes

I’ve also been advocating leave-future-out (e.g., in Approximate leave-future-out cross-validation for time series models), but in Cross-validatory model selection for Bayesian autoregressions with exogenous regressors we show that it is not the best approach for time-series model selection, as other approaches may have lower variance without too much bias.

2 Likes