WAIC and LOO-CV for not identically distributed data?

Hi Stan-Community :),

first of all, thank you very much for this great software and the whole ecosystem around it!
Very helpful for researchers of all kinds.

I have a question about the assumptions behind WAIC and LOO-CV.
Under what circumstances can these methods be used?

In particular, I am interested in whether these techniques can be used in the case that the data are independent but not identically distributed.


  • In my analysis, it is assumed, to a good approximation, that the data points are independently distributed.
  • However, the assumption of identically distributed data is not valid!
  • An example of the data used can be seen here: example.pdf (22,4 KB)

I am grateful for any help!


Edit: what is written here is not entirely correct. See below for more.

LOO-CV can be expected to work as long as:

  • It passes its own diagnostics (Pareto-k values not too high)
  • The points are conditionally independent, such that the likelihood is the product of the pointwise likelihoods.
  • At least one candidate model is reasonably well specified.

Thus, if your data aren’t identically distributed, but your models assume that they are, then you might run afoul of the final bullet point above. On the other hand, if your models are sufficiently flexible to capture (a good approximation to) the generative process, then you’ll be fine. However, when data are not identically distributed, I’d hazard a guess that in general there’s a stronger possibility of strongly influential points in the analysis. These might make it harder for LOO-CV to work, but should be picked up by the Pareto-k diagnostics mentioned in the first bullet above. So in your position I’d prepare for the possibility that LOO-CV might not work too well. A fallback that should work more reliably (provided that bullet points 2 and 3 above are satisfied) is to do brute-force k-fold cross-validation.
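To make the fallback concrete, here is a minimal sketch of brute-force k-fold cross-validation on simulated heteroscedastic data. Everything here is a toy assumption of mine, not from this thread: the linear model, the known per-point `sigma_i`, and the weighted-least-squares fit standing in for a full Bayesian refit per fold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: independent but not identically distributed
# (each point has its own known measurement error sigma_i).
n = 100
x = rng.uniform(0, 1, n)
sigma = rng.uniform(0.1, 0.5, n)          # per-point noise scale
y = 2.0 + 3.0 * x + rng.normal(0, sigma)

def fit_wls(x, y, sigma):
    """Weighted least squares = ML estimate for y_i ~ N(a + b*x_i, sigma_i)."""
    w = 1.0 / sigma**2
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def lpd_point(y, mu, sigma):
    """Pointwise normal log predictive density."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

# Brute-force k-fold CV: refit on each training split, score held-out points.
k = 10
folds = np.arange(n) % k
elpd_kfold = 0.0
for f in range(k):
    train, test = folds != f, folds == f
    a, b = fit_wls(x[train], y[train], sigma[train])
    elpd_kfold += lpd_point(y[test], a + b * x[test], sigma[test]).sum()

print(elpd_kfold)
```

In a Stan workflow the refit inside the loop would be a full MCMC fit per fold; the sum of held-out log predictive densities plays the same role as the elpd estimate from LOO.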

Also tagging @avehtari in case I’ve gotten ahead of myself in anything I’ve said!

Thank you for taking the time to write an answer!

But I’m still unsure if cross-validation can be applied at all.
I am curious what you think about the following points:

As stated by Sumio Watanabe, for example (last sentence of the first paragraph of section ‘2.1 Definitions of statistical inference’):

“The cross validation procedure needs the i.i.d. condition, whereas information criteria can be used in several not i.i.d. cases as shown in Watanabe (2021).”

The citation ‘Watanabe (2021)’ refers to the following paper, which is about WAIC for mixture models.

From this I conclude that WAIC is applicable in the case of independent but not identically distributed data.
Is this conclusion correct?


This stuff is complicated and I’m less than certain that I’ve gotten it right. The best resource I know of is here:


A key point which isn’t elaborated in detail in the post linked above is the difference between “exchangeability” and “i.i.d.”. But just as an example, I’m reasonably certain that it’s unproblematic to apply LOO-CV to distributional regression models.

Ok, thanks for the link. I will have a look :)

For the purpose of clarification:


  • It is assumed that one has n independent data points y_1, ..., y_n.
  • Furthermore each of these data points follows a normal distribution y_i ~ N(mu_i, sigma_i).
  • Due to different mean and standard deviation for each data point y_i the data points are not identically distributed.

Thus the likelihood can be written as:

p(y \mid \theta) = \prod_{i=1}^{n} N(y_i \mid \mu_i(\theta), \sigma_i),

with the parameter vector \theta.
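As a side note, this factorization into pointwise likelihoods is exactly what LOO-type methods need: an S x n matrix of log-likelihood values (posterior draws by data points). A minimal numerical sketch, where the linear form of mu_i(theta) and the stand-in “posterior draws” are hypothetical choices of mine:

```python
import numpy as np

def normal_logpdf(y, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

rng = np.random.default_rng(1)
n, S = 5, 1000                                # data points, posterior draws
x = np.linspace(0, 1, n)
sigma = np.array([0.1, 0.2, 0.3, 0.4, 0.5])   # differs per point: not i.d.
y = 1.0 + 2.0 * x + rng.normal(0, sigma)
theta = rng.normal([1.0, 2.0], 0.05, size=(S, 2))  # stand-in posterior draws

# S x n matrix of pointwise log-likelihoods -- the input that loo/PSIS needs.
mu = theta[:, [0]] + theta[:, [1]] * x             # shape (S, n)
log_lik = normal_logpdf(y, mu, sigma)
print(log_lik.shape)
```

In Stan this matrix is what you would compute in `generated quantities` as `log_lik`; the per-point `sigma_i` poses no obstacle to the computation itself.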

I am still confused about the meaning of all this.

@avehtari and @jsocolar would you please clarify a few more questions:

  1. Does the assumption y_i ~ N(mu_i, sigma_i) automatically imply non-identically distributed data?
  2. Or does “identically distributed”, in the context of the data-generating mechanism (see for example example.pdf given in the first post), mean that the data points are identically distributed according to the unknown, underlying true distribution?

y_i ~ N(mu_i, sigma) implies that the residuals are identically distributed.
y_i ~ N(mu_i, sigma_i) implies that the scaled residuals are identically distributed.

In either case, we might be willing to treat the data points as exchangeable if and only if we are willing to assume that the distribution of covariates x_i that yield the predictions for \mu_i and \sigma_i is adequately described by the sample of points in our dataset. That is, we can assume that the pairs (x_i, y_i) are exchangeable. But to do this, and to use LOO to predict performance on future data, we need to additionally assume that future samples from the joint distribution of (x_i, y_i) will look similar to the observed distribution in our data. And this means that we need to assume not only that the true generative model for p(y_i|x_i) won’t change (an implicit assumption baked into just about any assessment of predictive performance), but also that the observed distribution of x_i is a good approximation to the future distribution of the new x_i that we would like to predict.
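A quick numerical illustration of the residuals-vs-scaled-residuals distinction (the particular distributions drawn for mu_i and sigma_i here are arbitrary choices for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
mu = rng.uniform(-5, 5, n)          # different mean per point
sigma = rng.uniform(0.5, 3.0, n)    # different sd per point
y = rng.normal(mu, sigma)           # y_i not identically distributed

raw = y - mu                        # residuals: sd varies with sigma_i
scaled = (y - mu) / sigma           # scaled residuals: standard normal for every i

print(scaled.std())
```

The scaled residuals have standard deviation close to 1 regardless of the per-point sigma_i, while the raw residuals do not.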


Thank you very much for this detailed answer!

@jsocolar’s answer is good. I’ll add:

  • The observations don’t need to be identically distributed
  • LOO can be useful for internal model consistency check even without exchangeability assumption
  • If LOO is used to estimate the future predictive performance, then we need to assume some exchangeability between the past data and the future data. The simplest form is conditionally i.i.d. but it’s not required.
  • We can assume that the data generating mechanism is changing, but then we need to model that change, and it’s possible to combine such a model with LOO-CV. This is rarely done, but there are some examples.

Forgot to add

Aki, can you provide links to some examples of modeling change in the data generating mechanism and combining that with LOO-CV?

See, e.g., Importance-Weighted Cross-Validation for Covariate Shift | SpringerLink
You may find more by searching for the terms covariate shift and cross-validation.
As I said, there are not many examples.
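For intuition only, here is a toy sketch of the importance-weighting idea (not the method from the linked paper; the Gaussian covariate densities, the nearest-neighbor predictor, and the squared-error loss are all illustrative assumptions of mine). The CV score is reweighted by the density ratio p_test(x)/p_train(x) so that it targets performance under the shifted covariate distribution:

```python
import numpy as np

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

rng = np.random.default_rng(3)
n = 200
x = rng.normal(0.0, 1.0, n)                 # training covariates ~ p_train
y = np.sin(x) + rng.normal(0, 0.2, n)

# Density-ratio weights w_i = p_test(x_i) / p_train(x_i); p_test is shifted.
w = normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0)

# Leave-one-out squared error of a 1-nearest-neighbor predictor.
loo_err = np.empty(n)
for i in range(n):
    others = np.delete(np.arange(n), i)
    j = others[np.argmin(np.abs(x[others] - x[i]))]
    loo_err[i] = (y[i] - y[j]) ** 2

plain = loo_err.mean()                      # targets performance under p_train
shifted = np.sum(w * loo_err) / np.sum(w)   # targets performance under p_test
print(plain, shifted)
```

The two estimates differ because points that are common under p_test but rare under p_train get up-weighted; this is the basic mechanism behind importance-weighted CV under covariate shift.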
