Can I use LOO in posterior predictive checking?

I sometimes get confused by the similar notation p(y^{rep} | y) and p(y^{new} | y) that occurs in posterior predictive checking and in cross-validation, respectively. In many cases, at least under iid sampling, the replication of y is conceptually identical to future new data. In light of this, can I replace the posterior predictive distribution

p(y^{rep}_{i}|y) = \int p(y^{rep}_i|\theta)p(\theta|y) d\theta

with its LOO version

p(y^{rep}_{i}|y_{-i}) = \int p(y^{rep}_i|\theta)p(\theta|y_{-i}) d\theta

and conduct PPC based on it afterwards? We also know how to make a cheaper approximation through PSIS.
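Just to make the PSIS idea concrete, here is a minimal numpy sketch for a toy normal(\theta, 1) model (all names and draws are made up for illustration, not any package's API): reweight the full-posterior draws by importance ratios proportional to 1/p(y_i|\theta_s) to approximate p(y^{rep}_i|y_{-i}); in practice one would smooth the weights with PSIS (e.g. via the loo R package or ArviZ) before normalizing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup (illustrative only): S posterior draws of a location parameter
# theta for a normal(theta, 1) model, and the S x n pointwise log-likelihood.
S, n = 4000, 20
theta = rng.normal(size=S)                       # stand-in posterior draws
y = rng.normal(loc=0.5, size=n)                  # stand-in observations
log_lik = -0.5 * (y[None, :] - theta[:, None])**2 - 0.5 * np.log(2 * np.pi)

def loo_predictive_draws(i, n_rep=1000):
    """Approximate draws from p(y_i^rep | y_{-i}) by importance resampling."""
    log_w = -log_lik[:, i]                       # ratio p(theta|y_{-i}) / p(theta|y), up to a constant
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                 # in practice, smooth these weights with PSIS first
    idx = rng.choice(S, size=n_rep, p=w)         # resample posterior draws toward the LOO posterior
    return rng.normal(loc=theta[idx], scale=1.0) # push resampled theta through the likelihood

y_rep_1 = loo_predictive_draws(0)                # approximate draws of y_1^rep | y_{-1}
```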

The motivation is that PPC in principle does not see overfitting: a horrible model that only assigns a Dirac measure at the observations will still pass PPC. Using the LOO-ized posterior predictive, there is no concern about using the data twice, and no need to calibrate the posterior predictive p-value (PPP), a calibration that can be difficult and is often omitted.

This is not anything new. For example, @avehtari et al.'s LOO-GLVM paper mentions the probability integral transform F(y_i | x_i, D_{-i}) \sim U[0,1], which almost amounts to a PPP (choosing the test statistic to be y itself)?
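A minimal sketch of that LOO-PIT check, again for a toy normal(\theta, 1) model (illustrative names only): \mathrm{PIT}_i = F(y_i | y_{-i}) is estimated as an importance-weighted average of the likelihood CDF at y_i, and the collection is compared against U[0,1].

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(1)

S, n = 4000, 20
theta = rng.normal(size=S)                            # stand-in posterior draws
y = rng.normal(loc=0.5, size=n)                       # stand-in observations
log_lik = norm.logpdf(y[None, :], loc=theta[:, None]) # S x n pointwise log-likelihood

pit = np.empty(n)
for i in range(n):
    log_w = -log_lik[:, i]                            # weights toward p(theta | y_{-i})
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                      # PSIS smoothing is advisable in practice
    pit[i] = np.sum(w * norm.cdf(y[i], loc=theta))    # weighted average of F(y_i | theta_s)

print(kstest(pit, "uniform"))                         # roughly U[0, 1] if the model is calibrated
```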

At a high level, PPC is aimed only at model misspecification. In a purely Bayesian framework, overfitting is ambiguous, especially when there is no obvious future sampling. Nevertheless, it is also reasonable to treat overfitting as part of model misspecification, since overfitting may result from prior misspecification, which should be diagnosable by PPC?

LOO gives you marginal distributions p(y_i|y_{-i}), so you can use it with test statistics that use just the marginals, but you can't use it with test statistics that use the joint distribution. For example, consider an example from BDA3 Ch 6 of a series of binary observations. The test statistic is the number of switches from 0 to 1 and from 1 to 0. Predictions from a binomial model have some distribution for the number of switches, but if the series has high autocorrelation, the observed number of switches can be very different from that of independent observations. In this case LOO is not applicable, and you need to replicate from the joint distribution p(y^{\rm rep}| y).
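For concreteness, a toy sketch of that check (a made-up autocorrelated series and a conjugate Beta-Bernoulli posterior, not the BDA3 data): whole series are replicated jointly from p(y^{\rm rep}|y) and the number of switches is compared with the observed value.

```python
import numpy as np

rng = np.random.default_rng(2)

# A strongly autocorrelated binary series (illustrative data, not from BDA3)
y = np.array([1] * 10 + [0] * 10)
n = len(y)

def n_switches(seq):
    """Number of 0->1 and 1->0 switches in the sequence."""
    return int(np.sum(seq[1:] != seq[:-1]))

# Posterior draws of theta under a Beta(1, 1) prior with iid Bernoulli(theta) likelihood
theta = rng.beta(1 + y.sum(), 1 + n - y.sum(), size=4000)

# Joint replication: one full series per posterior draw, then T computed per series
y_rep = rng.binomial(1, theta[:, None], size=(4000, n))
T_rep = np.array([n_switches(row) for row in y_rep])

print(np.mean(T_rep <= n_switches(y)))   # near 0 here: far fewer observed switches than replicated
```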

A test statistic that uses just the marginals works fine with LOO. This is especially useful with flexible models, where most observations are influential and most p(y_i|y_{-i}) are clearly different from p(y_i|y) (like GPs with more than a few dimensions).


Yes, using LOO with non-iid data is itself not straightforward. If I really used LOO to adjust the posterior of \theta by taking the whole joint distribution into account, I would get LOO posterior = full posterior / \prod likelihood = prior. Or in other words, PPC + joint LOO = marginal likelihood.
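Spelling out that leave-everything-out step for a factorized likelihood: dividing the full posterior by every likelihood term leaves only the prior,

\frac{p(\theta|y)}{\prod_{i=1}^n p(y_i|\theta)} \propto \frac{p(\theta)\prod_{i=1}^n p(y_i|\theta)}{\prod_{i=1}^n p(y_i|\theta)} = p(\theta).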

Nevertheless, in many cases, we do have factorizable models

p(y|\theta)=\prod_{i=1}^n p(y_i|\theta).

Sure, this does not imply that the marginal y^{rep}|y is factorizable. LOO is only able to do PPC with test statistics T=T(y^{loo}_i, \theta), or anything that is linear under expectation, such as T=\sum_i y^{loo}_i.

But for such a factorizable model, is there really any information that is not included in the marginals y^{rep}_i|y? Or, to put it another way, if I run classic PPC and pass all marginal tests with test statistics T_i=T(y^{rep}_i, \theta), i=1,\dots,n, what else should I expect? (Multiple testing is another problem, but let's assume it is also solvable.)

OK, another example just to illustrate that what I wrote was not specific to time series (and I don't find the iid vs. non-iid distinction helpful here). The light speed example in BDA3 Ch 6 has a fully factorizable model, but the test statistic "minimum" makes sense only jointly for the replicates. For the marginals, the minimum is the same as y_i itself.
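To make that concrete, here is a toy sketch (simulated numbers standing in for the Newcomb data, with a crude flat-prior normal posterior) of the joint replication that the minimum statistic needs:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-in for the light speed data: a normal bulk plus two extreme low outliers
y = np.concatenate([rng.normal(25.0, 5.0, size=60), [-44.0, -2.0]])
n = len(y)

# Crude posterior draws for (mu, sigma) of a normal model with a flat prior
S = 4000
sigma = y.std(ddof=1) * np.sqrt((n - 1) / rng.chisquare(n - 1, size=S))
mu = rng.normal(y.mean(), sigma / np.sqrt(n))

# Joint replication: a full data set per posterior draw, then T(y_rep) = min over the set
y_rep = rng.normal(mu[:, None], sigma[:, None], size=(S, n))
T_rep = y_rep.min(axis=1)

print(np.mean(T_rep <= y.min()))   # tiny tail probability: the normal model cannot produce such a low minimum
```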

This is also problematic, as the LOO posteriors are different, so you lose the dependence in the joint posterior.

Generally the y_i|y are not independent, and the marginals do not contain the whole information.