Loo/cross-validation with correlated data

I have N data points which are most easily modeled as a multivariate normal distribution, with the mean and covariance depending on the model parameters \theta, i.e.
y \sim \mathcal{N}(\mu (x;\theta),\Sigma(\theta))

My question is: is there a way to cross-validate this model effectively? I see that the Vehtari, Gelman, and Gabry paper on loo assumes that the data are independent conditional on the parameters. Are there alternatives that would work for this case?


This will work with LOO out of the box, as long as all the y vectors are independent (conditional on the parameters). Correlations between the components of the y vectors won’t cause errors in the results. As an example, if y is a vector consisting of [wealth, income] where wealth and income are correlated, this will be ok. However, if you have several vectors of [wealth, income] coming from the same family, such that the vectors are correlated with each other by common confounders (e.g. rich parents tend to have rich children), LOO will not be an accurate representation of the out-of-sample error.

Apologies that I didn’t clarify this in the initial post, but I’ve only got the one data vector, with about N=1000 data points in it. The other thing I guess I should specify is that my basic googling has mostly turned up suggestions to select fitting and hold-out sets for cross-validation such that the correlations between them are small. My goal, however, is specifically to evaluate the correlation structure of the data. In particular, I want to know whether various mitigation strategies (based on external datasets) are successful in removing the correlations between our measurements (which are a systematic uncertainty in our final analysis), and the mitigations really only affect the most strongly correlated part of the data vector. Thus it seems to me that if I select hold-outs that are (effectively) uncorrelated with the fitted data points, I won’t have a useful answer to the question I’m asking (i.e. are the mitigations really working?). Is there a sensible way to go about answering these questions?

Maybe the paper “Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models” helps?
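For the normal case, the key identity can be sketched in a few lines of NumPy. This is only a minimal sketch for a single posterior draw, assuming \mu and \Sigma have already been evaluated at that draw; the function names are hypothetical and not part of the loo package. It relies on the well-known result that, for y \sim \mathcal{N}(\mu, \Sigma), the leave-one-out conditional of y_i is normal with mean y_i - g_i/\bar\sigma_{ii} and variance 1/\bar\sigma_{ii}, where g = \Sigma^{-1}(y-\mu) and \bar\sigma_{ii} is the i-th diagonal element of \Sigma^{-1}. The second function checks this against the textbook conditional computed by brute force on toy data.

```python
import numpy as np


def loo_normal_pointwise(y, mu, Sigma):
    """log p(y_i | y_{-i}, theta) for every i, given y ~ N(mu, Sigma).

    Hypothetical helper for ONE posterior draw; needs a single N x N inverse.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    g = Sigma_inv @ (y - mu)           # g = Sigma^{-1} (y - mu)
    prec_diag = np.diag(Sigma_inv)     # diagonal of the precision matrix
    loo_mean = y - g / prec_diag       # conditional mean of y_i given y_{-i}
    loo_var = 1.0 / prec_diag          # conditional variance of y_i given y_{-i}
    return -0.5 * np.log(2 * np.pi * loo_var) - 0.5 * (y - loo_mean) ** 2 / loo_var


def brute_force_conditional(y, mu, Sigma, i):
    """Same quantity for one i, from the textbook conditional of a normal."""
    keep = np.arange(len(y)) != i
    S_rest = Sigma[np.ix_(keep, keep)]         # covariance of y_{-i}
    s_cross = Sigma[i, keep]                   # covariance of y_i with y_{-i}
    w = np.linalg.solve(S_rest, s_cross)
    mean = mu[i] + w @ (y[keep] - mu[keep])
    var = Sigma[i, i] - w @ s_cross
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y[i] - mean) ** 2 / var


# Toy data (made up for illustration): a random positive-definite covariance.
rng = np.random.default_rng(0)
N = 6
A = rng.normal(size=(N, N))
Sigma = A @ A.T + N * np.eye(N)
mu = rng.normal(size=N)
y = rng.multivariate_normal(mu, Sigma)

fast = loo_normal_pointwise(y, mu, Sigma)
slow = np.array([brute_force_conditional(y, mu, Sigma, i) for i in range(N)])
print("max abs difference:", np.max(np.abs(fast - slow)))
```

These are per-draw values; they would still have to be combined across posterior draws to obtain an actual LOO estimate.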


Sorry, but I have a follow-up question about this paper. The procedure worked perfectly when I was using a normal distribution, but I’m now considering whether a multivariate t-distribution would be better suited to the data set. However, I’m having difficulty implementing the procedure given, specifically Eq. 18. It seems that \Sigma is an N\times N matrix, while y_{-i}-\mu_{-i} is a vector of size N-1, so the multiplication described there doesn’t work. Is there something I’m missing, or is there a typo in the equation?

ping @paul.buerkner

Yes, I think it should have been \Sigma^{-1}_{-i} (instead of \Sigma^{-1}) in the sense that we first compute the full inverse \Sigma^{-1} and then remove the i-th row and column. Sorry for that mistake.

This is then also consistent with the proof of Proposition 3.
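For anyone implementing the Student-t case, here is a minimal NumPy sketch for a single posterior draw. It assumes the standard conditional of a multivariate t and reads \Sigma^{-1}_{-i} as the inverse of the (N-1)\times(N-1) submatrix of \Sigma, since that is the matrix appearing in that conditional. Under this reading, dropping row and column i of the full inverse is only the first step: a rank-one correction is also applied to recover the submatrix inverse. The function names and toy data are hypothetical, and the result is checked against the ratio of the joint density to the marginal density of y_{-i}.

```python
import math
import numpy as np


def mvt_logpdf(y, mu, Sigma, nu):
    """Log density of a multivariate Student-t with scale matrix Sigma."""
    d = len(y)
    r = y - mu
    quad = r @ np.linalg.solve(Sigma, r)
    _, logdet = np.linalg.slogdet(Sigma)
    return (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
            - 0.5 * d * math.log(nu * math.pi) - 0.5 * logdet
            - 0.5 * (nu + d) * math.log1p(quad / nu))


def submatrix_inverse(P, i):
    """Inverse of Sigma without row/column i, built from P = Sigma^{-1}."""
    keep = np.arange(P.shape[0]) != i
    b = P[keep, i]
    # Drop row/column i of P, then apply the rank-one (Schur) correction.
    return P[np.ix_(keep, keep)] - np.outer(b, b) / P[i, i]


def loo_student_pointwise(y, mu, Sigma, nu):
    """log p(y_i | y_{-i}, theta) for every i under a multivariate t."""
    N = len(y)
    P = np.linalg.inv(Sigma)           # one full N x N inverse
    r = y - mu
    g = P @ r
    df = nu + N - 1                    # degrees of freedom of the conditional
    out = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        beta = r[keep] @ submatrix_inverse(P, i) @ r[keep]
        loc = y[i] - g[i] / P[i, i]                # conditional location
        scale2 = (nu + beta) / df / P[i, i]        # conditional squared scale
        z2 = (y[i] - loc) ** 2 / scale2
        out[i] = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                  - 0.5 * math.log(df * math.pi * scale2)
                  - 0.5 * (df + 1) * math.log1p(z2 / df))
    return out


# Toy data (made up for illustration).
rng = np.random.default_rng(1)
N, nu = 5, 4.0
A = rng.normal(size=(N, N))
Sigma = A @ A.T + N * np.eye(N)
mu = rng.normal(size=N)
y = mu + rng.normal(size=N)

fast_t = loo_student_pointwise(y, mu, Sigma, nu)
keeps = [np.arange(N) != i for i in range(N)]
slow_t = np.array([
    mvt_logpdf(y, mu, Sigma, nu)
    - mvt_logpdf(y[k], mu[k], Sigma[np.ix_(k, k)], nu)
    for k in keeps
])
print("max abs difference:", np.max(np.abs(fast_t - slow_t)))
```

The loop costs O(N^2) per observation after the single inversion, which avoids refactorizing an (N-1)\times(N-1) matrix for each i.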
