Loo/cross-validation with correlated data

darcykenworthy · June 4, 2021, 4:37am

I have N data points which are most easily modeled as a single multivariate normal distribution, with the mean and covariance depending on the model parameters \theta, i.e.
y \sim \mathcal{N}(\mu (x;\theta),\Sigma(\theta))

My question is, is there a way to cross-validate this model effectively? I see that the Vehtari, Gelman, and Gabry paper on loo makes the assumption that the data are independent conditional on the parameters. Are there alternatives that would work for this case?

Carlos_Parada · June 4, 2021, 9:58pm

This will work with LOO out of the box, as long as all the y vectors are independent (conditional on the parameters). Correlations between the components of the y vectors won’t cause errors in the results. As an example, if y is a vector consisting of [wealth, income] where wealth and income are correlated, this will be ok. However, if you have several vectors of [wealth, income] coming from the same family, such that the vectors are correlated with each other by common confounders (e.g. rich parents tend to have rich children), LOO will not be an accurate representation of the out-of-sample error.
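To make the unit of cross-validation concrete, here is a minimal sketch of that independent-vectors case: each [wealth, income] vector contributes one joint log-density term, so leave-one-out drops a whole vector at a time. The function name `mvn_logpdf_rows`, the numbers, and the use of fixed `mu` and `Sigma` (standing in for a single posterior draw) are assumptions for illustration, not anything from the loo package.

```python
import numpy as np

def mvn_logpdf_rows(Y, mu, Sigma):
    """Joint log density of each row of Y under N(mu, Sigma)."""
    d = Y.shape[1]
    L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T
    z = np.linalg.solve(L, (Y - mu).T)       # whitened residuals, shape (d, M)
    quad = np.sum(z ** 2, axis=0)            # Mahalanobis distance per row
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + quad)

# Hypothetical values for one posterior draw: components are correlated
# within a vector, but the vectors themselves are independent.
mu = np.array([1.0, 0.5])                    # [wealth, income] means
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.5]])
rng = np.random.default_rng(1)
Y = rng.multivariate_normal(mu, Sigma, size=6)

# One pointwise log-likelihood entry per vector, not per component.
log_lik = mvn_logpdf_rows(Y, mu, Sigma)
```

Repeating this for every posterior draw would give a draws-by-vectors matrix of pointwise log-likelihoods, which is the kind of input LOO expects in the factorized case.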

darcykenworthy · June 5, 2021, 4:02am

Apologies for not clarifying this in the initial post, but I’ve only got the one data vector, with about N=1000 data points in it. I should also mention that my basic googling has mostly turned up suggestions to choose the fitting and hold-out sets so that correlations between them are small. My goal, though, is specifically to evaluate the correlation structure of the data: in particular, whether various mitigation strategies (based on external datasets) succeed in removing the correlations between our measurements (which are a systematic uncertainty in our final analysis). The mitigations really only affect the most strongly correlated part of the data vector. So it seems to me that if I select hold-outs that are (effectively) uncorrelated with the fitted data points, I won’t get a useful answer to the question I’m asking (i.e. are the mitigations really working?). Is there a sensible way to go about answering these questions?

avehtari · June 8, 2021, 7:37pm

Maybe the paper “Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models” helps?
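For the normal case, the idea can be sketched with the standard identity for the conditional of one component of a multivariate normal given all the others, written in terms of the full inverse covariance. This is a minimal illustration for a single parameter draw, not the paper's code: the function name `loo_normal_pointwise`, the covariance kernel, and the example numbers are assumptions, and the step of combining these densities across posterior draws is not shown.

```python
import numpy as np

def loo_normal_pointwise(y, mu, Sigma):
    """Mean, sd and log density of y_i | y_{-i} under y ~ N(mu, Sigma), for every i."""
    Sigma_inv = np.linalg.inv(Sigma)
    g = Sigma_inv @ (y - mu)                 # g_i = [Sigma^{-1} (y - mu)]_i
    prec_diag = np.diag(Sigma_inv)           # [Sigma^{-1}]_{ii}
    cond_mean = y - g / prec_diag            # conditional mean of y_i given the rest
    cond_var = 1.0 / prec_diag               # conditional variance of y_i given the rest
    log_pred = -0.5 * (np.log(2.0 * np.pi * cond_var)
                       + (y - cond_mean) ** 2 / cond_var)
    return cond_mean, np.sqrt(cond_var), log_pred

# Hypothetical single data vector with correlated entries (one parameter draw).
x = np.arange(5.0)
mu = 0.5 * x
Sigma = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0) + 0.1 * np.eye(5)
rng = np.random.default_rng(0)
y = rng.multivariate_normal(mu, Sigma)

cond_mean, cond_sd, log_pred = loo_normal_pointwise(y, mu, Sigma)
```

Only one N\times N inverse is needed per draw, rather than N separate (N-1)\times(N-1) solves, which is what makes this practical for N around 1000.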

darcykenworthy · July 13, 2021, 6:38pm

Sorry, but I had a follow-up question about this paper. The procedure worked perfectly when I was using a normal distribution, but I’m now considering whether a multivariate t-distribution would be better suited to the data set. However, I’m having difficulty implementing the procedure given, specifically Eq. 18. There, \Sigma is an N\times N matrix, while y_{-i}-\mu_{-i} is a vector of size N-1, so the multiplication described doesn’t work. Is there something I’m missing, or is there a typo in the equation?

avehtari · July 19, 2021, 8:55am

ping @paul.buerkner

paul.buerkner · July 19, 2021, 12:36pm

Yes, I think it should have been \Sigma^{-1}_{-i} (instead of \Sigma^{-1}) in the sense that we first compute the full inverse \Sigma^{-1} and then remove the i-th row and column. Sorry for that mistake.

This is then also consistent with the proof of Proposition 3.
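A small sketch of the dimensions involved, assuming a hypothetical AR(1)-style covariance and a made-up residual vector. It shows that deleting row and column i from the full inverse gives an (N-1)\times(N-1) matrix that is conformable with y_{-i}-\mu_{-i}. It also shows a standard linear-algebra fact worth keeping in mind when implementing this: that matrix is in general not the same as the inverse of \Sigma with row and column i deleted, and the two differ by a rank-one term.

```python
import numpy as np

N, i = 5, 2
pos = np.arange(N)
Sigma = 0.7 ** np.abs(pos[:, None] - pos[None, :])   # hypothetical AR(1)-style covariance
Sigma_inv = np.linalg.inv(Sigma)
idx = np.delete(pos, i)                              # every index except i

# (a) full inverse first, then delete row and column i
inv_then_drop = Sigma_inv[np.ix_(idx, idx)]

# (b) delete row and column i first, then invert
drop_then_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])

# Block-inverse identity linking the two: (b) = (a) - b b^T / [Sigma^{-1}]_{ii}
b = Sigma_inv[idx, i]
rank_one = np.outer(b, b) / Sigma_inv[i, i]

# Both matrices accept a residual vector of length N-1 (made-up values here).
r = np.array([1.0, -0.5, 0.3, 2.0])
q_a = r @ inv_then_drop @ r
q_b = r @ drop_then_inv @ r
```

Because the rank-one term can be formed from entries of the full inverse, either quadratic form can be computed for every i from a single N\times N inversion.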
