LOO-CV for large data sets

I have data sets with roughly 300,000 observations. LOO-CV requires storing the log-densities of all observations for all posterior iterations and doing observation-by-observation comparisons between models. With 4 chains of 1000 post-warmup iterations each, the memory required to store all log-densities is, I believe, 300,000 × 4,000 × 8 bytes ≈ 9.6 GB. Has anyone worked with LOO-CV for data of this size? It's a bit of work to set up, and the model takes about a week to run for some data sets (and I have many data sets to analyze), so it would be great to know whether it's likely to be feasible.
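The memory estimate above can be checked with a quick back-of-the-envelope calculation (a sketch; the observation and draw counts are taken from the post, and 8 bytes assumes the log-densities are stored as float64):

```python
# Memory needed to hold the full pointwise log-likelihood matrix
# (observations x posterior draws) in double precision.
n_obs = 300_000            # observations per data set
n_draws = 4 * 1000         # 4 chains x 1000 post-warmup iterations
bytes_per_value = 8        # float64

total_bytes = n_obs * n_draws * bytes_per_value
print(total_bytes / 1e9)   # -> 9.6 (GB)
```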

Alternatively, has anyone done LOO-CV on just a random selection of, say, 1000 observations? Of course, support for one model over the other would be weaker, but if the differences are large, it might be enough to demonstrate them. Does this sound like a reasonable approach?
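The subsampling idea can be sketched numerically: draw a simple random subsample of observations, scale the mean pointwise elpd difference up to the full data set, and attach a standard error that shrinks like 1/sqrt(m). This is a hand-rolled illustration with simulated numbers, not real model output, and the simple-random-sampling estimator here is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 300_000  # total observations in the data set

# Hypothetical per-observation elpd differences between two models;
# in practice these would come from pointwise log-likelihoods.
elpd_diff_all = rng.normal(0.05, 1.0, size=N)

m = 1_000                                        # subsample size
idx = rng.choice(N, size=m, replace=False)       # simple random sample
sub = elpd_diff_all[idx]

# Scale the subsample mean up to all N observations; the standard
# error of the total shrinks as sqrt(m), so even m = 1000 gives a
# rough sense of whether the difference is large relative to its SE.
est_total_diff = N * sub.mean()
se_total_diff = N * sub.std(ddof=1) / np.sqrt(m)
print(est_total_diff, se_total_diff)
```

If the estimated total difference is many standard errors from zero, a subsample of this size may already be decisive; if not, a larger subsample (or the full computation) would be needed.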


This is what loo_subsample is for.

See the loo package's vignette on using LOO-CV with large data for how to use it.

Good luck!