Cross-validation vs. marginal likelihoods

This is one of the most divisive issues among Bayesians, but this new paper claims to show that “the marginal likelihood is formally equivalent to exhaustive leave-p-out cross-validation averaged over all values of p and all held-out test sets when using the log posterior predictive probability as the scoring rule”. If so, what are the arguments (besides computational ones) for and against marginalizing over p as opposed to fixing p = 1 like PSIS-LOO-CV does?

It also claims to show that “the log posterior predictive is the only coherent scoring rule under data exchangeability”, which I suspect more people around here will be happy about.


New paper but old idea, which is easily seen from the chain rule and exchangeability. Specifically, they also write “This has precisely the form of the log geometric intrinsic Bayes factor of Berger and Pericchi [1996] but motivated by a different route.” In the 1990s people experimented with intrinsic Bayes factors, but the results were not good, and the approach has also been criticized from a theoretical point of view. The example in the paper is very simple, and with 10^6 random training splits you can imagine how much time it would take to use this for more complex Stan models. For me it was a bit of a disappointment, as it was not clear whether they are proposing this as an efficient way to estimate the marginal likelihood, or whether the prediction task with varying training size (and no ordering, as in time series) would be relevant for some application.

The same arguments as for the marginal likelihood. By the chain rule, the log marginal likelihood is the sum (equivalently, n times the average) of leave-p-out CV scores where p goes from n to 1, and there is no need to average over permutations (if we start with p<n, then we need to average over splits, as in the intrinsic BF). So the log marginal likelihood measures predictive performance for one next observation given 0,…,n-1 observations, or equivalently joint predictive performance for n new observations conditional on 0 observations. Proponents of Bayes factors often say that this joint prediction is the important one. Considering the first version, predicting one next observation given 0,…,n-1 observations, this can make sense if we are interested in how well we could predict using the same model but with new data whose size we don’t know, only that it is between 0 and n-1. I have advocated LOO because it conditions the predictions on almost all of the data, but then this has also been the sensible choice in the applications I’ve worked with.
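
To make the chain-rule argument concrete, here is a minimal sketch that checks it numerically. It is not the paper's code: the Beta-Bernoulli model, the Beta(1, 1) prior, the toy data, and all function names are assumptions chosen only because the marginal likelihood and the posterior predictive are available in closed form.

```python
from itertools import combinations
from math import lgamma, log


def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(y, a=1.0, b=1.0):
    # Closed-form log marginal likelihood of binary data y
    # under a Bernoulli likelihood with a Beta(a, b) prior (assumed model).
    s, n = sum(y), len(y)
    return log_beta(a + s, b + n - s) - log_beta(a, b)


def log_pred(y_new, train, a=1.0, b=1.0):
    # Log posterior predictive probability of one observation given `train`.
    p1 = (a + sum(train)) / (a + b + len(train))
    return log(p1 if y_new == 1 else 1.0 - p1)


def sequential_sum(y, a=1.0, b=1.0):
    # Chain rule: predict y[i] from y[0..i-1], for i = 0, ..., n-1.
    return sum(log_pred(y[i], y[:i], a, b) for i in range(len(y)))


def leave_p_out_score(y, p, a=1.0, b=1.0):
    # Exhaustive leave-p-out CV: average over all test sets of size p of the
    # mean log predictive of the held-out points given the remaining n - p.
    idx = range(len(y))
    total, count = 0.0, 0
    for test in combinations(idx, p):
        held_out = set(test)
        train = [y[i] for i in idx if i not in held_out]
        total += sum(log_pred(y[j], train, a, b) for j in test) / p
        count += 1
    return total / count


y = [1, 0, 1, 1, 0, 1]  # toy data, purely illustrative
lml = log_marginal(y)
seq = sequential_sum(y)
cv_sum = sum(leave_p_out_score(y, p) for p in range(1, len(y) + 1))

print(lml, seq, cv_sum)
```

In this toy setting the three numbers should agree up to floating-point error. The exhaustive version enumerates every test set, so it is only feasible for very small n; LOO corresponds to the single term p = 1.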
