I didn’t know where to post this, but I guess here is probably best. In @paul.buerkner’s paper, he writes in a footnote:

In a Bayesian framework, models may be compared by various means for instance Bayes factors (Kass & Raftery, 1995), (approximate) cross-validation methods (Vehtari et al., 2017), information criteria (Vehtari et al., 2017; Watanabe, 2010) or stacking of posterior-predictive distributions (Yao, Vehtari, Simpson, & Gelman, 2017). A discussion of the pros and cons of these various approaches is outside the scope of the present paper.

So, my question is: can you point me to a paper that discusses the pros and cons of these approaches, preferably with some accompanying empirical evidence and not just opinions?

I’ll offer my (maybe not as humble as it should be) opinion: the idea of comparing Bayes factors and cross-validation, say, seems extremely weird to me, because these approaches try to capture very different aspects of model fit. @avehtari and others have been quite vocal about the inadequacy of Bayes factors in the so-called M-open setting (see Section 2 here), where none of the models under consideration is taken to be the “true” model. In that setting, they argue, cross-validation/stacking is the way to go. I find the critique of this view by Gronau & Wagenmakers particularly compelling, but Vehtari et al. were not amused.

I’m linking to papers by @avehtari, @yuling and @anon75146577 not because I want to explain their work better than they can, but because I tend to side with a view somewhat opposite to theirs and closer to Gronau and Wagenmakers’. I wanted to state this point of view without leaving out important references.

My view on the G&W critique was that if you’re going to criticize LOO, there are much stronger arguments to make (many of which we outlined in our comment).

Generally, BF vs LOO is kind of a silly question. It depends on what you want to do. The answer is probably neither. Work out what questions these tools can and cannot answer, and use them appropriately.

Yao, Vehtari, Simpson, and Gelman (2017) provides empirical evidence comparing BF, cross-validation, and stacking for model averaging in both M-open and M-closed cases.

Piironen and Vehtari (2017) provides empirical evidence on model selection, comparing BF, cross-validation, information criteria, and the projection predictive (projpred) approach.

I think the right way to think of these things is literally.

The Bayes factor is literally the ratio of marginal likelihoods, p(y|M1)/p(y|M2). As such, it has serious problems when p(y|M1) or p(y|M2) is not well defined, as when M1 or M2 has a noninformative or weak prior that is assigned arbitrarily.
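To make the prior-sensitivity problem concrete, here’s a toy sketch (my own example, not from any of the papers above): M1 says y_i ~ N(0, 1), M2 says y_i ~ N(mu, 1) with mu ~ N(0, tau²), where tau is an arbitrarily chosen “weak” prior scale. Both marginal likelihoods have closed forms via the sufficient statistic ybar, and the dropped conditional density of the data given ybar is the same under both models, so the BF is unaffected.

```python
import math

# Toy illustration of prior sensitivity of Bayes factors (Bartlett's paradox).
# M1: y_i ~ N(0, 1).  M2: y_i ~ N(mu, 1), mu ~ N(0, tau^2).
# Working with the sufficient statistic ybar ~ N(mu, 1/n).

def log_marginal_m1(ybar, n):
    # Under M1, ybar ~ N(0, 1/n)
    var = 1.0 / n
    return -0.5 * math.log(2 * math.pi * var) - ybar**2 / (2 * var)

def log_marginal_m2(ybar, n, tau):
    # Integrating mu out under M2: ybar ~ N(0, 1/n + tau^2)
    var = 1.0 / n + tau**2
    return -0.5 * math.log(2 * math.pi * var) - ybar**2 / (2 * var)

ybar, n = 0.3, 50  # modest evidence for a nonzero mean
for tau in (0.5, 5.0, 50.0):
    log_bf12 = log_marginal_m1(ybar, n) - log_marginal_m2(ybar, n, tau)
    print(f"tau = {tau:5.1f}  ->  BF(M1 vs M2) = {math.exp(log_bf12):.2f}")
```

With the same data, widening the supposedly innocuous prior scale tau flips the BF from favoring M2 to favoring the point null M1, and as tau grows without bound the evidence for M1 becomes arbitrarily strong. That is exactly the “arbitrary weak prior” problem.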

LOO is literally an estimate of out-of-sample prediction error. There is no literal reason to use it to choose M1 or M2 unless your goal is to have lower out-of-sample prediction error, which you might want in some settings and not others.
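The “literally” reading can be written out in a few lines. This is a deliberately simple sketch of my own (exact refitting with squared error on point predictions, not the Bayesian log-score LOO that the loo package computes): for each point, fit on the rest, predict the held-out point, and average the errors.

```python
# Exact leave-one-out: drop one point, fit on the rest, score the prediction.
# Compared here: a mean-only predictor vs a least-squares line.

def loo_mse(xs, ys, fit):
    """fit(xs, ys) -> a predict(x) function; returns mean squared LOO error."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        xtr, ytr = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        predict = fit(xtr, ytr)
        total += (ys[i] - predict(xs[i])) ** 2
    return total / n

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xb) ** 2 for x in xs)
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = yb - b * xb
    return lambda x: a + b * x

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 1.2, 1.9, 3.2, 3.9, 5.1]  # roughly linear
print("LOO MSE, mean-only model:", loo_mse(xs, ys, fit_mean))
print("LOO MSE, linear model:   ", loo_mse(xs, ys, fit_line))
```

The linear model wins on these data, and that’s all LOO tells you: which model predicts held-out points better. It says nothing about which model is “true.”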

Stacking is literally an estimate of a weighted-average model that minimizes out-of-sample prediction error. Again, if that’s your goal, fine.
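Here is a toy sketch of the stacking mechanics. Caveat: this is a squared-error point-prediction analogue of my own, not the actual method of Yao et al., which combines posterior predictive distributions and maximizes the LOO log score. Each model gets LOO predictions, then a weight is chosen to minimize the LOO error of the weighted combination.

```python
# Stacking sketch: choose the model weight that minimizes leave-one-out error
# of the weighted prediction (squared-error analogue of predictive stacking).

def loo_predictions(xs, ys, fit):
    preds = []
    for i in range(len(xs)):
        xtr, ytr = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        predict = fit(xtr, ytr)
        preds.append(predict(xs[i]))
    return preds

def fit_mean(xs, ys):
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    n = len(xs)
    xb, yb = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xb) ** 2 for x in xs)
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = yb - b * xb
    return lambda x: a + b * x

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.5, 0.8, 2.5, 2.6, 4.4, 4.6]

p_lin = loo_predictions(xs, ys, fit_line)   # LOO predictions, linear model
p_mean = loo_predictions(xs, ys, fit_mean)  # LOO predictions, mean-only model

best_w, best_err = 0.0, float("inf")
for k in range(101):  # grid search over the weight on the linear model
    w = k / 100
    err = sum((y - (w * a + (1 - w) * b)) ** 2
              for y, a, b in zip(ys, p_lin, p_mean))
    if err < best_err:
        best_w, best_err = w, err
print(f"stacking weight on linear model: {best_w:.2f}")
```

The weight is chosen purely for out-of-sample predictive performance, which is the point of the “literally” framing: stacking answers a prediction question, not a “which model is true” question.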

This idea of treating statistical methods literally can be very helpful. The p-value is literally the probability bla bla bla . . . not a statement about whether a hypothesis is true. The confidence interval is literally a procedure which, at least 95% of the time bla bla bla . . . not a measure of uncertainty. Etc.