Various approaches to model comparison

I didn’t know where to post this, but I guess here is probably best. In @paul.buerkner’s paper, there is a footnote saying:

In a Bayesian framework, models may be compared by various means, for instance Bayes factors (Kass & Raftery, 1995), (approximate) cross-validation methods (Vehtari et al., 2017), information criteria (Vehtari et al., 2017; Watanabe, 2010), or stacking of posterior-predictive distributions (Yao, Vehtari, Simpson, & Gelman, 2017). A discussion of the pros and cons of these various approaches is outside the scope of the present paper.

So, my question is: can you point me to a paper that discusses the pros and cons of these approaches, preferably with some accompanying empirical evidence and not just opinions?

3 Likes

I’ll offer my (maybe not as humble as it should be) opinion: the idea of comparing Bayes factors and cross-validation, say, seems extremely weird to me. These approaches try to capture very different aspects of model fit. @avehtari and others have been quite vocal about the inadequacy of Bayes factors in the so-called M-open setting (see Section 2 here), where none of the models under consideration is taken to be the “true” model. In that setting, they argue, cross-validation/stacking is the way to go. I find the critique of this view by Gronau & Wagenmakers particularly compelling, but Vehtari et al. were not amused.

I’m linking to papers by @avehtari, @yuling, and @anon75146577 not because I want to explain their work better than they can, but because I tend to side with a view somewhat opposite to theirs and closer to that of Gronau and Wagenmakers. I wanted to state this point of view without leaving out important references.

3 Likes

For an “on the ground” view, read Dani Navarro’s work. There’s a blog post (https://djnavarro.net/post/a-personal-essay-on-bayes-factors/) and also a paper.

My view on the G&W critique was that if you’re going to criticize LOO, there are much stronger arguments to make (many of which we outlined in our comment).

Generally, “BF vs. LOO” is kind of a silly question. It depends on what you want to do, and the answer is probably neither. Work out what questions these tools can and cannot answer, and use them appropriately.

7 Likes

Yao, Vehtari, Simpson, and Gelman (2017) presents empirical evidence comparing Bayes factors, cross-validation, and stacking for model averaging in both M-open and M-closed cases.

Piironen and Vehtari (2017) presents empirical evidence on model selection, comparing Bayes factors, cross-validation, information criteria, and the projection predictive (projpred) approach.

3 Likes

The Piironen and Vehtari (2017) paper seems to be what I’m after - thanks :)

2 Likes

I think the right way to think of these things is literally.

Bayes factor is literally the ratio of marginal likelihoods, p(y|M1)/p(y|M2). As such, it has serious problems when p(y|M1) or p(y|M2) are not well defined, as when M1 or M2 has noninformative or weak priors that are assigned arbitrarily.
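To make that concrete, here's a minimal sketch (a toy normal-normal example I made up for illustration, not from any of the papers above). With theta integrated out, the marginal likelihood has a closed form, and widening the prior tenfold cuts p(y|M) by roughly a factor of ten:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=20)  # toy data from N(0.3, 1)

def log_marginal_likelihood(y, prior_sd, sigma=1.0):
    """log p(y | M) for y_i = theta + noise, theta ~ N(0, prior_sd^2),
    noise ~ N(0, sigma^2). Integrating theta out gives a closed form:
    y ~ N(0, sigma^2 I + prior_sd^2 11')."""
    n = len(y)
    cov = sigma**2 * np.eye(n) + prior_sd**2 * np.ones((n, n))
    return stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# Same likelihood, increasingly "noninformative" priors on theta:
for tau in [1.0, 10.0, 100.0]:
    print(f"prior_sd = {tau:6.1f}   log p(y|M) = {log_marginal_likelihood(y, tau):8.3f}")
# log p(y|M) drops by about log(10) each time prior_sd grows 10x, so a
# Bayes factor against any fixed alternative is driven by the arbitrary
# prior scale, not just by the data.
```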

LOO is literally an estimate of out-of-sample prediction error. There is no literal reason to use it to choose M1 or M2 unless your goal is to have lower out-of-sample prediction error, which you might want in some settings and not others.
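By contrast, here's a sketch of exact leave-one-out for the same toy model (conjugate, so every leave-one-out posterior is available in closed form; this is hand-rolled for illustration, not the PSIS-LOO estimator of Vehtari et al.):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=20)  # same toy data as above

def elpd_loo_exact(y, prior_sd, sigma=1.0):
    """Exact sum of log p(y_i | y_{-i}) for y_i ~ N(theta, sigma^2),
    theta ~ N(0, prior_sd^2), using the conjugate posterior update."""
    n = len(y)
    total = 0.0
    for i in range(n):
        y_rest = np.delete(y, i)
        post_prec = 1.0 / prior_sd**2 + (n - 1) / sigma**2
        post_var = 1.0 / post_prec
        post_mean = post_var * y_rest.sum() / sigma**2
        # log posterior-predictive density of the held-out point
        total += stats.norm.logpdf(y[i], post_mean, np.sqrt(post_var + sigma**2))
    return total

for tau in [1.0, 10.0, 100.0]:
    print(f"prior_sd = {tau:6.1f}   elpd_loo = {elpd_loo_exact(y, tau):8.3f}")
# Unlike the marginal likelihood, elpd_loo barely moves as the prior
# becomes vague: the n-1 retained points dominate each posterior.
```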

Stacking is literally an estimate of a weighted-average model that minimizes out-of-sample prediction error. Again, if that’s your goal, fine.
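And a sketch of how stacking weights are found in practice: choose simplex weights that maximize the summed log of the weighted mixture of leave-one-out predictive densities. The `lpd` matrix below is simulated purely for illustration; in real use it would come from LOO (e.g., PSIS-LOO) for each candidate model:

```python
import numpy as np
from scipy.optimize import minimize

# lpd[i, k] = log p(y_i | y_{-i}, M_k): LOO log predictive densities for
# n = 50 points under K = 3 candidate models (simulated, for illustration).
rng = np.random.default_rng(2)
lpd = rng.normal(-1.3, 0.3, size=(50, 3))
lpd[:, 0] += 0.1  # make model 1 slightly better on average

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def neg_stacking_objective(z):
    w = softmax(z)  # unconstrained z -> weights on the simplex
    # negative sum over points of log of the weighted predictive mixture
    return -np.log(np.exp(lpd) @ w).sum()

res = minimize(neg_stacking_objective, np.zeros(3), method="Nelder-Mead")
print("stacking weights:", np.round(softmax(res.x), 3))
```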

This idea of treating statistical methods literally can be very helpful. The p-value is literally the probability bla bla bla . . . not a statement about whether a hypothesis is true. The confidence interval is literally a procedure which, at least 95% of the time bla bla bla . . . not a measure of uncertainty. Etc.
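For the record, the literal statements being gestured at are the standard textbook ones (spelled out here in my own wording, just to complete the analogy):

```latex
% p-value: the probability, under the null, of a test statistic at least
% as extreme as the one observed -- not the probability that H_0 is true
p = \Pr\!\left( T(Y^{\mathrm{rep}}) \ge T(y^{\mathrm{obs}}) \mid H_0 \right)

% 95% confidence interval: a procedure whose intervals cover the true
% parameter in at least 95% of repeated samples, whatever theta is
\Pr\!\left( \theta \in \mathrm{CI}(Y) \mid \theta \right) \ge 0.95
\quad \text{for every } \theta
```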

13 Likes