I just heard about Bayes factors for the first time. Why is it such a big deal to compare the posterior probabilities of two models? Isn't a model's posterior probability simply `exp(elpd_loo + p_loo)`? Sounds easy to compute a ratio, no? Or are the values simply too small?
Also, why does the Wikipedia article say “an advantage of the use of Bayes factors is that it automatically, and quite naturally, includes a penalty for including too much model structure”? What penalty? I don’t see any penalty in the quotient of two probabilities. Is it referring to priors? But what if your priors are flat? Doesn’t the “penalty” then disappear?
Curious newbies yearn to know…
The numerator and denominator in the Bayes factor are the likelihoods integrated over the prior (NOT the likelihood integrated over the posterior). See equation 2 here https://www.andrew.cmu.edu/user/kk3n/simplicity/KassRaftery1995.pdf
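Written out in the Wikipedia-style notation used below (writing $\pi(\theta_k | M_k)$ for the prior of model $M_k$), the quantity in both the numerator and the denominator is the marginal likelihood, i.e. the likelihood averaged over the prior:

$$
\textrm{Pr}(D | M_k) = \int \textrm{Pr}(D | \theta_k, M_k)\,\pi(\theta_k | M_k)\,d\theta_k,
\qquad
K = \frac{\textrm{Pr}(D | M_1)}{\textrm{Pr}(D | M_2)}
$$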
If an unbounded parameter gets a flat prior, that prior is improper (it cannot be normalized; any normalized version would have density zero everywhere), so such a model doesn't work with Bayes factors. If we instead take a proper prior and let it get flatter and flatter, the prior density becomes minuscule everywhere and the model gets penalized into oblivion. So if the priors are flat, the penalty does the opposite of disappearing. If the priors are very narrow, then the penalty gets smaller, which should be expected, as the model has, in effect, a less flexible structure.
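Here is a minimal numerical sketch of that effect (a toy model with made-up numbers, not from any real analysis): a single observation y ~ Normal(mu, 1) with a Normal(0, tau) prior on mu, with the marginal likelihood computed by brute-force quadrature. Widening the prior drives the marginal likelihood toward zero.

```python
# Toy demonstration: the marginal likelihood is the likelihood averaged over
# the prior, so spreading the prior thinner shrinks it toward zero.
import numpy as np
from scipy import stats
from scipy.integrate import quad

y = 0.3  # made-up observation

def marginal_likelihood(tau):
    # integrate p(y | mu) * p(mu) over mu, for the prior mu ~ Normal(0, tau)
    integrand = lambda mu: stats.norm.pdf(y, loc=mu, scale=1.0) * stats.norm.pdf(mu, loc=0.0, scale=tau)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

for tau in [0.5, 1.0, 5.0, 50.0, 500.0]:
    print(f"prior sd = {tau:6.1f}  ->  Pr(D | M) = {marginal_likelihood(tau):.5f}")
```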
The tricky part here in the notation of Wikipedia and of Kass & Raftery is to realize that when they write (here following the Wikipedia notation) $\textrm{Pr}(D|M_1)$ they are not referring to the likelihood for some specific (e.g. draw-wise) value of the model parameters $\theta_1$ (i.e. $\textrm{Pr}(D|\hat{\theta}_1)$), nor to this likelihood integrated over the posterior for $\theta_1$, but rather to the likelihood integrated over the prior. Here $M_1$ refers to the entire Bayesian model 1, including its prior.
Wow, thanks!
Doesn’t all of this imply an extremely severe penalty for every added parameter? Isn’t it the case that every added parameter expands the parameter space exponentially, so that the “total probability mass” of 1 is spread exponentially thinner, resulting in much lower marginal probabilities unless the added parameter makes an earth-shattering predictive contribution?
And doesn’t it also imply that even with just a single parameter to estimate, any kind of prior (say, normal or uniform) that is symmetrically centered on the true value will result in a lower marginal probability the wider we allow its spread to be? E.g. if we’re estimating a binomial p whose true value happens to be 0.5, a model with a U(0.3, 0.7) prior on p will have a higher marginal probability than an otherwise identical model where the prior is U(0.1, 0.9)?
Yes, if you widen the priors, then, assuming that the likelihood gets lower in the tails of the priors, the Bayes factor penalizes the widening. This is entirely consistent with the idea of penalizing model complexity/flexibility, because wider priors equal more flexibility. If you are willing to assume that the likelihood gets lower in the tails of your priors (and you think working with Bayes factors is a good idea), then you shouldn't be widening your priors.
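To put numbers on the binomial example from the question, here is a quick sketch with made-up data (k = 10 successes in n = 20 trials, i.e. data consistent with p = 0.5):

```python
# Compare marginal likelihoods under a U(0.3, 0.7) and a U(0.1, 0.9) prior
# on the binomial probability p, for made-up data of 10 successes in 20 trials.
from scipy import stats
from scipy.integrate import quad

n, k = 20, 10

def marginal_likelihood(lo, hi):
    # likelihood averaged over a Uniform(lo, hi) prior on p
    integrand = lambda p: stats.binom.pmf(k, n, p) / (hi - lo)
    value, _ = quad(integrand, lo, hi)
    return value

m_narrow = marginal_likelihood(0.3, 0.7)  # U(0.3, 0.7) prior
m_wide = marginal_likelihood(0.1, 0.9)    # U(0.1, 0.9) prior
print(f"Pr(D | M_narrow) = {m_narrow:.4f}")
print(f"Pr(D | M_wide)   = {m_wide:.4f}")
print(f"Bayes factor (narrow over wide) = {m_narrow / m_wide:.2f}")
```

The narrower prior wins, because the wider prior spends mass on values of p where the likelihood is low.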
Whether adding a parameter results in a penalty (and preference for the simpler model) depends on the competition between two things:
- Does the inclusion of the parameter yield higher likelihoods near the best fitting value?
- Does the prior encompass regions of parameter space over the new parameter where the likelihood is lower than it would be without including that parameter?
For example, if I add a parameter that yields modestly larger likelihoods everywhere in parameter space that is consistent with the prior that I place on it, then there will be no penalty even though the increase in the likelihood is modest. What matters is the change in the likelihood integrated (i.e. averaged) over the prior.
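As a toy illustration of that competition (made-up numbers; with the residual variance known and fixed at 1 so the marginal likelihoods have a closed form), compare M0, which fixes the mean at zero, against M1, which adds a mean parameter mu with a Normal(0, tau^2) prior:

```python
# M0: y_i ~ Normal(0, 1).  M1: y_i ~ Normal(mu, 1) with mu ~ Normal(0, tau^2).
# With the variance known, the Bayes factor depends on the data only through
# the sample mean ybar:
#   BF(M1 vs M0) = N(ybar | 0, tau^2 + 1/n) / N(ybar | 0, 1/n)
import numpy as np
from scipy import stats

n, ybar = 20, 0.35  # made-up sample size and sample mean

def bayes_factor(tau):
    m1 = stats.norm.pdf(ybar, loc=0.0, scale=np.sqrt(tau**2 + 1.0 / n))
    m0 = stats.norm.pdf(ybar, loc=0.0, scale=np.sqrt(1.0 / n))
    return m1 / m0

for tau in [0.1, 0.5, 2.0, 10.0]:
    print(f"prior sd = {tau:5.1f}  ->  BF(M1 vs M0) = {bayes_factor(tau):.3f}")
```

With the sample mean sitting a bit away from zero, a tight prior on the new parameter lets M1 win modestly, while a very wide prior hands the win back to the simpler model.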
Thanks for coming to the rescue once more, Jacob.