Hi, I am trying to understand how elpd fits into the model comparison picture and specifically how it relates to (traditional) log-likelihood ratios.
I sometimes compare models by the log of the ratio of their likelihoods obtained by leave-one-out cross-validation (or k-fold cross-validation as an approximation):
\operatorname{LLR}_{1,2} = \log\left(\frac{L_1}{L_2}\right) = \log(L_1) - \log(L_2)
Likelihoods are usually calculated using a point estimate for \theta (often obtained by MLE/MAP estimation through optimization). Assuming that \hat{\theta}_{-i} is the parameter estimate obtained by fitting the model to all data points except y_i, the likelihood can be written as follows (for discrete outcomes):
\operatorname{L} = \sum^{N}_{i=1} \operatorname{Pr}(y_{i} \mid \hat{\theta}_{-i})
This seems to be awfully similar to how the elpd is calculated, assuming we draw S samples from the posterior obtained by using all data points except y_i (again, for discrete outcomes):
\widehat{\mathrm{elpd}}_{i}=\log \left(\frac{1}{S} \sum_{s=1}^{S} \operatorname{Pr}\left(y_{i} \mid \theta_{-i,s}\right)\right)
\widehat{\mathrm{elpd}}_{\mathrm{full}}=\sum_{i=1}^{N} \widehat{\mathrm{elpd}}_{i}
\widehat{\mathrm{elpd\text{-}diff}}_{1,2} = \widehat{\mathrm{elpd}}_{\mathrm{full},1} - \widehat{\mathrm{elpd}}_{\mathrm{full},2}.
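In code, I compute this roughly as in the minimal numpy sketch below; `lik_1` and `lik_2` are hypothetical N×S arrays of leave-one-out likelihoods \operatorname{Pr}(y_i \mid \theta_{-i,s}) for two models (not output of any particular package):

```python
import numpy as np

def elpd_loo(lik):
    """lik: (N, S) array of leave-one-out likelihoods Pr(y_i | theta_{-i,s}).

    Returns the pointwise elpd_i and their sum elpd_full. In practice one would
    work with log-likelihoods and a log-sum-exp for numerical stability.
    """
    elpd_i = np.log(lik.mean(axis=1))  # log of the posterior-averaged likelihood
    return elpd_i, elpd_i.sum()

# hypothetical likelihood arrays for two models (N = 5 points, S = 4000 draws)
rng = np.random.default_rng(1)
lik_1 = rng.uniform(0.2, 0.9, size=(5, 4000))
lik_2 = rng.uniform(0.1, 0.8, size=(5, 4000))

elpd_diff_12 = elpd_loo(lik_1)[1] - elpd_loo(lik_2)[1]
```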
So could we say that elpd differences are a “full Bayesian version” of a log-likelihood ratio that accounts for (a) the prior distribution and (b) uncertainty about \theta? As far as I can tell, elpd differences and log-likelihood ratios should be identical for flat priors and symmetric posteriors, right?
If you change this to:
\log \left(\operatorname{Pr}(y_i \mid \hat{\theta}_{-i})\right)
then you will have replaced the integral over the posterior distribution in equations 4 and 5 of @avehtari et al. with the maximum likelihood estimate. I think this is the crux of your question. Note, however, that in your original post you wrote down something different: in your expression for L you sum the leave-one-out likelihoods over points, but you need to take the product (or sum the logarithms).
Not in general. The likelihood evaluated at the MLE does not bear any strict relationship to the average likelihood integrated over the posterior distribution.
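To illustrate with made-up numbers (a Poisson example I'm inventing here, not taken from the paper): the log of the posterior-averaged likelihood and the plug-in log-likelihood at a point estimate generally disagree, because the likelihood is a nonlinear function of the parameter.

```python
import numpy as np
from scipy.stats import poisson

# hypothetical posterior draws of a Poisson rate, and one held-out count
lam_draws = np.array([1.0, 2.0, 5.0])
y_i = 2

# Bayesian LOO term: log of the likelihood averaged over posterior draws
log_avg = np.log(poisson.pmf(y_i, lam_draws).mean())      # about -1.72

# plug-in term: log-likelihood at a single point estimate (posterior mean,
# standing in here for an MLE/MAP fit)
log_plugin = np.log(poisson.pmf(y_i, lam_draws.mean()))   # about -1.39
```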
Edit: Here’s the Vehtari, Gelman & Gabry paper I mentioned above
I’m resurrecting the thread because my question is so similar.
The oft-cited rule of thumb is that Bayesian elpd_loo differences smaller than 4 in absolute value are small. It seems to me that, in comparison to the traditional frequentist likelihood-ratio test, this guideline is very conservative.
Like the original poster, I too am referring to a context of a binary/categorical response. Assuming weak priors, traditional wisdom holds that each additional (population-level) parameter overfits by 1 unit on the log score scale, i.e. by 2 units on the Deviance scale. In simulated binary/categorical data with N(0, 2.5) priors, I have found this to be correct for both exact frequentist LOO-CV and Bayesian elpd_loo/looic.
Thus, for an added parameter’s contribution to exceed its overfitting, it must reduce in-sample deviance by 2 units or more. This corresponds to a likelihood-ratio test p-value of 0.157, which has indeed been proposed as a threshold for variable inclusion by some frequentist statisticians.
By contrast, for an added parameter's contribution to both exceed its overfitting and improve the out-of-sample log score (deviance) by 4 (8), thus ceasing to be "small", it must reduce in-sample deviance by 2 + 8 = 10 units or more. This corresponds to a likelihood-ratio test p-value of 0.00157.
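(As a sanity check on those numbers, the tail probabilities follow from a chi-square distribution with 1 degree of freedom for the single added parameter; a quick sketch:)

```python
from scipy.stats import chi2

# p-value for an in-sample deviance reduction of 2 with one added parameter
p_overfit_only = chi2.sf(2, df=1)    # ~ 0.157

# p-value for a reduction of 10 (2 for overfitting + 8 so that the
# out-of-sample improvement is 4 elpd units, i.e. no longer "small")
p_elpd_diff_4 = chi2.sf(10, df=1)    # ~ 0.00157
```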
So whereas traditionally a log-likelihood (deviance) improvement of 1 (2) was enough for a parameter to be regarded as having predictive potential (improving AIC/LOO-CV), the new Bayesian guideline tightens this requirement by a factor of 5. Am I missing something?
This is based on the theoretical and experimental analysis in Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison. The uncertainty in elpd_loo differences can be quantified, and people often use the summary diff_se to decide whether the difference is small or big compared to the uncertainty. That paper shows that when the magnitude of the difference is smaller than 4, the normal approximation is often not good and diff_se is underestimated. Thus, that 4 refers specifically to that case. In addition (although this is not analysed in that paper), if the magnitude of the difference is larger than 4, then the corresponding LOO weights (aka pseudo-BMA weights) are close to 0 and 1 (e.g. with 4 the weights are about 0.02 and 0.98). If the weights are not close to 0 and 1 (e.g. with 2 the weights are about 0.12 and 0.88), then instead of selecting one model we should do model averaging, model expansion, or overall think more carefully, to avoid model-selection-induced overfitting/bias.
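As an aside, for two models the weights mentioned above depend only on the elpd difference; here is a minimal sketch of the plain pseudo-BMA computation (a softmax over elpd_loo values, without the Bayesian bootstrap adjustment of pseudo-BMA+):

```python
import numpy as np

def pseudo_bma_weights(elpd_loo_values):
    """Plain pseudo-BMA weights: softmax of elpd_loo values."""
    z = np.asarray(elpd_loo_values, dtype=float)
    w = np.exp(z - z.max())  # subtract the max for numerical stability
    return w / w.sum()

print(pseudo_bma_weights([0.0, 4.0]))  # ~ [0.018, 0.982]
print(pseudo_bma_weights([0.0, 2.0]))  # ~ [0.119, 0.881]
```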
Thus, the use of 4 is different from the threshold you mention for the classic likelihood ratio: it intentionally aims for a safer workflow in which a magnitude of difference smaller than 4 is an indication that more thinking is needed. If that further thinking provides justification to proceed with model selection despite the smaller difference, then that's fine, as long as those justifications are made explicit.
Did this and the paper clarify the issue? Let me know if the above brief explanation or the paper is unclear, and I can try to improve them.