Relationship between elpd differences and likelihood ratios

Hi, I am trying to understand how elpd fits into the model comparison picture and specifically how it relates to (traditional) log-likelihood ratios.

I sometimes compare models by the log of the ratio of their likelihoods (equivalently, the difference of their log-likelihoods), with the likelihoods obtained by leave-one-out cross-validation (or k-fold cross-validation as an approximation):

\operatorname{LLR}_{1,2} = \log\left(\frac{L_1}{L_2}\right) = \log(L_1) - \log(L_2)

Likelihoods are usually calculated using a point estimate for \theta (often obtained by MLE/MAP estimation through optimization). Assuming that \hat\theta_{-i} is the parameter estimate obtained by fitting the model to all data points except y_i, the likelihood can be written as follows (for discrete outcomes):

\operatorname{L} = \sum_{i=1}^{N} \operatorname{Pr}(y_{i} \mid \hat\theta_{-i})

This seems to be awfully similar to how the elpd is calculated, assuming we draw S samples from the posterior obtained by using all data points except y_i (again, for discrete outcomes):

\widehat{\mathrm{elpd}}_{i}=\log \left(\frac{1}{S} \sum_{s=1}^{S} \operatorname{Pr}\left(y_{i} \mid \theta_{-i,s}\right)\right)

\widehat{\mathrm{elpd}}_{\mathrm{full}}=\sum_{i=1}^{N} \widehat{\mathrm{elpd}}_{i}

\widehat{\mathrm{elpd\_diff}}_{1,2} = \widehat{\mathrm{elpd}}_{\mathrm{full},1} - \widehat{\mathrm{elpd}}_{\mathrm{full},2}.
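
For concreteness, here is a minimal sketch of the three equations above in Python. It assumes you already have, for each model, an S × N matrix `log_lik` whose entry [s, i] is \log \operatorname{Pr}(y_i \mid \theta_{-i,s}), i.e. the pointwise log-likelihood of y_i under draw s from the leave-i-out posterior; the matrices below are filled with made-up numbers just to make it runnable:

```python
import numpy as np
from scipy.special import logsumexp

def elpd_full(log_lik):
    """elpd-hat from an (S, N) matrix of pointwise LOO log-likelihoods.

    Column i holds log Pr(y_i | theta_{-i,s}) for the S posterior draws
    obtained with y_i held out.
    """
    S = log_lik.shape[0]
    # elpd_i = log( (1/S) * sum_s Pr(y_i | theta_{-i,s}) ),
    # computed stably on the log scale via logsumexp.
    elpd_i = logsumexp(log_lik, axis=0) - np.log(S)
    return elpd_i.sum()

# Hypothetical LOO log-likelihood draws for two models (placeholders only):
rng = np.random.default_rng(0)
log_lik_1 = rng.normal(-1.0, 0.3, size=(4000, 50))
log_lik_2 = rng.normal(-1.1, 0.3, size=(4000, 50))

elpd_diff = elpd_full(log_lik_1) - elpd_full(log_lik_2)
print(f"elpd_diff (model 1 minus model 2): {elpd_diff:.2f}")
```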

So could we say that elpd differences are a “full Bayesian version” of a log-likelihood ratio that accounts for (a) the prior distribution and (b) uncertainty about \theta? As far as I can tell, elpd differences and log-likelihood ratios should be identical for flat priors and symmetric posteriors, right?

If you change your expression for \widehat{\mathrm{elpd}}_i to:

\log \left(\operatorname{Pr}\left(y_i \mid \hat{\theta}_{-i}\right)\right)

then you will have replaced the integral over the posterior distribution in equations 4 and 5 of @avehtari et al. with a plug-in maximum likelihood estimate. I think this is the crux of your question. Note, however, that in your original post you wrote down something different: in your expression for L you sum the leave-one-out likelihoods over points, but you need to take their product (or, equivalently, sum their logarithms).
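
To make the product-versus-sum point concrete, here is a small sketch (the name `log_lik_mle` is made up; it stands for the N leave-one-out plug-in log-likelihoods \log \operatorname{Pr}(y_i \mid \hat\theta_{-i}), e.g. obtained by refitting with an optimizer N times):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
# Hypothetical vector: log Pr(y_i | theta_hat_{-i}) at the leave-i-out
# point estimate (placeholder values, for illustration only).
log_lik_mle = rng.normal(-1.0, 0.3, size=N)

# Correct plug-in LOO log-likelihood: sum the *logs* over points,
# i.e. the log of the product of the per-point likelihoods.
log_L = log_lik_mle.sum()

# The log of the original post's expression (sum of the raw likelihoods),
# as it would enter the LLR -- not the same quantity.
wrong_L = np.log(np.exp(log_lik_mle).sum())

print(f"log-product (correct): {log_L:.2f}   log-sum (incorrect): {wrong_L:.2f}")
```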

Not in general. The likelihood evaluated at the MLE does not bear any strict relationship to the average likelihood integrated over the posterior distribution.
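
A toy example of my own (not from the paper) makes this concrete under exactly the favorable conditions you describe: a normal model with known \sigma and a flat prior on \mu, so every leave-one-out posterior is symmetric and centered on the MLE. The plug-in and posterior-averaged LOO log densities still differ, because the exact LOO predictive \operatorname{N}(\bar y_{-i}, \sigma^2 + \sigma^2/(n-1)) carries the posterior uncertainty in its variance:

```python
import numpy as np
from scipy.stats import norm

# Normal model with known sigma and a flat prior on mu: the LOO posterior
# for mu is normal (symmetric) and centered exactly at the MLE, yet the
# plug-in and posterior-averaged LOO log densities still disagree.
rng = np.random.default_rng(42)
sigma = 1.0
y = rng.normal(0.0, sigma, size=20)
n = len(y)

plugin, bayes = 0.0, 0.0
for i in range(n):
    y_rest = np.delete(y, i)
    mu_hat = y_rest.mean()  # MLE = posterior mean = posterior mode
    # Plug-in: Pr(y_i | mu_hat), ignoring posterior uncertainty.
    plugin += norm.logpdf(y[i], mu_hat, sigma)
    # Exact posterior predictive: integrates the N(mu_hat, sigma^2/(n-1))
    # posterior for mu into the predictive variance.
    bayes += norm.logpdf(y[i], mu_hat, np.sqrt(sigma**2 + sigma**2 / (n - 1)))

print(f"plug-in LOO log-likelihood: {plugin:.3f}")
print(f"elpd-hat (exact):           {bayes:.3f}")
```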

Edit: Here’s the Vehtari, Gelman & Gabry paper I mentioned above: Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
