Lppd > log likelihood. How is this even possible?

This is a simple single-level Bernoulli model with weakly regularizing priors. Nothing fancy. Yet somehow its lppd is higher than the log likelihood of the corresponding frequentist model, as seen below:

> baymod <- brm(y ~ x1 + x2 + x3 + x4, family = bernoulli, prior = prior(normal(0,2.5))+prior(normal(0,0.3), class = b, coef = x2), data = anon, iter = 5000, seed = 2023, cores = 4, backend = "cmdstanr")

# (x2 has smaller units than the rest, hence the tighter prior)

> baymod_loo <- loo(baymod)
> baymod_loo$elpd_loo + baymod_loo$p_loo

[1] -1453.191

Now for the frequentist version, which should have the highest possible log likelihood given that it is fit by maximizing the log likelihood:

> logLik(glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = anon))

'log Lik.' -1453.282 (df=6)

How on earth is this possible? Beating an ML model at its own game, despite regularization by priors? The mind boggles. Can anyone help unboggle it? The anonymized datafile is attached.

anon.txt (29.0 KB)

If you look at the actual log likelihood from the brms model, you’ll see that it’s never higher than the ML log likelihood:

ll <- logLik(baymod) |> rowSums()  # total log likelihood for each posterior draw
summary(ll)
max(ll)  # even the best draw does not exceed the ML log likelihood

The main reason the Bayesian log likelihood is mostly substantially lower than the ML value probably isn't regularization by the prior. Rather, it's the competition between the height of the likelihood function, which peaks at the ML estimate, and the volume of parameter space in regions of progressively lower likelihood, which grows rapidly. The region in a very tight ball around the ML estimate is tiny and holds only a small part of the posterior mass, even if that's where the posterior mode sits.
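To put rough numbers on that gap, here is a minimal sketch reusing ll from above and the glm call from the first post:

ll_ml <- as.numeric(logLik(glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = anon)))
ll_ml - max(ll)     # even the best posterior draw falls (slightly) short of the ML value
ll_ml - median(ll)  # a typical draw falls well short of it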

I don’t know enough about how elpd_loo and p_loo get summarized down to point estimates to explain why their sum is (a tad) higher. Maybe @avehtari can chime in?


Wow, you're right: the draw-wise log likelihoods are never higher. Something very perplexing is afoot.

FYI, the same weird result occurs when we calculate the log posterior predictive density “manually”:

> sum(dbinom(anon$y, size = 1, prob = fitted(baymod)[,1], log = T))

[1] -1453.191

Interesting. We get this high log likelihood when we plug in the posterior mean from the Bayesian model, which is also apparently how p_loo gets calculated. I don't know why glm finds a worse solution, but it definitely does:

glmfit <- glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = anon)
sum(dbinom(anon$y, size = 1, prob = fitted(glmfit), log = TRUE))       # log likelihood at the glm fitted values
logLik(glmfit)                                                         # same thing, via logLik()
sum(dbinom(anon$y, size = 1, prob = fitted(baymod)[, 1], log = TRUE))  # log score at the posterior-mean fitted values

The first thing to confirm is that glm is definitely fitting the same model (for example, that it is using the same link function, etc.). I think it is. The second thing to try to understand is whether it is really guaranteed to converge to the MLE, which I thought it was since this is the canonical link, but this is the end of my expertise about the frequentist fitting routines underpinning glm.
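For the first check, a minimal sketch (assuming glmfit holds the glm fit from above; brms's bernoulli family defaults to the logit link, the same link glm uses here):

family(glmfit)$link                                           # "logit"
cbind(glm = coef(glmfit), brm = fixef(baymod)[, "Estimate"])  # point estimates side by side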

A final question is whether there are numerical variations in how the fitted values are extracted that might lead to minor variation (e.g. due to rounding and numerical precision issues) in the result of the dbinom calls.

Edit: The high log likelihood does not result from plugging the posterior mean parameter estimates into the log-likelihood computation; it arises when we plug in the posterior mean fitted values! And this could indeed yield a higher likelihood than the maximum likelihood solution. See below.


I guess brms has a better likelihood-maximization algorithm than stats::glm(). It is much newer software after all. The likelihood is so strong and the priors so weak that the posterior means end up reflecting nothing but the quality of the likelihood optimizer.

I cannot have this, though. I'm looking to present a comparison of six models, frequentist and Bayesian, along with their raw log scores and log scores penalized by methods such as AIC, LOO-CV, and PSIS-LOO. I simply cannot have a Bayesian model with a better raw log score than the frequentist one. That would undermine the whole illustration.

I need to find a way to make the frequentist model perform at least as well as the Bayesian one in terms of unpenalized log score.

I now seem to have figured out one way around this:

> library(rstanarm)
> set.seed(7)
> glmfit2 <- stan_glm(formula(baymod), algorithm = "optimizing", prior = NULL, prior_intercept = NULL, family = binomial, data = anon)
> loo.glmfit2 <- loo(glmfit2)
> loo.glmfit2$elpd_loo + loo.glmfit2$p_loo

[1] -1453.181

Isn’t this equivalent to a frequentist model, and hence a solution? If so, I don’t understand the need for set.seed(). The results vary by seed, sometimes producing (slightly) higher log scores, sometimes slightly lower ones. But isn’t optimization supposed to be fully deterministic, given that no simulation is involved? At least loo() was fully deterministic as of the last time I checked, as long as no refits were needed.

EDIT: Another weirdness is that with this “solution”, sum(dbinom(anon$y, 1, prob = glmfit2$fitted.values, log = T)) returns a different (worse) result than elpd_loo + p_loo. I guess it must be because $fitted.values is based on posterior means (so says the documentation), whereas loo() is based on something else. I presume loo() uses the whole “posterior”, which reduces to the curve of the log-likelihood function in this case. Yes? No?

All comments welcome.

Remember that elpd_loo is an approximation. If you want the exact leave-one-out predictive density of your model, then you need to apply K-fold cross-validation where K is the number of observations.

brms: K-Fold Cross-Validation — kfold.brmsfit • brms

rstanarm: K-fold cross-validation — kfold.stanreg • rstanarm
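A minimal sketch of that suggestion for the brms model above (the rstanarm version is analogous; with K equal to the number of observations this refits the model once per observation, so it is slow):

exact_loo <- kfold(baymod, K = nrow(anon))  # exact leave-one-out CV via n refits
exact_loo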


Yes. I also need to run exact LOO on my frequentist-wannabe rstanarm model. Before seeing your message, I already did so using a simple for-loop and rstanarm::predict(). The function's documentation says it is intended specifically for use with models fit using optimization. Does this mean I can rest assured that the predictions are based exclusively on the log-likelihood modes, as in frequentist inference?

EDIT: Unfortunately it appears that fitted(mymod), mymod$fitted.values, and predict(mymod, type = "response") all return identical results. The summed log score from those fitted values differs from elpd_loo + p_loo, and it is lower than the log score of the Bayesian model. So this approach seems to have failed, unless those three functions are returning predictions based on posterior means rather than modes (ML estimates). I need predictions based on ML estimates, but I am not sure where to find the ML estimates. Neither rstanarm::summary() nor rstanarm::fixef() tells me what its point estimates are based upon. The help file for stanreg objects suggests it's posterior means, though.

A few points.

1. Optimization is not deterministic, though assuming it converges to the correct mode it should always end up within a particular, quite small neighborhood. Numerical methods aren't exact, and the manifestation of the non-exactness depends on the choice of initialization.
2. It is weird to run full Bayes MCMC and then do any model comparison at all based on summarizing the posterior down to a point. If your goal is to generate a point estimate, then don't use MCMC.
3. With all of that said, I agree it's surprising on its face that glm yields a "worse" solution. I now have a new guess for why this might be. Note that the fitted values returned by fitted.brmsfit are NOT the fitted values evaluated at the posterior mean. Rather, they are the means of the fitted values evaluated over the entire posterior. As such, they do not actually represent a solution to the model equations (due to the nonlinear link function). That is, it might be the case that there is no set of fixed parameter values that yields predictions as good as the posterior-mean predictions. I suspect that glm has found the optimal solution (to within some numerical tolerance), and rstanarm's optimizer has probably found the same solution (to within some numerical tolerance), but that simulating fitted values from the posterior and then averaging yields a set of point estimates for the fitted values that is actually better than the maximum likelihood solution (see the toy illustration below).
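To make point 3 concrete, here is a toy illustration (hypothetical numbers, not taken from the model above) of how averaging fitted values over draws differs from evaluating the fitted value at the averaged parameter:

eta <- c(-2, 3)      # two hypothetical posterior draws of a linear predictor for one observation
mean(plogis(eta))    # posterior-mean fitted value: about 0.54
plogis(mean(eta))    # fitted value at the posterior-mean parameter: about 0.62

Neither is wrong, but they are different quantities, which is why the posterior-mean fitted values need not correspond to any single parameter vector.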

Yes. That’s why I used glm() initially. It was supposed to do better than the Bayesian posterior but somehow did worse, hence the existence of this thread. We’re making progress though!

a-HA! Yes, we can now verify that both the glm() and Bayesian point estimates are on point (pun intended):

> post <- as.matrix(posterior_samples(baymod))  # posterior draws
> X <- model.matrix(as.formula(baymod$formula), data = anon)  # design matrix
> point.est <- colMeans(post[, which(startsWith(colnames(post), "b_"))])  # posterior means for coefficients
> sum(dbinom(anon$y, 1, prob = plogis(X %*% point.est), log = TRUE))

[1] -1453.282

> logLik(glm(formula(baymod), family = binomial, data = anon))

'log Lik.' -1453.282 (df=6)
# Hurray!

That's it! It turns out that the raw Bayesian log score is indeed better than the ML solution, and the way to calculate it is given in Equation 3 of Vehtari, Gelman, and Gabry (2017). Here's an ad-hoc way to do so manually in R:
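For reference, that quantity (the computed log pointwise predictive density, lppd) is, for S posterior draws \theta^{(s)} and N observations,

\text{lppd} = \sum_{n=1}^{N} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_n \mid \theta^{(s)})\right),

which is exactly what the nested sapply below computes.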

> lpd <- sum(sapply(1:nrow(X), function(n) log(
    1 / nrow(post) * sum(sapply(1:nrow(post), function(S)
      dbinom(anon$y[n], 1, prob = plogis(sum(post[S, which(startsWith(colnames(post), "b_"))] * X[n, ])))))
  )))

> c(Official = baymod_loo$elpd_loo + baymod_loo$p_loo, Manual = lpd)

 Official    Manual 
-1453.191 -1453.191 

Looks like a wrap, doesn’t it?
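As an aside, the same lppd can be computed without the nested loops. A minimal sketch, reusing post and X from above and assuming the b_ columns of post are in the same order as the columns of X:

b   <- post[, startsWith(colnames(post), "b_")]                               # S x p coefficient draws
p   <- plogis(b %*% t(X))                                                     # S x N per-draw fitted probabilities
lik <- ifelse(matrix(anon$y, nrow(p), ncol(p), byrow = TRUE) == 1, p, 1 - p)  # pointwise Bernoulli likelihoods
sum(log(colMeans(lik)))                                                       # lppd; should match the value above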


There are theoretical arguments, and there is looking at the data. I ran the univariable regressions for each of the four predictors x1 … x4 and discovered that the simple logistic regression y ~ x2 reproduces the issue.
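The univariable refits might look something like this (a sketch; the exact priors used for the brms refit aren't shown here, so I simply reuse the x2 prior from the full model):

glmfit.x2 <- glm(y ~ x2, family = binomial, data = anon)
brmfit.x2 <- brm(y ~ x2, family = bernoulli,
                 prior = prior(normal(0, 0.3), class = b, coef = x2),
                 data = anon, iter = 5000, seed = 2023, cores = 4, backend = "cmdstanr")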

Here is a plot of y against x2 with two fits: brms in red and glm in blue:

The point size indicates the relative number of observations, and solid black points mark single observations. So an alternative explanation is that there are a handful of observations at the upper range of x2 which influence the fit quite a bit. The point with x2 = 51 pulls the glm fit a bit more than the brm fit. Perhaps a property of the "weighted" part of iteratively reweighted least squares?

My limited understanding is that, with the canonical link function, iteratively reweighted least squares (when it doesn't fail badly) should be guaranteed to converge to the maximum likelihood solution, up to numerical tolerance. I'm very open to being corrected if my understanding is deficient!
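One way to check this numerically (a sketch, with glmfit again the full glm fit from earlier): at the MLE with the canonical logit link, the score equations X^\top (y - \hat{p}) = 0 hold, so the cross-products below should be zero up to numerical tolerance.

Xg <- model.matrix(glmfit)
crossprod(Xg, anon$y - fitted(glmfit))  # should all be ~0 if glm is at the MLE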


The glm solution is the IRLS solution, not the MLE. I think your calculation:

sum(dbinom(anon$y, size = 1, prob = fitted(glmfit), log = TRUE))
sum(dbinom(anon$y, size = 1, prob = fitted(brmfit)[, 1], log = TRUE))

shows that there is a solution with higher log-likelihood than the IRLS solution.

I don’t profess to know for sure that the IRLS solution and the MLE are identical, but I thought that they were when using the canonical link.

The demonstration above does not confirm that they aren't identical, because fitted(brmfit)[, 1] yields a set of fitted values that is impossible to achieve with any single choice of parameter estimates. No parameter vector will reproduce these fitted values, because they are averaged over the posterior on the response scale (i.e., after applying the inverse link).


Okay, I see what you mean now. fitted() returns \operatorname{average}(f(\theta^{(i)})), not f(\operatorname{average}(\theta^{(i)})). And when running glm with starting values etastart = fitted(brmfit, scale = "linear") for the linear predictor, IRLS still moves away and finds the same solution as before.
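In code, that check might look like this (a sketch; eta0 and glm_restart are my names, brmfit is assumed to be the full brms fit, and the comparison is against the glmfit object from earlier):

eta0        <- fitted(brmfit, scale = "linear")[, "Estimate"]  # posterior-mean linear predictor
glm_restart <- glm(y ~ x1 + x2 + x3 + x4, family = binomial, data = anon,
                   etastart = eta0)
all.equal(coef(glm_restart), coef(glmfit))                     # IRLS returns to the same solution (up to tolerance)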


That’s a nice plot. How do you get those sample size circles at the top and bottom?

Here is the code to make the plot.

# brms, and the univariable fits brmfit.x2 (brms) and glmfit.x2 (glm) of y ~ x2, are assumed to be in the workspace
library(tidyverse)  # tibble, dplyr, tidyr, ggplot2

# Prediction grid and fitted curves for the univariable model y ~ x2
newdata <- tibble(x2 = seq(min(anon$x2), max(anon$x2), len = 101))
newdata$brm <- fitted(brmfit.x2, newdata = newdata, scale = "response")[, 1]
newdata$glm <- predict(glmfit.x2, newdata = newdata, type = "response")

anon %>%
  ggplot(aes(x2, y)) +
  geom_line(
    aes(color = method),
    data = newdata %>% pivot_longer(c(brm, glm), names_to = "method", values_to = "y"),
    linewidth = 0.5
  ) +
  geom_point(
    # point size reflects the number of observations; solid dots mark single observations
    aes(shape = `n = 1`, size = n),
    data = anon %>% count(x2, y) %>% mutate(`n = 1` = factor(n == 1))
  ) +
  scale_shape_manual(values = c(1, 20)) +
  guides(size = "none", shape = "none")

Many thanks, sir!