Visualizing shrinkage of an infinite MLE in a simple logistic example

I have a simple mock dataset and a logistic model in which two predictor categories exhibit quasi-complete separation. What I want to do is visualize how the resulting infinite MLE is “tamed” by the prior. Below is a minimal example. Even though I use brms, the choice of interface should be irrelevant here.

library(brms)

n <- c(96, 48, 24, 12, 6, 3)                  # group sizes
x <- factor(LETTERS[1:length(n)])             # groups A-F
p <- 0.75
y <- n * c(0.5, p, 1 - p, p, 0, 1)            # successes per group; E (0/6) and F (3/3) are separated
d <- data.frame(y = unlist(sapply(1:length(y), function(i) c(rep(1, y[i]), rep(0, n[i] - y[i])))),
                x = rep(x, times = n))
Bayfit <- brm(y ~ x, family = bernoulli, data = d,
              prior = prior(normal(0, 2.5), class = Intercept) +
                      prior(normal(0, 2.5), class = b),
              seed = 3, warmup = 1000, iter = 3500)
Post <- posterior_samples(Bayfit)

With the model thus fit and the posterior samples stored, now we visualize the infinite MLE, the prior and the posterior:

# likelihood of 0 successes in 6 trials for group E, with plogis(beta_4) as the success probability
curve(dbinom(0, 6, plogis(x)), -10, 10, lwd = 2, col = "blue",
      xlab = expression(beta[4]), ylab = "P")
curve(dnorm(x, 0, 2.5), add = TRUE, col = "red", lwd = 2)   # prior
lines(density(Post$b_xE), lwd = 2)                          # posterior (kernel density of the draws)
legend(4.5, 0.9, legend = c("Prior", "Likelihood", "Posterior"),
       col = c("red", "blue", "black"), lwd = 2, cex = 0.8)

[Figure “shrinkage”: prior (red), likelihood (blue), and posterior (black) for β4]

While I’m confident that this illustration is “directionally correct”, I have an uneasy feeling about the blue (likelihood) curve. AFAIK the likelihood, unlike the prior and the posterior, is NOT a PDF and need not integrate to 1. This seems clear from the graph, where the likelihood has by far the largest area under the curve. How big a problem is this? Is there some simple constant that could or should be applied to the likelihood to make it commensurate with the prior and the posterior? I think the present graph works alright for the purpose of illustrating the action of the prior on the posterior, but it would still suck if it contained some glaring mathematical idiocy. Does it?
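To make the area point concrete, here is a quick numerical check of the areas under the curves as plotted (just a sketch, treating the likelihood as a function of β4 alone, exactly as in the code above):

# areas under the plotted curves on (-10, 10), by numerical integration
integrate(function(b) dbinom(0, 6, plogis(b)), -10, 10)   # likelihood: well above 1
integrate(function(b) dnorm(b, 0, 2.5), -10, 10)          # prior: essentially 1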

Stan works with the log-likelihood. I’m not too familiar with brms, but it could be worth checking whether b_xE is transformed or not.
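For instance, one quick check might be to compare against a plain glm() fit (a rough sketch, using the d and Bayfit objects from above):

coef(glm(y ~ x, family = binomial, data = d))   # MLEs on the logit scale; xE and xF diverge because of the separation
fixef(Bayfit)                                   # brms population-level estimates, for comparison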

AFAIK if I use the log likelihood, then I also need to use the log prior and logs of the densities of the posterior samples. Here’s the only way I can think of doing it:

# log posterior density at each draw, evaluated with density_ratio()
logdens.E <- sapply(Post$b_xE, function(x) log(density_ratio(Post$b_xE, point = x)))
# log-likelihood of 0/6 successes in group E
curve(dbinom(sum(d$y[d$x == "E"]), sum(d$x == "E"), plogis(x), log = TRUE),
      from = min(Post$b_xE), to = max(Post$b_xE), lwd = 2, col = "blue",
      xlab = expression(beta[4]), ylab = "P")
curve(dnorm(x, 0, 2.5, log = TRUE), add = TRUE, col = "red", lwd = 2)   # log prior
points(x = Post$b_xE, y = logdens.E)                                    # log posterior density at the draws
legend(-11.5, -1, legend = c("Prior", "Likelihood", "Posterior"),
       col = c("red", "blue", "black"), lwd = 2, cex = 0.8)

[Figure “shrinkage”: log prior (red), log-likelihood (blue), and log posterior density at the draws (black) for β4]

Is this more mathematically correct than the first version? Indeed the log of the likelihood no longer looks as disproportionately large as the likelihood itself did. Otherwise, however, this plot is much uglier than the previous one, and I don’t know the proper way to get a smooth curve out of the logged densities of the posterior simulations.
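The closest I can get is to log the density() estimate directly instead of evaluating density_ratio() at every draw (a sketch; I’m not sure how trustworthy this is out in the tails):

dens.E <- density(Post$b_xE)                         # kernel density estimate of the posterior draws
keep <- dens.E$y > 0                                 # avoid log(0) far out in the tails
lines(dens.E$x[keep], log(dens.E$y[keep]), lwd = 2)  # smooth log-density curve on the existing plot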

Oh, I mostly thought you might have substituted the log-likelihood for the likelihood in the original version by accident. I think the scale was fine. I do know that it is somewhat common to use a scaled version of the likelihood in prior + data → posterior plots to make things look nicer.

The original version used the likelihood, not its log. The new one has logged versions of all three distributions (I’ve now updated the legend to reflect this). I’m glad if there are no obvious mathematical stupidities present in either plot though.

Yeah, I wish I knew the scaling factor.
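The only candidate I can come up with myself is to normalize the likelihood numerically so that it integrates to 1 over the plotted range, something like this (a rough sketch; I have no idea whether this is the “right” constant):

lik <- function(b) dbinom(0, 6, plogis(b))     # likelihood of 0/6 successes as a function of beta_4
Z <- integrate(lik, -10, 10)$value             # numerical normalizing constant on (-10, 10)
curve(lik(x)/Z, -10, 10, lwd = 2, col = "blue", ylim = c(0, 0.4),   # ylim leaves room for all three curves
      xlab = expression(beta[4]), ylab = "Density")
curve(dnorm(x, 0, 2.5), add = TRUE, col = "red", lwd = 2)
lines(density(Post$b_xE), lwd = 2)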