# Differential shrinkage of identical quantities by the same prior. Why?

Consider these binomial mock data:

ni <- 20; pF = 0.75; pM = 0.90
(Dat <- data.frame(Sex = c("Female","Male"), y = c(ni*pF, ni*pM), ni = c(ni,ni)))

Sex  y ni
1 Female 15 20
2   Male 18 20


When I fit a simple binomial model with N(0,2.5) priors to these data, \beta_1 always shrinks more than the intercept.

> fixef(Mod <- brm(y|trials(ni) ~ Sex, prior = prior(normal(0,2.5)) + prior(normal(0,2.5), class = Intercept), seed = 1, family = binomial, backend = "cmdstanr", threads = threading(Cores, static = T), data = Dat, refresh = 0))

Estimate Est.Error       Q2.5    Q97.5
Intercept 1.157260 0.5220473  0.2029140 2.222686
SexMale   1.062332 0.8848350 -0.6021272 2.831809

> fixef(update(Mod, seed = 2))
Estimate Est.Error       Q2.5    Q97.5
Intercept 1.181649 0.5313671  0.1867951 2.226425
SexMale   1.044568 0.8676930 -0.5673215 2.874940

>  fixef(update(Mod, seed = 3))

Estimate Est.Error       Q2.5    Q97.5
Intercept 1.169516 0.5298502  0.1995098 2.304199
SexMale   1.037915 0.8573108 -0.5802437 2.750812


No matter how many times I refit, the same pattern occurs. It’s not random. Yet the logits should be the same. The Intercept represents a group with a sample success rate of 0.75 = 1.1 on the logit scale, while the “male” group has 0.90, which is 1.1 logits higher than the baseline represented by the intercept, hence should also be estimated at exactly the same amount.

There’s something I’m missing here.

1 Like

For example, if y is 20 and ni is also 20, the likelihood says that the true logit could well be anything between about 2 and infinity. If the true value is 2 million, a N(0, 2.5) prior will be very happy to shrink that down to something near 2 or 3.

Additionally, your question indicates that you aren’t aware that the prior you are placing on the intercept is a prior on the intercept when all columns of the design matrix are held at their means, rather than when they are all held at zero. So the prior you are placing on the intercept is not a prior on Female but rather a prior on the midpoint between Male and Female. If you want to turn off this behavior, then replace your formula with y|trials(ni) ~ 0 + Intercept + Sex. For more, see the brms documentation, particularly the pararagraph heading Parameterization of the population-level intercept on page 44 here: https://cran.r-project.org/web/packages/brms/brms.pdf

5 Likes

Do you mean because the “Male” success rate is closer in logit space to \infty, its uncertainty (and hence shrinkage??) is greater? If so, I don’t see how that could be the reason. If we make Male the reference level (Intercept), the excess shrinkage changes targets:

> fixef(update(Mod, ~. -Sex +I(Sex == "Female")))

Estimate Est.Error       Q2.5     Q97.5
Intercept           2.230663 0.7217941  0.9593482 3.7987675
ISexEQEQFemaleTRUE -1.052406 0.8610467 -2.8698445 0.5362466


^ Consistently, no matter the number of refits.

It says “This leads to a temporary bias in the intercept equal to <X_means, b>, where <,> is the scalar product. The bias is corrected after fitting the model […]” which is a bit too jargony for me. Does it mean the intercept is actually estimated as the grand mean of the dataset, and then after the sampling is done, its posterior is shifted back by simply adding a constant?

And if so, is it equivalent to the following:

> Dat$Sex.C <- as.integer(Dat$Sex == "Male") -0.5
> update(Mod, ~. -1 -Sex +Intercept +Sex.C, newdata = Dat)


…just without shifting Intercept’s posterior back? It doesn’t look entirely so, because the “true centered” Intercept of the resulting model has a slightly smaller SE (always around 0.46) than the “pseudo-intercept” of the original model. For its part, the overshrinkage of the Sex coefficient looks the same regardless of whether I center it.

Yes, and this is the most fundamental issue that needs clarification. At the end of this post I provide a concrete illustration.

You’re right that there are some additional things going on.

The implications of this are difficult to reason through because you haven’t updated to use 0 + Intercept syntax and you therefore aren’t putting a prior on the quantity that you are calling the intercept!

The less jargony and more important part of the documentation is:

be aware that you are effectively defining a prior on the
intercept of the centered design matrix not on the real intercept


and

This behavior can be avoided by using the reserved (and internally generated) variable Intercept.
Instead of y ~ x, you may write y ~ 0 + Intercept + x. This way, priors can be defined on the real
intercept, directly.


The design matrix centering that brms does internally doesn’t affect the likelihood that you are fitting, but it does affect how you specify priors. In particular, it means that the prior that you pass for the “Intercept” is the prior on the response when all predictors are set to their means, rather than when all predictors are set to zero.

With all of that said, even if you used the 0 + Intercept syntax, you would not observe less shrinkage in the margin for the effect of femalethan for the intercept for the reference level male. The thing that should shrink less is the actual posterior expectation for female, which is the sum of the intercept for male and the coefficient for female. These quantities will have a negative correlation in the posterior. Think about it this way: if we’re highly uncertain about the correct prediction for male, but quite certain about the correct prediction for female, then we will necessarily be highly uncertain about the difference between those two categories.

This is a general point and a general challenge when defining priors on factor covariates when the intercept corresponds to a reference level: the prior pushforwards for levels other than the reference level will always be less certain than the prior pushforward for the reference level! To avoid this behavior, one option is to encode the factors in the design matrix with [-1, 1] encoding rather than [0, 1] encoding. Then, assuming symmetrical zero-centered priors, you get identical pushforwards for each level, including the reference level (which gets encoded with -1s in the design matrix rather than 0s).

The concrete example mentioned above:
Let’s take a look using an actual generative simulation for the data. To stabilize the noise in the simulation, we’ll use many more trials. To still observe shrinkage and avoid having the likelihood dominate, we’ll use a much more strongly regularizing prior. Notice that we consistently get more shrinkage in the estimate for males than the estimate for females below, even though in both cases the true values are at the ten-sigma mark in the prior.

ni <- 1000
pF = boot::inv.logit(1)
pM = boot::inv.logit(2)

Dat <- data.frame(
Sex = c("Female","Male"),
y = c(rbinom(1, ni, pF), rbinom(1, ni, pM)),
ni = c(ni,ni)
)
Female <- Dat[1, ]
Male <- Dat[2, ]

out_female <- brm(
y|trials(ni) ~ 0 + Intercept,
prior = prior(normal(0,0.1)),
family = binomial,
backend = "cmdstanr",
data = Female
)
out_female

out_male <- brm(
y|trials(ni) ~ 0 + Intercept,
prior = prior(normal(1,0.1)),
family = binomial,
backend = "cmdstanr",
data = Male
)
out_male


The reason for this difference is because the likelihood, for some fixed sample size, identifies logits near 0 more confidently than logits far from zero.

6 Likes

Thanks a lot Jacob, this is super helpful. I had no idea ML fitting algorithms got cold feet about large coefficients. It is all a lot to chew on.

So basically, the reference level gets an unfair advantage over non-reference levels when it comes to the precision of estimation. Hence if there’s one category we’re more interested in than the others, we might want to make that one the reference level.

Was I on the right track in gathering that the default type of “pseudo-intercept” first estimates y when all x's take their mean value, but is shifted back to representing the “origin” by adding constant a to it once sampling finishes?

Not sure what ML means here; I usually think it means maximum likelihood or maybe machine learning, of which I guess Bayesian inference is a subset. Bayesian fitting gets cold feet about fitting large coefficients due to the regularizing effect of the prior. The point here is that on the logit scale the likelihood function gets increasingly more shallow as the sample gets more imbalanced. You would not see a similar effect with an identity link.

Not precision of estimation, but rather the amount of regularization imposed by the prior.

I wouldn’t say so; we’d just want to make sure that we understand the prior pushforward that we’re getting over that level and ensure that it’s the prior we want.

Yes. It’s a constant conditional on the fitted values of the coefficients, which means that it is a different constant for each iteration of the MCMC.

1 Like

(EDIT: This post has been severely overhauled since its initial appearance due to serious errors detected by the author. If you dismissed it as the ramblings of a village idiot on first reading, please take another look.)

Many thanks for the further clarifications! And sorry about being vague with the ML acronym: I didn’t know it could mean anything other than maximum likelihood.

Thanks for the confirmation!

It seems to me that these two sentences embed two profound yet separate lessons that I’m only beginning to understand. Hence I must belabor further. Back to the original mock data:

> ni <- 20; pF = 0.75; pM = 0.90
> (Dat <- data.frame(Sex = c("Female","Male"), y = c(ni*pF, ni*pM), ni = c(ni,ni)))

Sex  y ni
1 Female 15 20
2   Male 18 20

> summary(glm(y/ni ~ Sex, weights = ni, family = binomial, data = Dat))$coefficients Estimate Std. Error z value Pr(>|z|) (Intercept) 1.098612 0.5163978 2.127454 0.03338242 SexMale 1.098612 0.9067647 1.211574 0.22567560  The sample-based logits of each group are identical relative to their respective baselines: ML estimation uses the modes of the log-likelihood as point estimates, reporting them as identical. Yet the non-intercept is estimated to be more uncertain, even though there’s no prior. In this particular setting, it looks to me like the larger SE of the non-intercept is not due to the logit link or binomial likelihood, but to the fact to which you referred earlier in the context of Bayesian models, i.e. that the Male prediction derives from two coefficients rather than just one. For behold what happens when we model the same two logit values using Ordinary Least Squares with an identity link: > Dat2 <- data.frame(Dat[c(1,1, 2,2),], row.names = NULL) > Dat2$y.new <- with(Dat2, qlogis(y/ni)+c(-0.2, +0.2)); Dat2

Sex  y ni     y.new
1 Female 15 20 0.8986123
2 Female 15 20 1.2986123
3   Male 18 20 1.9972246
4   Male 18 20 2.3972246

> summary(lm(y.new ~ Sex, data = Dat2))$coefficients Estimate Std. Error t value Pr(>|t|) (Intercept) 1.098612 0.2000000 5.493061 0.03157991 SexMale 1.098612 0.2828427 3.884181 0.06034527  The coefficients are exactly the same as in the binomial model, and the SE of the non-intercept remains larger than that of the intercept! If we now put the Intercept back, thus making Female the reference level again, then proceed to fit the binomial model with a Bayesian N(0, \sigma) prior with a small \sigma, we see what appears to be another reflection of the same phenomenon: > fixef(Mod <- brm(y|trials(ni) ~ 0 + Intercept + Sex, prior = prior(normal(0, .25)), family = binomial, backend = "cmdstanr", data = Dat)) Estimate Est.Error Q2.5 Q97.5 Intercept 0.4563496 0.1973907 0.0714488 0.8368387 SexMale 0.2826627 0.2218723 -0.1447267 0.7202204  The non-reference level shrinks more. So, here’s yet another way one might think about it, and I’d like to know if it’s correct: the uncertainty difference occurs because the reference level has a fixed baseline of comparison at 0, whereas the non-reference level has non-fixed baseline of comparison estimated from data. Both are aiming darts at identical bullseyes, but one is standing on solid ground while throwing, the other is standing on a balanceboard. There’s also Jacob’s second point, to the effect that with fixed n, the binomial (log) likelihood is less certain about large logits than small ones. Yes! I believe there’s a parsimonious way to demonstrate this with the same mock data, without even going Bayesian: summary(glm(y/ni ~ 0 + Sex, weights = ni, family = binomial, data = Dat))$coefficients

Estimate Std. Error  z value    Pr(>|z|)
SexFemale 1.098612  0.5163978 2.127454 0.033382417
SexMale   2.197225  0.7453560 2.947886 0.003199549


Even though the Wald  z value and Pr(>|z|) do diagnose the Male effect as “more significant” than the Female one due to its larger magnitude, the important part is the Std. Error, which is almost 50% larger than the Female one even though both are based on the same amount of data and have the same baseline of comparison, i.e. zero. A memorable lesson indeed!

But I’ve run into yet another dilemma! Look what happens when we make our normal prior ultra-diffuse:

fixef(Mod2 <- update(Mod, prior = prior(normal(0, 25))))

Estimate Est.Error       Q2.5    Q97.5
Intercept 1.154554 0.5279791  0.1905524 2.251391
SexMale   1.260148 0.9647825 -0.4800386 3.277488


Over and over across refits, b_SexMale overshoots its MLE by a wider margin than b_Intercept. Whaaat? The greater uncertainty (=larger SE) makes sense in light of everything stated above, but how can the prior be pulling the estimate away from zero rather than towards it? Even with the \sigma = 25 the prior still has its mode and mean at 0, so if anything, that is the value it should pull the estimate toward. What’s going on?