Differential shrinkage of identical quantities by the same prior. Why?

Consider these binomial mock data:

ni <- 20; pF = 0.75; pM = 0.90
(Dat <- data.frame(Sex = c("Female","Male"), y = c(ni*pF, ni*pM), ni = c(ni,ni)))

     Sex  y ni
1 Female 15 20
2   Male 18 20

When I fit a simple binomial model with N(0,2.5) priors to these data, \beta_1 always shrinks more than the intercept.

> fixef(Mod <- brm(
+   y|trials(ni) ~ Sex,
+   prior = prior(normal(0,2.5)) + prior(normal(0,2.5), class = Intercept),
+   seed = 1, family = binomial, backend = "cmdstanr",
+   threads = threading(Cores, static = TRUE),  # Cores = number of threads, set earlier in the session
+   data = Dat, refresh = 0))

          Estimate Est.Error       Q2.5    Q97.5
Intercept 1.157260 0.5220473  0.2029140 2.222686
SexMale   1.062332 0.8848350 -0.6021272 2.831809


> fixef(update(Mod, seed = 2))
          Estimate Est.Error       Q2.5    Q97.5
Intercept 1.181649 0.5313671  0.1867951 2.226425
SexMale   1.044568 0.8676930 -0.5673215 2.874940

>  fixef(update(Mod, seed = 3))

          Estimate Est.Error       Q2.5    Q97.5
Intercept 1.169516 0.5298502  0.1995098 2.304199
SexMale   1.037915 0.8573108 -0.5802437 2.750812

No matter how many times I refit, the same pattern occurs. It's not random. Yet the logits should be the same: the Intercept represents a group with a sample success rate of 0.75, which is about 1.1 on the logit scale, while the "Male" group has 0.90, which is about 1.1 logits higher than the baseline represented by the Intercept, so the two coefficients should be estimated at exactly the same value.
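For reference, here is the quick base R check of the logits I mean:

qlogis(0.75)                 # ~1.10, the Female logit
qlogis(0.90)                 # ~2.20, the Male logit
qlogis(0.90) - qlogis(0.75)  # ~1.10, the Male minus Female contrast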

There's something I'm missing here.


The more fundamental answer to your question is that the likelihood contains more information about one of the quantities than the other.

For example, if y is 20 and ni is also 20, the likelihood says that the true logit could well be anything between about 2 and infinity. If the true value is 2 million, a N(0, 2.5) prior will be very happy to shrink that down to something near 2 or 3.
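Here is a quick base R sketch of what I mean (no brms needed): the binomial log-likelihood for 20 successes out of 20, evaluated at a few logits, barely changes once the logit is large.

ll <- function(logit, y = 20, n = 20) dbinom(y, n, plogis(logit), log = TRUE)
round(ll(c(2, 3, 5, 10, 20)), 3)  # climbs toward 0 and is essentially flat past a logit of about 5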

Additionally, your question suggests you aren't aware that the prior you are placing on the intercept applies to the intercept of the model when all columns of the design matrix are held at their means, rather than when they are all held at zero. So the prior you are placing on the intercept is not a prior on Female but rather a prior on the midpoint between Male and Female. If you want to turn off this behavior, replace your formula with y|trials(ni) ~ 0 + Intercept + Sex. For more, see the brms documentation, particularly the paragraph headed Parameterization of the population-level intercept on page 44 here: https://cran.r-project.org/web/packages/brms/brms.pdf
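If you want to see the centering directly, you can inspect the Stan code that brms generates for your model (just a sketch; exact variable names can differ across brms versions):

# assumes library(brms) is loaded and Dat exists as above; look for the
# centered design matrix (Xc) and the temporary intercept, which is mapped
# back to b_Intercept after sampling
make_stancode(y|trials(ni) ~ Sex, data = Dat, family = binomial,
              prior = prior(normal(0,2.5)) + prior(normal(0,2.5), class = Intercept))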


Do you mean that because the "Male" success rate is closer in logit space to \infty, its uncertainty (and hence shrinkage??) is greater? If so, I don't see how that could be the reason. If we make Male the reference level (Intercept), the excess shrinkage changes targets:

> fixef(update(Mod, ~. -Sex +I(Sex == "Female")))

                    Estimate Est.Error       Q2.5     Q97.5
Intercept           2.230663 0.7217941  0.9593482 3.7987675
ISexEQEQFemaleTRUE -1.052406 0.8610467 -2.8698445 0.5362466

^ Consistently, no matter the number of refits.

It says "This leads to a temporary bias in the intercept equal to <X_means, b>, where <,> is the scalar product. The bias is corrected after fitting the model […]" which is a bit too jargony for me. Does it mean the intercept is actually estimated as the grand mean of the dataset, and then after the sampling is done, its posterior is shifted back by simply adding a constant?

And if so, is it equivalent to the following:

> Dat$Sex.C <- as.integer(Dat$Sex == "Male") -0.5
> update(Mod, ~. -1 -Sex +Intercept +Sex.C, newdata = Dat)

…just without shifting the Intercept's posterior back? It doesn't look entirely so, because the "truly centered" Intercept of the resulting model has a slightly smaller SE (always around 0.46) than the "pseudo-intercept" of the original model. For its part, the overshrinkage of the Sex coefficient looks the same regardless of whether I center it.

Yes, and this is the most fundamental issue that needs clarification. At the end of this post I provide a concrete illustration.

You're right that there are some additional things going on.

The implications of this are difficult to reason through because you haven't updated to use 0 + Intercept syntax and you therefore aren't putting a prior on the quantity that you are calling the intercept!

The less jargony and more important part of the documentation is:

be aware that you are effectively defining a prior on the intercept of the centered design matrix not on the real intercept

and

This behavior can be avoided by using the reserved (and internally generated) variable Intercept. Instead of y ~ x, you may write y ~ 0 + Intercept + x. This way, priors can be defined on the real intercept, directly.

The design matrix centering that brms does internally doesn't affect the likelihood that you are fitting, but it does affect how you specify priors. In particular, it means that the prior that you pass for the "Intercept" is the prior on the response when all predictors are set to their means, rather than when all predictors are set to zero.

With all of that said, even if you used the 0 + Intercept syntax, you would not observe less shrinkage in the margin for the effect of female than in the intercept for the reference level male. The thing that should shrink less is the actual posterior expectation for female, which is the sum of the intercept for male and the coefficient for female. These quantities will have a negative correlation in the posterior. Think about it this way: if we're highly uncertain about the correct prediction for male, but quite certain about the correct prediction for female, then we will necessarily be highly uncertain about the difference between those two categories.
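You can see both pieces of this, the negative correlation and where the certainty actually lives, from the posterior draws of your original fit (a sketch; it assumes the Mod object from your first post, with Female as the reference level and the usual b_* draw names):

draws <- as_draws_df(Mod)
cor(draws$b_Intercept, draws$b_SexMale)   # negative posterior correlation
sd(draws$b_Intercept)                     # uncertainty about the Female logit
sd(draws$b_Intercept + draws$b_SexMale)   # uncertainty about the Male logit
sd(draws$b_SexMale)                       # uncertainty about the Male - Female difference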

This is a general point and a general challenge when defining priors on factor covariates when the intercept corresponds to a reference level: the prior pushforwards for levels other than the reference level will always be less certain than the prior pushforward for the reference level! To avoid this behavior, one option is to encode the factors in the design matrix with [-1, 1] encoding rather than [0, 1] encoding. Then, assuming symmetrical zero-centered priors, you get identical pushforwards for each level, including the reference level (which gets encoded with -1s in the design matrix rather than 0s).
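A prior-only illustration of that last point (plain R, no model fit; the 2.5 matches the prior scale used above):

a <- rnorm(1e5, 0, 2.5)   # intercept draws from the prior
b <- rnorm(1e5, 0, 2.5)   # coefficient draws from the prior
sd(a)                     # pushforward for the reference level under 0/1 coding
sd(a + b)                 # pushforward for the other level under 0/1 coding: wider
sd(a - b); sd(a + b)      # under -1/+1 coding the two levels match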

The concrete example mentioned above:
Let's take a look using an actual generative simulation for the data. To stabilize the noise in the simulation, we'll use many more trials. To still observe shrinkage and avoid having the likelihood dominate, we'll use a much more strongly regularizing prior. Notice that we consistently get more shrinkage in the estimate for males than in the estimate for females below, even though in both cases the true value sits at the ten-sigma mark of its prior.

ni <- 1000                 # many more trials to stabilize the simulation
pF <- boot::inv.logit(1)   # true Female success probability (logit = 1)
pM <- boot::inv.logit(2)   # true Male success probability (logit = 2)

Dat <- data.frame(
  Sex = c("Female","Male"), 
  y = c(rbinom(1, ni, pF), rbinom(1, ni, pM)), 
  ni = c(ni,ni)
  )
Female <- Dat[1, ]
Male <- Dat[2, ]

out_female <- brm(
  y|trials(ni) ~ 0 + Intercept, 
  prior = prior(normal(0, 0.1)),   # true Female logit (1) is 10 prior SDs from the prior mean
  family = binomial, 
  backend = "cmdstanr", 
  data = Female
  )
out_female

out_male <- brm(
  y|trials(ni) ~ 0 + Intercept, 
  prior = prior(normal(1, 0.1)),   # true Male logit (2) is 10 prior SDs from the prior mean
  family = binomial, 
  backend = "cmdstanr", 
  data = Male
  )
out_male

The reason for this difference is that the likelihood, for a fixed sample size, identifies logits near zero more confidently than logits far from zero.
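Here is one way to see that with your original mock data (a base R sketch): the log-likelihood falls off much faster around the Female MLE (logit ~1.1) than around the Male MLE (logit ~2.2), even though both groups have 20 trials.

ll <- function(logit, y, n) dbinom(y, n, plogis(logit), log = TRUE)
ll(qlogis(15/20), 15, 20) - ll(qlogis(15/20) + 1, 15, 20)  # drop one logit above the Female MLE (~1.6)
ll(qlogis(18/20), 18, 20) - ll(qlogis(18/20) + 1, 18, 20)  # much smaller drop for Male (~0.7)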


Thanks a lot Jacob, this is super helpful. I had no idea ML fitting algorithms got cold feet about large coefficients. It is all a lot to chew on.

So basically, the reference level gets an unfair advantage over non-reference levels when it comes to the precision of estimation. Hence if there's one category we're more interested in than the others, we might want to make that one the reference level.

Was I on the right track in gathering that the default kind of "pseudo-intercept" is first estimated as y when all x's take their mean values, and is then shifted back to representing the "origin" by adding a constant a to it once sampling finishes?

Not sure what ML means here; I usually take it to mean maximum likelihood or maybe machine learning, of which I guess Bayesian inference is a subset. Bayesian fitting gets cold feet about large coefficients due to the regularizing effect of the prior. The point here is that on the logit scale the likelihood function gets increasingly shallow as the sample gets more imbalanced. You would not see a similar effect with an identity link.

Not precision of estimation, but rather the amount of regularization imposed by the prior.

I wouldn't say so; we'd just want to make sure that we understand the prior pushforward that we're getting over that level and ensure that it's the prior we want.

Yes. It's a constant conditional on the fitted values of the coefficients, which means that it is a different constant for each iteration of the MCMC.
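You can verify this from the draws of your original Mod (a sketch; it assumes the temporary centered intercept is stored in the draws under the name Intercept, which you can confirm with variables(Mod), and notes that with Sex dummy-coded 0/1 over your two rows the column mean is 0.5):

draws <- as_draws_df(Mod)
# b_Intercept is recovered from the centered intercept draw by draw:
head(draws$b_Intercept - (draws$Intercept - 0.5 * draws$b_SexMale))
# should be numerically zero in every row, i.e. a different shift per draw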


(EDIT: This post has been severely overhauled since its initial appearance due to serious errors detected by the author. If you dismissed it as the ramblings of a village idiot on first reading, please take another look.)

Many thanks for the further clarifications! And sorry about being vague with the ML acronym: I didn't know it could mean anything other than maximum likelihood.

Thanks for the confirmation!

It seems to me that these two sentences embed two profound yet separate lessons that I'm only beginning to understand. Hence I must belabor further. Back to the original mock data:

> ni <- 20; pF = 0.75; pM = 0.90 
> (Dat <- data.frame(Sex = c("Female","Male"), y = c(ni*pF, ni*pM), ni = c(ni,ni)))

     Sex  y ni
1 Female 15 20
2   Male 18 20

> summary(glm(y/ni ~ Sex, weights = ni, family = binomial, data = Dat))$coefficients

            Estimate Std. Error  z value   Pr(>|z|)
(Intercept) 1.098612  0.5163978 2.127454 0.03338242
SexMale     1.098612  0.9067647 1.211574 0.22567560

The sample-based logits of each group are identical relative to their respective baselines: ML estimation uses the modes of the log-likelihood as point estimates and reports them as identical. Yet the non-intercept coefficient is estimated to be more uncertain, even though there's no prior.

In this particular setting, it looks to me like the larger SE of the non-intercept is not due to the logit link or binomial likelihood, but to the fact to which you referred earlier in the context of Bayesian models, i.e. that the Male prediction derives from two coefficients rather than just one. For behold what happens when we model the same two logit values using Ordinary Least Squares with an identity link:

> Dat2 <- data.frame(Dat[c(1,1,  2,2),], row.names = NULL)
> Dat2$y.new <- with(Dat2, qlogis(y/ni)+c(-0.2, +0.2)); Dat2

     Sex  y ni     y.new
1 Female 15 20 0.8986123
2 Female 15 20 1.2986123
3   Male 18 20 1.9972246
4   Male 18 20 2.3972246

> summary(lm(y.new ~ Sex, data = Dat2))$coefficients

            Estimate Std. Error  t value   Pr(>|t|)
(Intercept) 1.098612  0.2000000 5.493061 0.03157991
SexMale     1.098612  0.2828427 3.884181 0.06034527

The coefficients are exactly the same as in the binomial model, and the SE of the non-intercept remains larger than that of the intercept!

If we now put the Intercept back, thus making Female the reference level again, then proceed to fit the binomial model with a Bayesian N(0, \sigma) prior with a small \sigma, we see what appears to be another reflection of the same phenomenon:

> fixef(Mod <- brm(y|trials(ni) ~ 0 + Intercept + Sex, prior = prior(normal(0, .25)), family = binomial, backend = "cmdstanr", data = Dat))

           Estimate Est.Error       Q2.5     Q97.5
Intercept 0.4563496 0.1973907  0.0714488 0.8368387
SexMale   0.2826627 0.2218723 -0.1447267 0.7202204

The non-reference level shrinks more. So here's yet another way one might think about it, and I'd like to know if it's correct: the uncertainty difference occurs because the reference level has a fixed baseline of comparison at 0, whereas the non-reference level has a non-fixed baseline of comparison estimated from the data. Both are aiming darts at identical bullseyes, but one is standing on solid ground while throwing, and the other is standing on a balance board.
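The OLS fit above even shows this arithmetically: each group mean has the same SE of 0.2, but the SexMale coefficient is a difference of two estimated means, so its SE combines the two.

sqrt(0.2^2 + 0.2^2)  # ~0.283, matching the SexMale Std. Error from the lm() fit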

There's also Jacob's second point, to the effect that with fixed n, the binomial (log) likelihood is less certain about large logits than small ones. Yes! I believe there's a parsimonious way to demonstrate this with the same mock data, without even going Bayesian:

summary(glm(y/ni ~ 0 + Sex, weights = ni, family = binomial, data = Dat))$coefficients

          Estimate Std. Error  z value    Pr(>|z|)
SexFemale 1.098612  0.5163978 2.127454 0.033382417
SexMale   2.197225  0.7453560 2.947886 0.003199549

Even though the Wald z value and Pr(>|z|) do diagnose the Male effect as "more significant" than the Female one due to its larger magnitude, the important part is the Std. Error, which is almost 50% larger for Male than for Female even though both estimates are based on the same amount of data and have the same baseline of comparison, i.e. zero. A memorable lesson indeed!
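In fact those standard errors match the usual Wald approximation for a binomial logit, SE ≈ 1/sqrt(n p (1 - p)), which blows up as p moves away from 0.5:

1/sqrt(20 * 0.75 * 0.25)  # ~0.516, the Female SE
1/sqrt(20 * 0.90 * 0.10)  # ~0.745, the Male SE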

But I've run into yet another dilemma! Look what happens when we make our normal prior ultra-diffuse:

fixef(Mod2 <- update(Mod, prior = prior(normal(0, 25))))

          Estimate Est.Error       Q2.5    Q97.5
Intercept 1.154554 0.5279791  0.1905524 2.251391
SexMale   1.260148 0.9647825 -0.4800386 3.277488

Over and over across refits, b_SexMale overshoots its MLE by a wider margin than b_Intercept. Whaaat? The greater uncertainty (= larger SE) makes sense in light of everything stated above, but how can the prior be pulling the estimate away from zero rather than towards it? Even with \sigma = 25, the prior still has its mode and mean at 0, so if anything, that is the value it should pull the estimate toward. What's going on?