How to scale your priors by number of predictors in logistic regression

How would one go about scaling priors by the number of predictors in logistic regression? For linear regression, @andrewgelman, Hill, and @avehtari (2020: 208) suggest scaling your predictors and then giving each coefficient a prior SD of

\sqrt{R^2_*/p}\times\text{sd}(y),

where R^2_* is your best guess at the maximum R^2 achievable with these predictors and p is the number of predictors. They also say to give \sigma an exponential prior with mean \sqrt{1-R^2_*}\times \text{sd}(y).
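In code, my understanding of that heuristic for the linear-regression case is something like the following (the data frame d, its column names, and R2_guess are placeholders I made up, not anything from the book):

p        <- 5                                        # number of predictors
R2_guess <- 0.5                                      # best guess at the maximum achievable R^2
X        <- scale(d[, paste0("x", 1:p)])             # standardize the predictors first
coef_prior_sd   <- sqrt(R2_guess / p) * sd(d$y)      # prior SD for each coefficient
sigma_prior_rate <- 1 / (sqrt(1 - R2_guess) * sd(d$y))  # exponential rate giving the stated prior mean for sigma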

But what formulas would I use in (multilevel) logistic regression with a Bernoulli outcome? I don’t know what to plug in for sd(y) when there’s a link function in between. Supposedly the standard logistic distribution has a variance of \pi^2/3, but I doubt that’s the right quantity.

I’m also unsure which pseudo-R^2 metric to base the speculative maximum on, given that many different alternatives exist (McFadden, Cox & Snell, Nagelkerke, Nakagawa & Schielzeth…). To be honest, I’d rather base the educated guess on something like the ROC AUC, since I have a better sense of what values to expect (around 0.90).

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge: Cambridge University Press.


I wrote up a poor man’s version of this type of prior for a beta and binomial combination on this forum.

The key part is the prior on the coefficients of the centered and scaled predictors, on the logit scale.

\gamma \sim normal(0, \sigma_{\gamma}/\sqrt{N_{\gamma}})

It’s nothing fancy, but by doing that the total sd from the predictors (sd(X\gamma)) gets a prior of \sigma_{\gamma}, which you can set according to what is appropriate (or maybe even give it its own prior). I am not sure how to relate it directly to ROC AUC.

I thought of \sigma_{\gamma} as an indicator of how sure you expect the model to be when the predictors strongly point to one of the two outcomes for an observation. Maybe that observation is 3\sigma_{\gamma} away from the mean on the logit scale. If \sigma_{\gamma} = 1, this implies a probability of inv\_logit(3) \approx 0.95. If \sigma_{\gamma} = 2, this implies a probability of inv\_logit(6) \approx 0.997. Or you could think about how well you expect the model to discriminate between an observation that is \sigma_{\gamma} above the mean and one that is \sigma_{\gamma} below it.
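A quick prior-predictive sketch of how the pieces fit together (all of the numbers below are made up for illustration):

set.seed(1)
p       <- 10                              # number of predictors
sigma_g <- 1                               # prior sd for the total predictor effect
X     <- matrix(rnorm(500 * p), 500, p)    # simulated centered, scaled predictors
gamma <- rnorm(p, 0, sigma_g / sqrt(p))    # coefficient prior scaled by 1/sqrt(p)
sd(X %*% gamma)                            # roughly sigma_g
plogis(3 * sigma_g)                        # probability for an observation 3 sd out, ~0.95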


Thanks. Have you used this approach in a published work that I could cite?

One line of thinking that can be clarifying is to consider a binomial regression with an observation-level random effect, which is one way to model overdispersion in binomial settings (note that this doesn’t work when the response is Bernoulli, i.e. binomial with one trial, but the intuition it builds might still be useful).

Treating this model as a generative model, note that what we have is precisely a linear regression (complete with homoskedastic Gaussian residuals) giving the true logit-probabilities, and then binomial sampling at a lower level of the hierarchy which by assumption cannot be explained by any conceivable predictors.
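In symbols (with n_i trials in row i), the generative story is roughly

z_i \sim \text{normal}(x_i^\top\beta,\ \sigma), \qquad y_i \sim \text{binomial}(n_i,\ \text{logit}^{-1}(z_i)),

so the top layer is an ordinary linear regression for the true logit-probabilities and \sigma plays the role of the residual standard deviation.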

In this setting, you can reason about the embedded regression piece as you would reason about any other linear regression. Note that you don’t know the standard deviation of the true logit-probabilities, but we’re setting priors here, so for any prior guess about the variance in the true logit-probabilities you could derive the corresponding coefficient prior.


…or whenever at least one of the binomial observations has only successes or only failures? It turns out I do find the analogy useful, but my mind boggles at how the overdispersion could be quantified if there’s any (quasi-)separation in the data.

No! If I had to guess, you are confusing the sample proportions with the underlying fitted probabilities. If you have 5 failures in 5 trials, that is entirely consistent with a success probability of 0.5^5 \approx 0.03, whose logit is roughly -3.4. The fitted probability has no need to go much lower than that, and the regularization stops it from happening.

library(brms)

set.seed(1)
# True logits: mean 0.5, sd 1 across observations
probs <- boot::inv.logit(rnorm(1000, 0.5))
# 5 binomial trials per observation
counts <- rbinom(1000, 5, probs)
summary(counts)
df <- data.frame(y = counts, n = 5, id = 1:1000)
# Intercept plus an observation-level random effect to absorb the overdispersion
mod <- brm(
  y | trials(n) ~ 1 + (1 | id),
  data = df,
  family = "binomial",
  backend = "cmdstanr"
)
summary(mod)

If there’s actually complete separation in the data, then this doesn’t work without strong prior regularization, but then again neither does standard logistic regression.


Thanks for the illustration. This is something I’ve wondered about before but never really understood. I guess it’s time I did.

But doesn’t the fitting process work with logits rather than probabilities? The maximum-likelihood logit for a group with 0/5 successes is -\infty. How do you shrink infinity? Clearly the model somehow does, but I don’t understand how this is accomplished.

We don’t start by going to infinity and subsequently shrinking it. Consider what happens if we fix the random effect sd (so it’s just a regularizing prior) and think about where the best posterior estimate should be. Pulling the effect towards negative infinity incurs a huge penalty due to the prior, but provides very rapidly diminishing returns via the likelihood. Once the probability is small enough that we wouldn’t expect any successes, there’s not a strong benefit to making it smaller still.
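Here’s a throwaway numerical sketch of that trade-off for a single group with 0/5 successes, under a fixed normal(0, 1) prior on its logit (the prior sd is an arbitrary choice for illustration):

theta <- seq(-10, 2, by = 0.01)                                 # candidate logits for the group
log_lik   <- 5 * plogis(theta, lower.tail = FALSE, log.p = TRUE)  # log (1 - p)^5 for 0/5 successes
log_prior <- dnorm(theta, 0, 1, log = TRUE)                     # fixed regularizing prior
log_post  <- log_lik + log_prior
theta[which.max(log_post)]                                      # a finite value around -1.2, not -Inf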


Yep, I get this part, i.e. how a pre-specified Gaussian prior shrinks infinite logits. But I don’t get how it works with an adaptive prior that learns the SD from the data. I’d have assumed you first have to use the unshrunken MLEs to learn the SD of the assumed Gaussian distribution, and THEN you proceed to shrink the logits the way you described above. But when some of the raw unshrunken MLEs are infinite, how can you infer an SD for the adaptive prior?

You don’t. You estimate all the parameters jointly: the random-effect mean, the random-effect variance, and the random-effect vector (along with any other parameters in your model).

Edit: so just to push this all the way through:

Now consider what happens if we let the random-effect sd vary. Pulling individual elements of the vector towards negative infinity while pulling the random-effect sd towards infinity drives the random effects’ contribution to the target (their log prior density) towards negative infinity. This is a huge penalty, and there is no commensurate benefit from the likelihood that could offset it.
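To spell out the penalty: the joint posterior factorizes as

p(u, \sigma \mid y) \propto p(y \mid u)\, p(u \mid \sigma)\, p(\sigma), \qquad p(u \mid \sigma) = \prod_{j=1}^{J} \text{normal}(u_j \mid 0, \sigma),

and each normal density is bounded above by 1/(\sigma\sqrt{2\pi}), so inflating \sigma drives p(u \mid \sigma) down like \sigma^{-J}, while the likelihood p(y \mid u) is bounded above and cannot compensate.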


Our paper on priors for logistic regression coefficients is here: http://www.stat.columbia.edu/~gelman/research/published/priors11.pdf
But we’ve changed our attitudes a bit. In that paper, we recommended default Cauchy priors for logistic regression coefficients. Now we’re inclined to default to normal(0,1).

The more coefs you have, the more damage you’ll be doing by using weak priors.


Unfortunately, no. It’s more that it made sense to me, and I think the benefits are easy to explain without citations. It is similar in spirit to the R^2 prior: it’s a prior on the overall model first, as opposed to the individual predictors.

Very interesting. My sources describe the GLMM as a “two-stage model”, where you first fit standard GLMs to each group (using the random effects, presumably unshrunken, as “known offsets”), and only then do you start maximizing an overall LL function, i.e. a sum of the group-wise LLs each of which is averaged over the estimated N(0, \sigma^2) distribution of the random effects. Those sources are admittedly frequentist, so perhaps that is to blame for my confusion.

What is the penalty? There are obviously diminishing returns to pulling individual elements toward -\infty or \infty, but the likelihood keeps increasing nonetheless. Do you mean a penalty imposed by the prior distribution of the group-level SD, which presumably has a thin tail that penalizes large values into oblivion?

If that’s the penalty though, shouldn’t it follow that if we give the group-level SD an extremely diffuse prior such as N(0, 1000)^+, the random effects will undergo virtually no shrinkage, and the model becomes almost the same as a no-pooling model, with virtually infinite elements in the random-effects vector? Yet this is not what I’m seeing (continuing with your example data):

> summary(mod2 <- update(mod, prior = prior(normal(0, 1000), class = sd), cores = 4))
...
Group-Level Effects: 
~id (Number of levels: 1000) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.04      0.06     0.93     1.15 1.00     1426     2072

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.48      0.05     0.39     0.58 1.00     2082     2508

The results barely change! This is despite the fact that the 1000-strong dataset has over 200 groups with all successes or all failures. I would expect this to inflate the SD estimate to the hundreds at least (as happens when you call sd(c(-1,1,-1,1,-1,1,-1,1,-1000,1000))), but no. Even with a virtually flat prior on the SD, the model somehow still shrinks the random effects. I hence remain confused.

I’m no genius of frequentist methods, but I believe these sources are either misunderstood, incorrect, or perhaps describing part of an iterative scheme that is applicable in LMM but not in all GLMM contexts.

The penalty is that a short, wide normal distribution places less probability density over its quantiles than a tall, skinny one. It’s the same reason a normal distribution that fits real data well has a higher likelihood than the same distribution with a much larger variance.

Consider the case of a linear mixed model. Why not just estimate the standard deviation of the random-effect distribution to be arbitrarily large, so that it imposes no shrinkage and yields a higher likelihood? Or, for that matter, why not just estimate the residual standard deviation to be arbitrarily large? Normal distributions with too-large standard deviations fit poorly (i.e. have lower likelihood) because they place lower probability density over their quantiles. If we have two datasets of equal sample size, each well fit by a normal distribution, and one has substantially smaller variance than the other, then the one with the smaller variance also has a substantially higher likelihood.
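A two-line illustration of the point, with simulated data and arbitrary numbers:

set.seed(1)
x <- rnorm(100, mean = 0, sd = 1)    # data well described by a normal(0, 1)
sum(dnorm(x, 0, 1, log = TRUE))      # log-likelihood under the well-matched sd
sum(dnorm(x, 0, 100, log = TRUE))    # far lower: the density is spread too thin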


For what it’s worth, my overall advice to you is to (mentally) throw out your frequentist sources. They seem to do for you what frequentist sources usually do for students, which is to confuse them about the distinction between the statistical model being fit and the methods being used to fit the model or to make inference from the model. In the same way that there’s no F-distribution in the generative model or the likelihood function underlying an anova (and in my experience frequentist texts confuse students about this!), there is no two-step or iterative component to the likelihood function underlying a GLMM. For whatever reason, Bayesian sources seem to help students keep a much crisper conceptual distinction between the model and the methods used for fitting/inference.

The key intuition here, which Stan makes quite explicit, is that there is a function of the data and the parameters that we call the target, and the goal of model fitting is to infer the shape of the target. In the GLMM context, the target gets lower, quite rapidly, if we badly overestimate the random effect standard deviation, due to the “tall skinny” versus “short wide” distinction that I draw in my previous post.


Great points! I rarely use Gaussian models, so it didn’t occur to me that that dispersion business might affect likelihood. I’m only used to thinking about logits or probabilities, for which the “dispersion” is unobserved or depends on E(y). But with the observation-level random effect for the binomial data, it’s as if you added a residual standard deviation parameter to the model, and by definition ML methods try to keep it as small as possible. Hopefully that’s correct, finally.

This would also clear up the confusion about the “two stages”. Frequentist GLMMs are fit by maximizing a “marginal” likelihood, i.e. a likelihood that’s averaged over the random-effects distribution that is estimated at the first stage. To maximize that, you want the distribution to be as concentrated as possible, which is an incentive to pull random effects towards zero. It seems that Bayesian GLMMs differ in that they rather work with a “conditional” likelihood throughout, i.e. one which plugs in a vector of regularized random effects. Each iteration draws a different \sigma and vector of regularized random effects, and eventually you have a distribution with a mean and a mode. No need for two stages then, I guess.
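Writing it down so someone can correct me: the frequentist target would be the marginal likelihood

L(\beta, \sigma) = \prod_j \int \Big[\prod_{i \in j} \text{binomial}\big(y_{ij} \mid n_{ij},\ \text{logit}^{-1}(x_{ij}^\top\beta + u_j)\big)\Big]\ \text{normal}(u_j \mid 0, \sigma)\, du_j,

whereas the Bayesian machinery samples the joint posterior p(\beta, \sigma, u \mid y) \propto p(y \mid \beta, u)\, p(u \mid \sigma)\, p(\beta)\, p(\sigma), with the u_j kept as explicit parameters rather than integrated out.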

Thanks for the update. I knew about that paper but was hoping the ideas would have become obsolete (i.e. been superseded by something simpler) in 15 years. While it’s easy to appreciate the principles expressed there, I find those techniques mighty confusing when it comes to interactions. I mean, take a simple experimental setting involving an Intervention which improves the success rate of men but not women:

> n = 200
> dat <- data.frame(
  Intervention = c(0,1,0,1), SexF = c(0,0,1,1), y = n*c(0.1, 0.50, 0.25, 0.25), n)
> Mod <- brm(y|trials(n) ~ Intervention * SexF, prior = prior(cauchy(0, 2.5)), family = binomial,  data = dat)
# Using Cauchy priors as instructed
> summary(Mod)

Population-Level Effects: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            -2.14      0.23    -2.61    -1.72 1.00     1385     1785
Intervention          2.13      0.27     1.62     2.66 1.00     1319     1806
SexF                  1.02      0.28     0.50     1.58 1.00     1429     1904
Intervention:SexF    -2.10      0.35    -2.79    -1.43 1.00     1306     1958

^ With standard dummy coding everything makes sense. The coefficients reflect what we see in the data, i.e. that men have a low base rate but benefit from Intervention, while women have a higher base rate but do not benefit. But then we center our dummies as instructed:

> dat2 <- with(dat, data.frame(Intervention = Intervention-0.5, SexF = SexF-0.5, y, n))
> Mod2 <- update(Mod, newdata = dat2)
> summary(Mod2)
...
 Population-Level Effects: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            -1.10      0.09    -1.28    -0.93 1.00     3909     3131
Intervention          1.10      0.18     0.76     1.45 1.00     3372     2603
SexF                 -0.01      0.18    -0.36     0.34 1.00     2937     2729
Intervention:SexF    -2.16      0.36    -2.91    -1.46 1.00     3019     2913

^ I guess there’s a benefit in terms of smaller standard errors, but the coefficients are now much harder to make sense of. For me, this is a significant deterrent against the approach proposed by the paper. Perhaps I’m just uniquely slow in the head.

I may be entirely off-base here, but part of your confusion may arise from the way that brms encodes these models in Stan, at least with respect to your toy example of a sex * intervention interaction. Under the hood, brms centers the predictors prior to fitting the model, such that the Intercept represents the mean across all observations. The results reported by summary() reflect the parameter estimates after they are back-transformed to the original, un-centered scale. As you can probably appreciate, double centering the parameters can lead to some unusual (and in this case, undesirable) behavior!

From the brmsformula() documentation:

Parameterization of the population-level intercept
By default, the population-level intercept (if incorporated) is estimated separately and not as part of the population-level parameter vector b. As a result, priors on the intercept also have to be specified separately. Furthermore, to increase sampling efficiency, the population-level design matrix X is centered around its column means X_means if the intercept is incorporated. This leads to a temporary bias in the intercept equal to <X_means, b>, where <,> is the scalar product. The bias is corrected after fitting the model, but be aware that you are effectively defining a prior on the intercept of the centered design matrix, not on the real intercept. You can turn off this special handling of the intercept by setting argument center to FALSE. For more details on setting priors on population-level intercepts, see set_prior.
This behavior can be avoided by using the reserved (and internally generated) variable Intercept. Instead of y ~ x, you may write y ~ 0 + Intercept + x. This way, priors can be defined on the real intercept, directly. In addition, the intercept is just treated as an ordinary population-level effect and thus priors defined on b will also apply to it. Note that this parameterization may be less efficient than the default parameterization discussed above.

My suspicion is that if you revise your toy examples to be:

Mod <- brm(y | trials(n) ~ 0 + Intercept + Intervention * SexF, prior = prior(cauchy(0, 2.5)), family = binomial, data = dat)
Mod2 <- update(Mod, newdata = dat2)

the output of summary will be more directly interpretable and consistent with your expectations. I don’t have time to do this myself just now, though.

Hopefully, this is mildly helpful in clarifying this particular issue, though…


In a model with interactions, the interpretation of each main effect depends on the coding of the other. In the model y ~ A + B + A:B, the coefficient for A corresponds to the effect of A when B = 0, and the coefficient for B corresponds to the effect of B when A = 0. Whether it makes sense to center the predictors depends on your subject-matter model. If the predictor is centered, then you’re estimating an average effect over the population. If you set male to be the baseline case, then the intervention coefficient corresponds to the effect when SexF = 0, i.e. among men. It depends what question you want to ask. In any case, you can compute anything you want from the posterior simulations.
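For example, a rough sketch of pulling the men-specific intervention effect out of the centered model’s draws (the parameter names assume brms defaults, so check them against your fitted object first):

library(posterior)
draws <- as_draws_df(Mod2)                                   # posterior draws from the centered model
# With centered dummies, men sit at SexF = -0.5, so their intervention effect (log-odds) is
effect_men <- draws$b_Intervention - 0.5 * draws$`b_Intervention:SexF`
quantile(effect_men, c(0.025, 0.5, 0.975))                   # roughly matches the dummy-coded estimate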

You’re not uniquely slow in the head. This sort of thing confuses many people, which is a reason we wrote Regression and Other Stories.

The specific issue of interaction coding also happens to be discussed in today’s blog post: “You need 16 times the sample size to estimate an interaction than to estimate a main effect,” explained | Statistical Modeling, Causal Inference, and Social Science


That’s a zillion times tighter than the paper’s recommendation of cauchy(0, 2.5). Then again, the paper is about weakly informative priors. Given the quoted statement, is it accurate to say that since 2008, Gelman et al. have begun to favor “informative” over “weakly informative” priors in logistic regression?

The question is also relevant to this thread. Hopefully the writer of that “not robust” comment will eventually paraphrase their idea in less ambiguous terms.