Why center a dummy variable in Bayesian regression?

blokeman · June 13, 2019, 12:49pm

The usual coding of a dummy variable, with one group chosen as the “control group” (x = 0), and the other representing the effect of the “treatment” or “group membership” (x = 1) has the interpretive advantage that the Intercept represents the fitted value for an observation in the control group (with other explanatory variables, if any, also at 0). However, today I ran into the recommendation, issued by Agresti (2018: 142), that dummies should be centered in Bayesian regression:

Instead of the usual (0, 1) coding for the indicator variable x1, we let it take values −0.5 and 0.5. The prior distribution is then symmetric in the sense that the logits for each neovasculation group have the same prior variability as well as the same prior means, yet β1 still has the usual interpretation of a conditional log odds ratio.

The way I see it, such centering destroys the simple and intuitive interpretation of the Intercept as the fitted value for a “typical” data point (here the Control Group). There must be some countervailing advantage. Therefore, I’m trying to translate the above statement into something a non-statistician can easily understand.

Say we’re modeling a binary outcome, and there’s a Control Group (x = 0) and a Treatment Group (x = 1), with β_1 for x having a fairly tight prior, say normal(0, 1). And say there’s an Intercept, α. With x not centered, the fitted logit for the Control Group will be α, whilst the fitted logit for the Treatment Group will be α + β_1x. Thus, given that it depends on two regression parameters, the Treatment Group’s fitted logit undergoes more shrinkage than that of the Control Group, which only depends on one regression parameter. By contrast, in a model with x represented as dichotomous between -0.5 and 0.5, the respective fitted logits will be α - 0.5β_1x for the Control Group and α + 0.5β_1x for the Treatment Group, both undergoing the exact same amount of shrinkage (one and a half parameters, as it were). Is this the point that the author is trying to make?
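One way to make the comparison concrete is to write out the prior variance of each group's logit. The sketch below assumes independent normal priors on the intercept and the slope; the independence assumption and the symbols \sigma_\alpha, \sigma_\beta, \eta_C and \eta_T are introduced here for illustration and are not Agresti's notation.

```latex
% Assumed priors: \alpha \sim N(0, \sigma_\alpha^2), \beta_1 \sim N(0, \sigma_\beta^2), independent.
% \eta_C and \eta_T denote the logits of the control and treatment groups.
\begin{align*}
\text{(0, 1) coding:} \quad
  & \eta_C = \alpha, \qquad \eta_T = \alpha + \beta_1, \\
  & \operatorname{Var}(\eta_C) = \sigma_\alpha^2, \qquad
    \operatorname{Var}(\eta_T) = \sigma_\alpha^2 + \sigma_\beta^2, \\[4pt]
\text{($-0.5$, $0.5$) coding:} \quad
  & \eta_C = \alpha - \tfrac{1}{2}\beta_1, \qquad
    \eta_T = \alpha + \tfrac{1}{2}\beta_1, \\
  & \operatorname{Var}(\eta_C) = \operatorname{Var}(\eta_T)
    = \sigma_\alpha^2 + \tfrac{1}{4}\sigma_\beta^2, \\[4pt]
\text{either coding:} \quad
  & \eta_T - \eta_C = \beta_1 .
\end{align*}
```

Under these assumptions the (-0.5, 0.5) coding gives both logits the same prior mean and the same prior variance, whereas under (0, 1) coding the prior on the treatment logit is the more diffuse of the two. The difference between the logits is \beta_1 in both cases.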

Agresti, Alan. 2018. Introduction to Categorical Data Analysis. 3rd ed. John Wiley & Sons.

Max_Mantei · June 13, 2019, 3:29pm

Hm… I think the question is what you are centering on. In the simple example you give, centering will change the meaning of \alpha (and \beta). If you “center on” the control group (x = 0), then \alpha is the estimate for the control group (obviously). \beta is just the estimated difference of the treatment group from the control. You could also “center on” the treatment group (flip the dummy variable), and then \alpha is the estimate for the treatment group and \beta the difference to the control group. Either way, you should change your priors about \alpha (!) and \beta accordingly, and then it is basically the same model.

If you center x so that control group x^* = -0.5 and treatment group x^* = 0.5, then the estimate for the control group is \alpha + \beta(-0.5) and for the treatment group it is \alpha + \beta(0.5) [you made a small mistake there, I think]. Since you can never observe x^* = 0 in this situation, \alpha seemingly has lost its nice interpretation. Unless \bar x = 0.5, implying that the control and treatment groups are equally sized. Then \alpha is the estimate for the average of the whole sample (ignoring groups). This means that your prior on \alpha should be different. Note that \beta still gives you the difference between the control and treatment groups (your prior on this shouldn’t change).
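The relation between the intercepts under the different codings can be written out directly. This is a sketch with symbols introduced here for illustration: \alpha_{01} for the intercept under (0, 1) coding, \alpha_{\pm} under (-0.5, 0.5) coding, \alpha_{\bar x} under mean centering, and \eta_C, \eta_T for the two group predictions. It describes the linear predictor itself; posteriors will only agree across codings if the priors are translated accordingly.

```latex
\begin{align*}
\alpha_{01} + \beta x
  &= \bigl(\alpha_{01} + \tfrac{1}{2}\beta\bigr) + \beta\,\bigl(x - \tfrac{1}{2}\bigr)
  && \Rightarrow \quad
  \alpha_{\pm} = \alpha_{01} + \tfrac{1}{2}\beta = \tfrac{1}{2}\,(\eta_C + \eta_T), \\
\alpha_{01} + \beta x
  &= \bigl(\alpha_{01} + \beta \bar x\bigr) + \beta\,(x - \bar x)
  && \Rightarrow \quad
  \alpha_{\bar x} = \alpha_{01} + \beta \bar x = (1 - \bar x)\,\eta_C + \bar x\,\eta_T .
\end{align*}
```

So \alpha_{\pm} is always the unweighted midpoint of the two groups, \alpha_{\bar x} is their size-weighted average, and the two coincide exactly when \bar x = 0.5. The slope \beta is the same in all three forms.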

As I see it, this comes down to preference and the application. If you are cautious (esp. with the priors), you can extract the same information from either specification. It may well be that you have a more informed guess (prior) on some quantities (the sample average outcome for example, or the average outcome in the control group, etc.) so that it is more reasonable to choose a certain parameterization. Does this make sense?

blokeman · June 13, 2019, 6:03pm

Thanks Max. Your second paragraph seems to confirm what I was getting at. Essentially, re-coding the Treatment dummy as -0.5 (for the control group) and 0.5 (for the treatment group) has the consequence that α now represents the “midway point” between those two groups. β, as you state, still describes the difference between the two groups. And as you note, x = 0 is never observed, so α loses its simple interpretation. But the crucial advantage gained from this is that now, with α representing the midway point between the groups, the contribution of β is symmetrical around zero: Group 1 is described by its negative half (α - 0.5β) and Group 2 by its positive half (α + 0.5β), relative to the midway point. As a consequence, the fitted values for both groups are equally affected by the prior(s). This, I think, is what Agresti means by “symmetric”. Presumably in medical contexts, estimates of absolute disease risk for different groups can be every bit as important as inference about effects alone, which is why such absolute equality of shrinkage could be important.

Max_Mantei · June 13, 2019, 8:28pm

I just saw that there was a typo in my post: In the second paragraph I meant to write

Unless, \bar x = 0.5 (!), implying the number of people in control and treatment group are equal.

You can also think of a case, where you only have 20% of the individuals in the treatment group so that \bar x = 0.2. Centering then implies x^*_{\text{control}}=0-0.2=-0.2 and x^*_{\text{treatment}}=1-0.2=0.8. I guess this situation is unlikely in the medical context, but I’m not sure about this.
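As a small illustration of the 20% case, the following base-R sketch mean-centers a made-up indicator with 2 treated units out of 10. The vector x and the unit-variance independent priors used in the last step are assumptions for illustration only.

```r
# Hypothetical indicator: 8 control units (0) and 2 treated units (1).
x <- c(rep(0, 8), rep(1, 2))

# Mean centering subtracts the sample share of treated units (0.2 here).
x_star <- x - mean(x)
print(unique(x_star))  # control value first, then treatment value

# Assumption: independent priors with variance 1 on intercept and slope.
# The prior variance of each group's linear predictor is then 1 + x_star^2.
prior_var <- 1 + unique(x_star)^2
names(prior_var) <- c("control", "treatment")
print(prior_var)
```

Under those assumed priors, mean centering with unequal groups does not equalize the prior variances of the two groups, whereas the fixed (-0.5, 0.5) coding does so regardless of group sizes.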

I don’t understand what you mean by

[…] which is why such absolute equality of shrinkage could be important.

In a Bayesian setting shrinkage is done via the prior, and you’re free to specify a strong prior on any coefficient (including \alpha). I might be missing something here?

blokeman · June 14, 2019, 6:16am

I think this would center x around its mean, resulting in an Intercept that represents the overall sample mean (as your previous post says). In the example at hand, however, x is being centered around zero, and I’m trying to figure out why.

In the case at hand, only one prior is used: one that applies to all population-level parameters (the \beta's). It was misleading of me to denote the Intercept by \alpha (as if it were treated separately), when in fact it is conceptualized simply as \beta_0. Now, given that the same prior applies to all parameters, centering x around zero ensures that E(y)_{control} = \beta_0 + \beta_1(-0.5) and E(y)_{treatment} = \beta_0 + \beta_1(0.5), i.e. the fitted values of both groups are subject to exactly the same degree of shrinkage, which is presumably desirable when modeling something like the cancer risk of different groups of people.
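A quick prior simulation can illustrate this. The sketch below is not taken from Agresti; it assumes independent normal(0, 1) priors on both \beta_0 and \beta_1 (the scale, the number of draws and the seed are arbitrary choices) and compares the prior spread of each group's logit under the two codings.

```r
set.seed(1)      # arbitrary seed, for reproducibility only
n_draws <- 1e5   # number of prior draws (arbitrary)

# Assumed priors: the same normal(0, 1) on intercept and slope, independent.
beta0 <- rnorm(n_draws, mean = 0, sd = 1)
beta1 <- rnorm(n_draws, mean = 0, sd = 1)

# Prior draws of each group's logit under the two codings of x.
logits <- data.frame(
  control_01     = beta0 + beta1 * 0,
  treatment_01   = beta0 + beta1 * 1,
  control_half   = beta0 + beta1 * (-0.5),
  treatment_half = beta0 + beta1 * 0.5
)

# Prior standard deviation of each group's logit.
prior_sd <- sapply(logits, sd)
print(round(prior_sd, 2))

# Implied prior on the probability scale: central 90% interval per group.
prior_prob <- sapply(logits, function(l) quantile(plogis(l), c(0.05, 0.95)))
print(round(prior_prob, 2))
```

In theory the prior standard deviations are 1 and sqrt(2) ≈ 1.41 under (0, 1) coding, and sqrt(1.25) ≈ 1.12 for both groups under (-0.5, 0.5) coding; the simulated values should be close to these.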

Max_Mantei · June 14, 2019, 11:37am

This helped me to wrap my head around it:

> library(dplyr)  # provides tibble(), mutate() and %>%
> 
> ### case 1: control group and treatment group about equally sized
> #################################################################
>
> df <- tibble(x = rbinom(10000, 1, 0.5), y = 3 + 0.5*x + rnorm(10000))
> 
> df <- df %>% mutate(x_centered = x - 0.5,
+                     x_mean_centered = x - mean(x))
> 
> lm(y ~ x, data = df) %>% summary()

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5372 -0.6767 -0.0009  0.6635  3.7522 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.98743    0.01393  214.46   <2e-16 ***
x            0.51158    0.01984   25.78   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.992 on 9998 degrees of freedom
Multiple R-squared:  0.06234,	Adjusted R-squared:  0.06225 
F-statistic: 664.8 on 1 and 9998 DF,  p-value: < 2.2e-16

> 
> lm(y ~ x_centered, data = df) %>% summary()

Call:
lm(formula = y ~ x_centered, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5372 -0.6767 -0.0009  0.6635  3.7522 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.243216   0.009921  326.91   <2e-16 ***
x_centered  0.511576   0.019841   25.78   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.992 on 9998 degrees of freedom
Multiple R-squared:  0.06234,	Adjusted R-squared:  0.06225 
F-statistic: 664.8 on 1 and 9998 DF,  p-value: < 2.2e-16

> 
> lm(y ~ x_mean_centered, data = df) %>% summary()

Call:
lm(formula = y ~ x_mean_centered, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5372 -0.6767 -0.0009  0.6635  3.7522 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3.23958    0.00992  326.58   <2e-16 ***
x_mean_centered  0.51158    0.01984   25.78   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.992 on 9998 degrees of freedom
Multiple R-squared:  0.06234,	Adjusted R-squared:  0.06225 
F-statistic: 664.8 on 1 and 9998 DF,  p-value: < 2.2e-16

> 
> ### case 2: control group and treatment group NOT equally sized
> ###############################################################
> 
> df <- tibble(x = rbinom(10000, 1, 0.75), y = 3 + 0.5*x + rnorm(10000))
> 
> df <- df %>% mutate(x_centered = x - 0.5,
+                     x_mean_centered = x - mean(x))
> 
> lm(y ~ x, data = df) %>% summary()

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8575 -0.6689  0.0011  0.6831  3.9168 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.02229    0.02051  147.32   <2e-16 ***
x            0.48998    0.02350   20.85   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 9998 degrees of freedom
Multiple R-squared:  0.04168,	Adjusted R-squared:  0.04159 
F-statistic: 434.9 on 1 and 9998 DF,  p-value: < 2.2e-16

> 
> lm(y ~ x_centered, data = df) %>% summary()

Call:
lm(formula = y ~ x_centered, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8575 -0.6689  0.0011  0.6831  3.9168 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.26727    0.01175  278.11   <2e-16 ***
x_centered   0.48998    0.02350   20.85   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 9998 degrees of freedom
Multiple R-squared:  0.04168,	Adjusted R-squared:  0.04159 
F-statistic: 434.9 on 1 and 9998 DF,  p-value: < 2.2e-16

> 
> lm(y ~ x_mean_centered, data = df) %>% summary()

Call:
lm(formula = y ~ x_mean_centered, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8575 -0.6689  0.0011  0.6831  3.9168 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.3958     0.0100  339.52   <2e-16 ***
x_mean_centered   0.4900     0.0235   20.85   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 9998 degrees of freedom
Multiple R-squared:  0.04168,	Adjusted R-squared:  0.04159 
F-statistic: 434.9 on 1 and 9998 DF,  p-value: < 2.2e-16
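
The intercepts in the output above can be cross-checked by hand. The sketch below copies the reported estimates into R (the variable names are made up for illustration) and checks that the (-0.5, 0.5) intercept equals the (0, 1) intercept plus half the slope, and that the mean-centered intercept implies a treated share near the simulated 0.5 and 0.75.

```r
# Estimates copied from the lm() summaries above (case 1, then case 2).
a_01   <- c(case1 = 2.98743,  case2 = 3.02229)  # intercept, x coded 0/1
slope  <- c(case1 = 0.51158,  case2 = 0.48998)  # slope, identical in all fits
a_half <- c(case1 = 3.243216, case2 = 3.26727)  # intercept, x - 0.5
a_mean <- c(case1 = 3.23958,  case2 = 3.3958)   # intercept, x - mean(x)

# Intercept under (-0.5, 0.5) coding = control estimate + half the difference.
midpoint <- a_01 + 0.5 * slope
print(round(midpoint - a_half, 4))  # differences are only rounding error

# Mean-centered intercept = control estimate + slope * mean(x),
# so the reported numbers imply the sample share of treated units.
implied_share <- (a_mean - a_01) / slope
print(round(implied_share, 3))
```

In case 1 the two centered intercepts nearly agree because the groups are about equal in size. In case 2 the mean-centered intercept moves toward the larger treatment group, while the (-0.5, 0.5) intercept stays at the midpoint of the two group estimates.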
