Intercept interpretation in regression model with one binary predictor

Hi everyone,

I’m just starting out with Stan, so please bear with me.

I have fit a linear regression model with a continuous response and just one binary predictor using Stan in R. I am essentially comparing the mean of two groups. The estimate for the regression coefficient of the binary predictor (i.e. the difference between the two groups) is the same as when using the lm() function in R, but the estimate for the intercept is different.

For info, the (sample) mean of group A is 1.46 and the (sample) mean of group B is 4.26. The difference between the two is 2.8, which is equal to the regression coefficient for X. Using lm(), the intercept is 1.46, which is the mean for group A. Using Stan, the intercept is -1.34. To get the mean of group A we have to add -1.34 and 2.8 (equal to 1.46), to get the mean of group B we have to add -1.34 and 2x2.8 (equal to 4.26). This seems so strange to me. Is this always the case?

I know there are more straightforward ways to compare two group means, but this is not my focus. I just want to understand how to interpret the intercept here.

Thank you!

Stan code for ‘linreg’ model:

data{
  int<lower=0> N;
  vector[N] y;
  vector[N] x;
}

parameters{
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model{
  y ~ normal(alpha + beta * x, sigma);
}

R code:

data(iris)
index <- which(iris$Species %in% c('versicolor', 'setosa'))
x <- as.integer(iris$Species[index])
y <- iris$Petal.Length[index]
data <- list(N=length(y), x=x, y=y)

fit <- stan(file='linreg.stan', data=data)

Results:

mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha -1.34    0.00 0.12 -1.56 -1.42 -1.34 -1.26 -1.11  1575    1
beta   2.80    0.00 0.07  2.66  2.75  2.80  2.85  2.94  1580    1
sigma  0.36    0.00 0.03  0.31  0.34  0.36  0.37  0.41  1901    1
lp__  52.10    0.03 1.25 48.99 51.51 52.43 53.02 53.52  1382    1

Welcome to the Stan Forum.

Could you share your code for lm()? I suspect that you’re actually running a dummy-coded model there with Group A as the reference category. In that case, the model is recoding group indicators as 0/1, thus the intercept aligning with the Group A mean.

By contrast, your Stan model is treating the group indicator as numeric data. Since Group A = 1 and Group B = 2, the intercept correspondents to x = 0, which doesn’t align with any group. You can easily replicate this with lm() (copying your code).

data(iris)
index <- which(iris$Species %in% c(‘versicolor’, ‘setosa’))
x <-  as.integer(iris$Species[index])
y <-  iris$Petal.Length[index]
data <- list(N=length(y), x=x, y=y)

lm(y ~ x)

The simplest option would be to use the same Stan code and subtract 1 from x. That should give you the same results.

That totally clears it up!

Yes, indeed, with lm() the intercept represents the mean of group A here. This is the code I was using:

lm(Petal.Length ~ Species, data=iris[index,])

With a 0/1 coding, Stan then gives me the same results.

Thank you so much! You have saved me from having to ask a stupid question to my supervisor!