# Intercept + Dot Product: What Does it Mean?

I am curious about what the intercept means when you include it with an indexing parameter in a linear model. Here are some data to illustrate my example. There are three groups (variable `group`, with three levels `standard`, `new1`, and `new2`) with average `score`s of 6, 12, and 24. I want to estimate the group average scores using a sum-to-zero parameterization, in which the coefficients represent deflections from a grand mean.

Note that in the `transformed parameters` block I recalculate the group means from the original parameters `a0` (intercept) and `a[1:3]` used in the `a0 + a[index]` model, and then convert them into deflection parameters `b[1:3]`, representing deflections from the grand mean `b0`.

```r
library(dplyr)
library(rstan)
set.seed(12345)

# create data: three groups with true means 6, 12, 24
df <- data.frame(group = factor(rep(c("standard", "new1", "new2"), times = 40),
                                levels = c("standard", "new1", "new2")),
                 score = rnorm(120, c(6, 12, 24), 0.5))

# put into a list for Stan
dList <- list(N = nrow(df),
              nLevels = nlevels(df$group),
              gIndex = as.integer(df$group),
              score = df$score,
              gMean = mean(df$score),
              gSD = sd(df$score))

# model
write("
data{
  int<lower=1> N;
  int<lower=1> nLevels;
  real gMean;
  real gSD;
  int<lower=1,upper=nLevels> gIndex[N];
  real score[N];
}

parameters{
  real a0;
  vector[nLevels] a;
  real<lower=0> sigma;
}

transformed parameters{
  real b0;
  vector[nLevels] m;
  vector[nLevels] b;

  for (j in 1:nLevels) {
    m[j] = a0 + a[j];
  }
  b0 = mean(m[1:nLevels]);

  for (j in 1:nLevels) {
    b[j] = m[j] - b0;
  }
}

model{
  a0 ~ normal(gMean, 5*gSD);
  a ~ normal(0, 5*gSD);
  sigma ~ cauchy(0, 2);
  score ~ normal(a0 + a[gIndex], sigma);
}
", file = "temp.stan")

# generate chains
fit <- stan(file = "temp.stan",
            data = dList,
            warmup = 1e3,
            iter = 2e3,
            cores = 1,
            chains = 1)

# put the draws into a data frame
chains <- data.frame(extract(fit, pars = c("a0", "a", "b0", "b")))
```

Now we calculate the level means twice: once from `a0` and the `a`'s, and once from the reparameterized `b0` and `b`'s.

```r
chains <- mutate(chains,
                 standard_out = a0 + a.1,
                 new1_out = a0 + a.2,
                 new2_out = a0 + a.3,
                 standard_in = b0 + b.1,
                 new1_in = b0 + b.2,
                 new2_in = b0 + b.3)
```

…and put all the coefficients into a table

```r
bayesCoefs <- apply(chains, 2, mean)
output <- data.frame(aCoefs = bayesCoefs[1:4],
                     bCoefs = bayesCoefs[5:8],
                     levelMeans_a = c(NA, bayesCoefs[9:11]),
                     levelMeans_b = c(NA, bayesCoefs[12:14]))
row.names(output) <- c("intercept", "standard", "new1", "new2")
output

# output

#              aCoefs    bCoefs levelMeans_a levelMeans_b
# intercept 13.269875 14.109226           NA           NA
# standard  -7.171809 -8.011159     6.098066     6.098066
# new1      -1.101934 -1.941284    12.167942    12.167942
# new2      10.791795  9.952444    24.061670    24.061670
```

The output shows several things.

• First, the group means calculated outside the model straight from the `a0 + a[gIndex]` parameters, and the group means calculated inside Stan (in the `transformed parameters` block) from the sum-to-zero parameters `b0` and `b[1:3]`, are exactly the same.
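That agreement is not a numerical accident — it holds draw by draw, by construction: `b0 + b[j] = mean(m) + (m[j] - mean(m)) = m[j] = a0 + a[j]`. Here is a minimal sketch of the identity (Python here just for the arithmetic; the draw values are made up for illustration):

```python
# One hypothetical posterior draw (values are made up for illustration).
a0 = 13.3
a = [-7.2, -1.1, 10.8]

# The transformed-parameters block, translated line by line.
m = [a0 + aj for aj in a]     # group means m[j] = a0 + a[j]
b0 = sum(m) / len(m)          # grand mean of the group means
b = [mj - b0 for mj in m]     # deflections b[j] = m[j] - b0

# The two parameterizations give identical group means for this draw.
for aj, bj in zip(a, b):
    assert abs((a0 + aj) - (b0 + bj)) < 1e-12
```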

• Second, the sum-to-zero procedure worked, with `b0` being exactly the same as the grand mean of the scores (or very, very close)

```r
mean(df$score)
# 14.1064

output$bCoefs[1]
# 14.10923
```

…and with the `b` deflection coefficients summing to zero (or very, very close)

```r
sum(output$bCoefs[2:4])
# 6.661338e-16
```

…but not the `a` deflection coefficients

```r
sum(output$aCoefs[2:4])
# 2.518052
```
• Third, the `a` coefficients and the `b` coefficients all differ by the same absolute amount (the intercept shifted in one direction, the deflections in the other)

```r
output$aCoefs - output$bCoefs
# -0.8393508  0.8393508  0.8393508  0.8393508
```
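That shared offset is just the average of the `a` deflections. Since `b0 = mean(a0 + a) = a0 + mean(a)`, we get `a0 - b0 = -mean(a)` and `a[j] - b[j] = mean(a)` for every `j` — the whole `a` solution is the `b` solution shifted by one constant. We can check this against the table above (Python for the arithmetic):

```python
# Posterior means of the a deflections, from the output table above.
a_defl = [-7.171809, -1.101934, 10.791795]
offset = sum(a_defl) / len(a_defl)   # mean of the a deflections

# This reproduces the mysterious constant (up to rounding in the table)...
assert abs(offset - 0.8393508) < 1e-6

# ...and equals b0 - a0 from the intercept row of the table.
assert abs(offset - (14.109226 - 13.269875)) < 1e-6
```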

What is going on here?

What is `a0` and why is `0.8393508` important?

The `a` coefficients are not useless, since they allow us to retrieve the group means accurately (which we can then use to calculate contrasts, etc.), but I am very curious how the `a0` intercept is calculated. In this example there is not a huge difference, but surely you need to know what the `a0` intercept represents when it is accompanied by a vectorised `a[gIndex]` term in order to set a prior on that intercept?

Obviously you could parameterize the likelihood with level-means coding (i.e. `score ~ normal(a[gIndex], sigma)`) or with reference-level coding (i.e. `score ~ normal(a0 + bNew1*isNew1 + bNew2*isNew2, sigma)`, with dummy indicators for the non-reference levels), but there are some advantages to the grand-mean-intercept-plus-deflections strategy, especially when models have several factors with many levels and interactions. So I'm keen to work this out.

That was an interesting puzzle to think about. I think it works as follows.

The `a0 + a[gIndex]` formulation is not identified on its own: there are only three group means, but four location parameters (`a0` plus the three `a`'s) to estimate them. If you add any number (say `0.8393508`) to `a0` and subtract that same number from every `a[j]`, you get exactly the same likelihood. The reason the model still works is that the priors restrict how much you can add or subtract — e.g. under its `normal(gMean, 5*gSD)` prior, `a0` is very likely to stay within about `10*gSD` of `gMean`. So the interpretation of `a0` is that it is an estimate of the grand mean, regularized by its prior (and probably also indirectly by the priors on the deviations `a[gIndex]`).
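You can see the non-identifiability directly: shift `a0` up by any constant `c` and every `a[j]` down by the same `c`, and both the fitted group means and the sum-to-zero parameterization are untouched. A toy check (Python; the draw values are made up for illustration):

```python
def to_sum_to_zero(a0, a):
    """Recompute (b0, b) from (a0, a) as the transformed parameters block does."""
    m = [a0 + aj for aj in a]
    b0 = sum(m) / len(m)
    b = [mj - b0 for mj in m]
    return b0, b

a0, a = 13.3, [-7.2, -1.1, 10.8]   # hypothetical draw
c = 0.8393508                       # any constant works
a0_shifted = a0 + c
a_shifted = [aj - c for aj in a]

# The fitted group means a0 + a[j] are identical under the shift...
assert all(abs((a0 + aj) - (a0_shifted + asj)) < 1e-12
           for aj, asj in zip(a, a_shifted))

# ...and so is the sum-to-zero parameterization (b0, b).
b0, b = to_sum_to_zero(a0, a)
b0s, bs = to_sum_to_zero(a0_shifted, a_shifted)
assert abs(b0 - b0s) < 1e-12
assert all(abs(x - y) < 1e-12 for x, y in zip(b, bs))
```

Only the priors break the tie between the infinitely many `(a0, a)` pairs that map to the same `(b0, b)`.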


Thank you @stijn, you make a lot of sense. I didn't consider that `a0` is really a superfluous parameter. It is interesting that the model finds the grand mean but doesn't quite get there. Kruschke uses this parameterization a lot in his book, and sometimes it is fiendishly difficult to calculate the sum-to-zero parameters, e.g. with split-plot designs where multiple subjects are exposed to both within- and between-subject factors. But despite the complexity I can see the value of it. It is quite neat, even with this extra parameter the model doesn't really need.