I am curious about what the intercept means when you include it alongside an indexing parameter in a linear model. Here are some data to illustrate my example. There are three groups (variable `group`, with three levels: `standard`, `new1`, and `new2`), with average scores of 6, 12, and 24. I want to estimate the group average scores using a sum-to-zero parameterization, where coefficients represent deflections from a grand mean.
Note that in the `transformed parameters` block I have recalculated the group means by transforming the original parameters `a0` (the intercept) and `a[1:3]` generated by the `a0 + a[index]` model, converting them into deflection parameters `b[1:3]`, which represent deflections from the grand mean `b0`.
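Before the full Stan/R listing, the reparameterization itself is just a few lines of arithmetic. A minimal sketch (in Python, with made-up values standing in for a single posterior draw of `a0` and `a`) of what the `transformed parameters` block computes:

```python
# Hypothetical single posterior draw: intercept a0 and indexed coefficients a[1:3]
a0 = 13.0
a = [-7.0, -1.0, 11.0]

# Group means implied by the a0 + a[index] model
m = [a0 + aj for aj in a]

# Grand mean of the group means, and deflections from it
b0 = sum(m) / len(m)
b = [mj - b0 for mj in m]

print(m)        # [6.0, 12.0, 24.0]
print(b0)       # 14.0
print(b)        # [-8.0, -2.0, 10.0]
print(sum(b))   # 0.0 -- the b's sum to zero by construction
```

Whatever value `a0` happens to take, `b0 + b[j]` recovers exactly the same group means as `a0 + a[j]`.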
library(dplyr)
library(rstan)
set.seed(12345)
#create data
df <- data.frame(group = factor(rep(c("standard", "new1", "new2"), times = 40),
                                levels = c("standard", "new1", "new2")),
                 score = rnorm(120, c(6, 12, 24), 0.5))
# put into list
dList <- list(N = nrow(df),
nLevels = nlevels(df$group),
gIndex = as.integer(df$group),
score = df$score,
gMean = mean(df$score),
gSD = sd(df$score))
# model
write("
data{
int<lower=1> N;
int<lower=1> nLevels;
real gMean;
real gSD;
int<lower=1,upper=nLevels> gIndex[N];
real score[N];
}
parameters{
real a0;
vector[nLevels] a;
real<lower=0> sigma;
}
transformed parameters{
real b0;
vector[nLevels] m;
vector[nLevels] b;
for (j in 1:nLevels) {
m[j] = a0 + a[j];
}
b0 = mean(m[1:nLevels]);
for (j in 1:nLevels) {
b[j] = m[j] - b0;
}
}
model{
a0 ~ normal(gMean, 5*gSD);
a ~ normal(0,5*gSD);
sigma ~ cauchy(0,2);
score ~ normal(a0 + a[gIndex], sigma);
}
", file = "temp.stan")
# generate chains
fit <- stan(file = "temp.stan",
data = dList,
warmup = 1e3,
iter = 2e3,
cores = 1,
chains = 1)
# put into data frame
chains <- data.frame(extract(fit, pars = c("a0", "a", "b0", "b")))
Now we calculate the level means using both the `a`'s and the reparameterized `b`'s.
chains <- mutate(chains,
standard_out = a0 + a.1,
new1_out = a0 + a.2,
new2_out = a0 + a.3,
standard_in = b0 + b.1,
new1_in = b0 + b.2,
new2_in = b0 + b.3)
…and put all the coefficients into a table
bayesCoefs <- apply(chains, 2, mean)
output <- data.frame(aCoefs = bayesCoefs[1:4],
bCoefs = bayesCoefs[5:8],
levelMeans_a = c(NA, bayesCoefs[9:11]),
levelMeans_b = c(NA, bayesCoefs[12:14]))
row.names(output) <- c("intercept", "standard", "new1", "new2")
output
# output
#                aCoefs    bCoefs levelMeans_a levelMeans_b
# intercept  13.269875 14.109226           NA           NA
# standard   -7.171809 -8.011159     6.098066     6.098066
# new1       -1.101934 -1.941284    12.167942    12.167942
# new2       10.791795  9.952444    24.061670    24.061670
The output shows several things.

First, the group means calculated outside the model, straight from the `a0 + a[gIndex]` parameters, and the group means calculated from the sum-to-zero parameters `b0` and `b[1:3]` (these parameters calculated inside Stan, within the `transformed parameters` block) are exactly the same.
Second, the sum-to-zero procedure worked, with `b0` being exactly the same as the grand mean of the scores (or very, very close):
mean(df$score)
# [1] 14.1064
output$bCoefs[1]
# [1] 14.10923
…and with the `b` deflection coefficients summing to zero (or very, very close):
sum(output$bCoefs[2:4])
# [1] 6.661338e-16
…but not the `a` deflection coefficients:
sum(output$aCoefs[2:4])
# [1] 2.518052
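The contrast between the two sums is automatic: deflections centred on their own mean always sum to (floating-point) zero, while the `a`'s are centred on whatever `a0` happens to be. A quick check of this arithmetic (in Python for illustration, using the posterior means from the table above, so the last digit is subject to rounding in the printout):

```python
# Posterior-mean group means from the output table above
m = [6.098066, 12.167942, 24.061670]
grand = sum(m) / 3               # the grand mean, which b0 estimates

b = [mj - grand for mj in m]     # deflections centred on the grand mean
a0 = 13.269875                   # the value the sampler settled on for a0
a = [mj - a0 for mj in m]        # deflections centred on a0 instead

print(abs(round(sum(b), 9)))     # 0.0 -- always, by construction
print(round(sum(a), 6))          # 2.518053 -- nonzero, because a0 is not the grand mean
```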
Third, the `a` coefficients and the `b` coefficients all differ by the same absolute amount:
output$aCoefs - output$bCoefs
# [1] -0.8393508  0.8393508  0.8393508  0.8393508
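For what it's worth, that constant offset can be reproduced from the table itself: since `b0` is the mean of the `a0 + a[j]`, the offset `b0 - a0` is just the mean of the `a` deflections. A quick arithmetic check (in Python for illustration, using the posted coefficients, so any tiny discrepancy in the last digit is rounding in the printout):

```python
a0 = 13.269875
a = [-7.171809, -1.101934, 10.791795]   # aCoefs for the three groups, from the table

# b0 = mean(a0 + a[j]) = a0 + mean(a), so the offset b0 - a0 equals mean(a)
offset = sum(a) / len(a)

print(round(offset, 7))        # 0.8393507 -- the constant difference above
print(round(a0 + offset, 6))   # 14.109226 -- reproduces b0
```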
What is going on here? What is `a0`, and why is 0.8393508 important?
The `a` coefficients are not useless, because they allow us to retrieve the group means accurately (which we can then use to calculate contrasts, etc.), but I am very curious as to how the `a0` intercept is calculated. In this example there is not a huge difference, but surely you need to know what the `a0` intercept represents, when it is accompanied by a vectorised indexed term, in order to set a prior on that intercept?

Obviously you could parameterize the likelihood with level-means coding (i.e. `score ~ normal(a[gIndex], sigma)`) or with reference-level coding (i.e. `score ~ normal(a + bNew1 + bNew2, sigma)`, with dummy indicators for the non-reference levels), but there are some advantages to using the grand-mean intercept + deflections strategy, especially when models have several factors, with lots of levels and interactions. So I'm keen to work this out.
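To make the equivalence between the three codings concrete, here is a toy check (in Python for illustration, with made-up numbers matching the simulated group means; the parameter names are illustrative, not from the model above). All three codings describe exactly the same cell means and differ only in which combination of parameters carries the "intercept" label:

```python
target = [6.0, 12.0, 24.0]   # the three cell means

# 1) level-means coding: one free mean per group, no intercept
level_means = [6.0, 12.0, 24.0]

# 2) reference-level coding: intercept = reference group, offsets for the others
ref = 6.0
bNew1, bNew2 = 6.0, 18.0
reference = [ref, ref + bNew1, ref + bNew2]

# 3) grand-mean + deflections: intercept = grand mean, deflections sum to zero
b0 = 14.0
b = [-8.0, -2.0, 10.0]
deflections = [b0 + bj for bj in b]

print(level_means == reference == deflections == target)   # True
```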