Intercept + Dot Product: What Does it Mean?

I am curious about what the intercept means when you include it alongside an indexing parameter in a linear model. Here are some data to illustrate my example. There are three groups (variable group, with three levels: standard, new1, and new2), with average scores of 6, 12, and 24. I want to estimate the group average scores using a sum-to-zero parameterization, where the coefficients represent deflections from a grand mean.

Note that in the transformed parameters block I recalculate the group means from the original parameters a0 (intercept) and a[1:3] of the a0 + a[index] model, and then convert them into deflection parameters b[1:3], representing deflections from the grand mean b0.

library(dplyr)
library(rstan)
set.seed(12345)

#create data
df <- data.frame(group = factor(rep(c("standard", "new1", "new2"), times = 40), levels = c("standard", "new1", "new2")),
                 score = rnorm(120, c(6,12,24), 0.5))

# put into list
dList <- list(N = nrow(df),
              nLevels = nlevels(df$group),
              gIndex = as.integer(df$group),
              score = df$score,
              gMean = mean(df$score),
              gSD = sd(df$score))

# model
write("
      data{
      int<lower=1> N;
      int<lower=1> nLevels;
      real gMean;
      real gSD;
      int<lower=1,upper=nLevels> gIndex[N];
      real score[N];
      }

      parameters{
      real a0;
      vector[nLevels] a;
      real<lower=0> sigma;
      }

      transformed parameters{
      real b0;
      vector[nLevels] m;
      vector[nLevels] b;

        // group means implied by the a0 + a[j] parameterization
        for (j in 1:nLevels) {
          m[j] = a0 + a[j];
        }

      // grand mean of the group means
      b0 = mean(m[1:nLevels]);

        // deflections of each group mean from the grand mean (sum to zero by construction)
        for (j in 1:nLevels) {
          b[j] = m[j] - b0;
        }
      }

      model{
      a0 ~ normal(gMean, 5*gSD);
      a ~ normal(0,5*gSD);
      sigma ~ cauchy(0,2);
      score ~ normal(a0 + a[gIndex], sigma);
      }
      ", file = "temp.stan")

# generate chains
fit <- stan(file = "temp.stan",
            data = dList,
            warmup = 1e3,
            iter = 2e3,
            cores = 1,
            chains = 1)

# put into data frame
chains <- data.frame(extract(fit, pars = c("a0", "a", "b0", "b"))) 

Now we calculate the level means using both the a's and the reparameterized b's…

chains <- mutate(chains, 
                 standard_out = a0 + a.1,
                 new1_out = a0 + a.2,
                 new2_out = a0 + a.3,
                 standard_in = b0 + b.1,
                 new1_in = b0 + b.2,
                 new2_in = b0 + b.3)

…and put all the coefficients into a table

bayesCoefs <- apply(chains, 2, mean)
output <- data.frame(aCoefs = bayesCoefs[1:4], 
                     bCoefs = bayesCoefs[5:8],
                     levelMeans_a = c(NA, bayesCoefs[9:11]),
                     levelMeans_b = c(NA, bayesCoefs[12:14]))
row.names(output) <- c("intercept", "standard", "new1", "new2")
output

# output

#              aCoefs    bCoefs levelMeans_a levelMeans_b
# intercept 13.269875 14.109226           NA           NA
# standard  -7.171809 -8.011159     6.098066     6.098066
# new1      -1.101934 -1.941284    12.167942    12.167942
# new2      10.791795  9.952444    24.061670    24.061670

The output shows several things.

  • First, the group means calculated outside the model, straight from the a0 + a[gIndex] parameters, and the group means calculated from the sum-to-zero parameters b0 and b[1:3] (computed inside Stan in the transformed parameters block) are exactly the same (a draw-by-draw check follows these bullets).

  • Second, the sum-to-zero procedure worked, with b0 being very close to the grand mean of the data

mean(df$score)
# [1] 14.1064

output$bCoefs[1]
# [1] 14.10923

…and with the b deflection coefficients summing to zero (or very close to it)

sum(output$bCoefs[2:4])
# [1] 6.661338e-16

…but not the a deflection coefficients

sum(output$aCoefs[2:4])
# [1] 2.518052

  • Third, the a coefficients and the b coefficients all differ by the same absolute amount

output$aCoefs - output$bCoefs 
# [1] -0.8393508  0.8393508  0.8393508  0.8393508
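
As a quick extra check, the level means computed from the a's and from the b's agree draw by draw, not just in their posterior means:

all.equal(chains$standard_out, chains$standard_in)
all.equal(chains$new1_out, chains$new1_in)
all.equal(chains$new2_out, chains$new2_in)
# each returns TRUE (up to floating-point tolerance)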

What is going on here?

What is a0 and why is 0.8393508 important?

The a coefficients are not useless, because they allow us to retrieve the group means accurately (which we can then use to calculate contrasts etc.), but I am very curious as to how the a0 intercept is calculated. In this example there is not a huge difference, but surely you need to know what the a0 intercept represents when it is accompanied by a vectorised indexing term like a[gIndex] in order to set a prior on that intercept?

Obviously you could parameterize the likelihood with level-means coding (i.e. score ~ normal(a[gIndex], sigma); a sketch of this version follows below) or with reference-level coding (i.e. an intercept a plus dummy-coded deflections bNew1 and bNew2 for the non-reference levels), but there are some advantages to using the grand-mean-intercept-plus-deflections strategy, especially when models have several factors with lots of levels and interactions. So I’m keen to work this out.
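
For concreteness, here is a minimal sketch of what the level-means version could look like, reusing the same dList data (the file name temp_cellmeans.stan is just illustrative, not part of the original code):

write("
      data{
      int<lower=1> N;
      int<lower=1> nLevels;
      real gMean;
      real gSD;
      int<lower=1,upper=nLevels> gIndex[N];
      real score[N];
      }

      parameters{
      vector[nLevels] a;      // one mean per group, no separate intercept
      real<lower=0> sigma;
      }

      model{
      a ~ normal(gMean, 5*gSD);
      sigma ~ cauchy(0, 2);
      score ~ normal(a[gIndex], sigma);
      }
      ", file = "temp_cellmeans.stan")

With this coding each a[j] is directly the group mean, so there is no intercept whose meaning depends on the other terms.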

That was an interesting puzzle to think about. I think it works as follows.

The a0 + a[gIndex] formulation is not identified on its own. There are three means but four location parameters are estimated. If you add any number (say 0.8393508) to a0 and subtract that same number from each a[j], you get exactly the same likelihood. The reason this still works is that the priors restrict how much you can add or subtract, e.g. under its normal(gMean, 5*gSD) prior a0 is very likely to lie within roughly gMean ± 10*gSD. So the interpretation of a0 is that it is an estimate of the grand mean regularized by the prior on the grand mean (and probably also indirectly by the priors on the deviations a[gIndex]).
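
A quick numerical check of this, using the chains data frame built above (with the a.1/a.2/a.3 and b.1/b.2/b.3 column names that data.frame(extract(...)) produces): because b0 = a0 + mean(a) and b[j] = a[j] - mean(a), the two parameterizations differ by exactly the per-draw mean of the a deflections, and the prior is what pins that otherwise-free constant down.

# the shift between the two parameterizations is the per-draw mean of the a's
shift <- rowMeans(chains[, c("a.1", "a.2", "a.3")])

all.equal(chains$b0, chains$a0 + shift)   # TRUE: b0 = a0 + mean(a)
all.equal(chains$b.1, chains$a.1 - shift) # TRUE: b[j] = a[j] - mean(a)

mean(shift)
# ≈ 0.8393508, the constant difference between the aCoefs and bCoefs above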


Thank you @stijn, that makes a lot of sense. I hadn’t considered that a0 is really a superfluous parameter. It is interesting that the model finds the grand mean but doesn’t quite get there. Kruschke uses this parameterization a lot in his book, and sometimes it is fiendishly difficult to calculate the sum-to-zero parameters, e.g. with split-plot designs where multiple subjects are exposed to both within- and between-subject factors. But despite the complexity I can see the value of it. It is quite neat, despite the extra parameter the model doesn’t really need.