Index vs Indicator variables for unordered categories when using multilevel models

I’m curious about people’s approach to modeling unordered categorical response variables using index vs indicator variables. I’ve recently come across an example that causes a substantial difference in estimation, and I am now questioning my understanding of the topic. When I’ve tried to read online about this, the difference between index and indicator variable seems to be represented as a somewhat arbitrary choice by the modeler. However, in the 2nd edition of Statistical Rethinking, Richard McElreath advocates using index variables instead of indicator variables. His justification is that index variables avoids “extra uncertainty” inherent in indicator variables that depend on two parameters - an intercept and slope - and therefore two priors (pg 154 -155).

As I said, I assumed the difference was inconsequential until I came across this situation. In Ch 14 of his book, he produces a varying intercept, varying slope model for this ‘chimpanzee’ dataset (pg 449), and I found that fitted estimates vary rather markedly depending on the approach used for categorical variables. The best approach is not obvious to me.

Below is R code for his model using an indicator approach and then a model using his index approach. Both models are fit using the brms package because that’s what i’m most familiar with. The differences are highlighted when looking at posterior predictions for the 28 different combinations of “actor”, “block” and “treatment”.


library(brms)
library(rethinking)
library(tidyverse)

#data prep 
data(chimpanzees)
d <- chimpanzees
d.z <- d %>%
  as_tibble() %>% 
  select(-recipient) %>%
  mutate(block_id = block) %>% 
  mutate(treatment = 1 + prosoc_left + 2*condition) %>% 
  mutate_all(factor)

# Indicator approach
z.indicator <- 
  brm(formula = pulled_left ~ (1 + treatment|actor) + (1 + treatment|block) + treatment,
      data = d.z, family = bernoulli,
      prior = c(prior(normal(0, 1), class = b),
                prior(exponential(2), class = sd),
                prior(lkj(4), class = cor)),
      cores = 4)

# Index approach - this reproduces McElreath's results 
z.index <- 
  brm(formula = bf(pulled_left ~ 0 + tx + a + b,
                   tx ~ 0 + treatment,
                   a ~ 0 + (0 + treatment|actor),
                   b ~ 0 + (0 + treatment|block),
                   nl =TRUE),
      data = d.z, family = bernoulli,
      prior = c(prior(normal(0, 1), class = b, nlpar = "tx"),
                prior(exponential(1), class = sd, nlpar = "a"),
                prior(exponential(1), class = sd, nlpar = "b"),
                prior(lkj(2), class = cor)),
      cores = 4)

# data for posterior predictions - one for each combo of treatment, actor and block = 28 rows
datp <- tibble(
  actor=rep(1:7,each=4) ,
  treatment=rep(1:4,times=7) ,
  block=rep(5,times=4*7) )

# posterior predictions 
indicator.post  <- fitted(z.indicator, newdata = datp)
index.post  <- fitted(z.index, newdata = datp)

indicator.post
index.post

Not only are the estimates different, the errors for the indicator approach are also smaller than for the index variable. Take row 25 for both posteriors where the estimate for “pulled left” for the indicator is 0.86 with an estimated error of 0.055, and the index has an estimate of 0.75 with an error of 0.11

So, after all of that, my question is which is preferable to you? Is there some sort of error in my coding or understanding? Is there a mathematical justification for choose one over the other that I am not appreciating?

I am hoping someone can put this little annoying detail to rest for me!
Thank you!

-Mark

When I think of this, I think of doing random effects using for loops or encoding things as matrices, but either way it’s for computing the same thing by definition in my head.

There’s probably some detail of categorical responses or something I’m assuming incorrectly here.

Anyway, to see the difference in the two models you wrote, it’s probably best to look at the code using make_stancode (and similarly make_standata to see the data for these models). Something like this will write two files “indicator.stan” and “index.stan” and you can dig through them and see the difference:

# Indicator approach
writeLines(make_stancode(formula = pulled_left ~ (1 + treatment|actor) + (1 + treatment|block) + treatment,
      data = d.z, family = bernoulli,
      prior = c(prior(normal(0, 1), class = b),
                prior(exponential(2), class = sd),
                prior(lkj(4), class = cor)),
      cores = 4), "indicator.stan")

# Index approach - this reproduces McElreath's results 
writeLines(make_stancode(formula = bf(pulled_left ~ 0 + tx + a + b,
                   tx ~ 0 + treatment,
                   a ~ 0 + (0 + treatment|actor),
                   b ~ 0 + (0 + treatment|block),
                   nl =TRUE),
      data = d.z, family = bernoulli,
      prior = c(prior(normal(0, 1), class = b, nlpar = "tx"),
                prior(exponential(1), class = sd, nlpar = "a"),
                prior(exponential(1), class = sd, nlpar = "b"),
                prior(lkj(2), class = cor)),
      cores = 4), "index.stan")

The priors look slightly different and also the second is missing a global intercept, so there are differences here.