I’m curious about people’s approach to modeling unordered categorical response variables using index vs indicator variables. I’ve recently come across an example that causes a substantial difference in estimation, and I am now questioning my understanding of the topic. When I’ve tried to read online about this, the difference between index and indicator variable seems to be represented as a somewhat arbitrary choice by the modeler. However, in the 2nd edition of Statistical Rethinking, Richard McElreath advocates using index variables instead of indicator variables. His justification is that index variables avoids “extra uncertainty” inherent in indicator variables that depend on two parameters - an intercept and slope - and therefore two priors (pg 154 -155).
As I said, I assumed the difference was inconsequential until I came across this situation. In Ch 14 of his book, he produces a varying intercept, varying slope model for this ‘chimpanzee’ dataset (pg 449), and I found that fitted estimates vary rather markedly depending on the approach used for categorical variables. The best approach is not obvious to me.
Below is R code for his model using an indicator approach and then a model using his index approach. Both models are fit using the brms package because that’s what i’m most familiar with. The differences are highlighted when looking at posterior predictions for the 28 different combinations of “actor”, “block” and “treatment”.
library(brms)
library(rethinking)
library(tidyverse)
#data prep
data(chimpanzees)
d <- chimpanzees
d.z <- d %>%
as_tibble() %>%
select(-recipient) %>%
mutate(block_id = block) %>%
mutate(treatment = 1 + prosoc_left + 2*condition) %>%
mutate_all(factor)
# Indicator approach
z.indicator <-
brm(formula = pulled_left ~ (1 + treatment|actor) + (1 + treatment|block) + treatment,
data = d.z, family = bernoulli,
prior = c(prior(normal(0, 1), class = b),
prior(exponential(2), class = sd),
prior(lkj(4), class = cor)),
cores = 4)
# Index approach - this reproduces McElreath's results
z.index <-
brm(formula = bf(pulled_left ~ 0 + tx + a + b,
tx ~ 0 + treatment,
a ~ 0 + (0 + treatment|actor),
b ~ 0 + (0 + treatment|block),
nl =TRUE),
data = d.z, family = bernoulli,
prior = c(prior(normal(0, 1), class = b, nlpar = "tx"),
prior(exponential(1), class = sd, nlpar = "a"),
prior(exponential(1), class = sd, nlpar = "b"),
prior(lkj(2), class = cor)),
cores = 4)
# data for posterior predictions - one for each combo of treatment, actor and block = 28 rows
datp <- tibble(
actor=rep(1:7,each=4) ,
treatment=rep(1:4,times=7) ,
block=rep(5,times=4*7) )
# posterior predictions
indicator.post <- fitted(z.indicator, newdata = datp)
index.post <- fitted(z.index, newdata = datp)
indicator.post
index.post
Not only are the estimates different, the errors for the indicator approach are also smaller than for the index variable. Take row 25 for both posteriors where the estimate for “pulled left” for the indicator is 0.86 with an estimated error of 0.055, and the index has an estimate of 0.75 with an error of 0.11
So, after all of that, my question is which is preferable to you? Is there some sort of error in my coding or understanding? Is there a mathematical justification for choose one over the other that I am not appreciating?
I am hoping someone can put this little annoying detail to rest for me!
Thank you!
-Mark