Here: Representing categorical variables in Stan, I asked about coding categorical variables in Stan.
However, how should I go about coding reference levels for my factors?
Specifically, when using R’s
glm(), assigning reference levels is handled automatically.
For Stan, do I first need to coerce categories to factors via
as.factor() then convert the factors to integers using
as.integer()? Doing this however keeps all levels, instead of using one as the reference.
The index variable approach you’re using in the original thread is inherently agnostic about which group is the reference group, so it doesn’t lend itself as naturally to having a reference level as the dummy variable/one-hot encoded approach also described in the linked StackOverflow thread.
However, if it’s important to you to have, say, the first level in each grouping factor be a reference level, and you want to keep the index variable structure, perhaps you could code something in the
generated quantities block to compute
alpha[i] - alpha for each
i > 1 so that your model would output differences from the first level in addition to the raw intercepts.
Thanks @mpdylan! This sounds like a good approach.
Just a note. You say:
but the model that you are working with contains random effects and cannot be fit by
glm. This might be a source of some of your confusion, because the need for a reference level is a feature of the fixed effects specification. In a model such as yours, where a factor variable enters only as a grouping factor for random effects, the intercept corresponds to the grand mean (i.e. the average city) rather than to any particular reference city.