Categorical factor coding in Stan

Hi all,

I’m building a Stan model by hand and would like to incorporate a four-level unordered factor as a predictor in a regression component. Right now my instinct is to just use three distinct binary indicators (with the fourth level absorbed into the intercept), but I wasn’t sure if there was a more efficient / supported way to do this.

Best,
Canaan Breiss

Turning a categorical variable into a set of numeric predictors is called “contrast coding”, and the topic you’re asking about might be searchable via “choosing contrast coding” or “contrast coding choice”. Absent any priors, all contrast choices are mathematically equivalent (they are reparameterizations of the same model and give the same fitted values), but in the context of Bayesian computing, some contrast coding choices are easier to think about than others (and some may change the geometry of the parameter space for easier sampling, but that’s somewhat hard to predict). For example, with a two-level variable, I like to use half-sum contrasts, where the first contrast is an intercept (the mean across conditions, something I usually have a decent prior on but don’t really care much about inferentially) and the second contrast is the difference between conditions (something I usually have less prior information on but do care quite a bit about inferentially). With more than two conditions there are a variety of contrast options. Take a look here and really just go with whatever helps you create the most meaningful-to-you priors.
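To make the two-level case concrete, here is a minimal sketch in base R. The toy data frame `d` and the names `fit_trt` and `fit_half` are made up for illustration, and `lm()` stands in for the no-prior case only; it is not meant as the Stan model itself.

```r
# Toy data (hypothetical): two conditions with cell means 2 (a) and 6 (b).
d <- data.frame(
  y    = c(1, 3, 5, 7),
  cond = factor(c("a", "a", "b", "b"))
)

# Default treatment (dummy) coding, kept for comparison.
fit_trt <- lm(y ~ cond, data = d)

# Half-sum coding: level a = -0.5, level b = +0.5.
d_half <- d
contrasts(d_half$cond) <- matrix(c(-0.5, 0.5), ncol = 1)
fit_half <- lm(y ~ cond, data = d_half)

coef(fit_trt)   # intercept = mean of condition a; slope = b minus a
coef(fit_half)  # intercept = mean across conditions; slope = b minus a

# Without priors the two codings give the same fitted values.
all.equal(unname(fitted(fit_trt)), unname(fitted(fit_half)))
```

Under the half-sum coding the intercept is the across-condition mean and the second coefficient is the between-condition difference, so a prior can be placed on each of those quantities directly.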


That’s how you do it. If you’re using rstan (or just using R in general to prep the data), you can specify which contrasts your factor variable should use (sum, treatment, Helmert, and others are available; sum and ‘treatment’ [aka dummy coding] are the most common). E.g., contrasts(mydata$myfactor) <- contr.sum(4); then you can use model.matrix(outcome ~ myfactor, mydata) to create a design matrix. You can then feed that design matrix into Stan as data. I find this to be the easiest approach to constructing what you need.
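Spelled out as a runnable sketch, that workflow might look like the following. The values in `mydata` are invented, and the list element names `N`, `K`, `X`, and `y` are assumptions that would need to match whatever your Stan program declares in its data block.

```r
# Hypothetical data: a four-level unordered factor and a numeric outcome.
mydata <- data.frame(
  outcome  = c(1.2, 0.7, 2.3, 1.9, 3.1, 2.8, 0.4, 0.9),
  myfactor = factor(rep(c("a", "b", "c", "d"), each = 2))
)

# Use sum-to-zero contrasts instead of the default treatment coding.
contrasts(mydata$myfactor) <- contr.sum(4)

# Design matrix: an intercept column plus three contrast columns.
X <- model.matrix(outcome ~ myfactor, mydata)
print(X)

# Data list to pass to Stan; the element names are assumptions and
# must match the names declared in the Stan program's data block.
stan_data <- list(
  N = nrow(X),
  K = ncol(X),
  X = X,
  y = mydata$outcome
)
```

With `contr.sum(4)`, rows for the last level ("d") are coded -1 in all three contrast columns, so the intercept is the mean of the four level means rather than the mean of a reference level.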


I see - thanks a lot for the help, folks.

Hello.

Don’t we need to marginalize the categorical variables?
(and use target += … in the model block)

Do we need to marginalize the dummy variables?

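No. As posed above, the model seeks inference on the influence of a categorical variable, so the quantities being estimated are continuous coefficients. The categorical variable itself is observed data that enters through the design matrix; marginalization is only needed for discrete unknown parameters, and there are none here.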