Symmetric weakly informative priors for categorical predictors

Suppose I have a model with multiple categorical predictors, e.g.
response ~ ethnicity + religion
What’s the recommendation for setting up a weakly informative prior in this situation?

It wouldn’t make sense to use independent priors (e.g. Normal(0,1)) on all coefficients here, because then the prior predictive is asymmetric (the left-out category has much less variance!)… So instead I’m usually inclined to just rewrite the model as something like:
response ~ (1 | ethnicity) + (1 | religion)
…and set the σ prior to a constant (see the brms sketch at the end of this post). However:
(a) This feels inefficient, since it forces Stan to infer something I don’t care about: the μ for a population of unobserved ethnicities (/religions).
(b) The coefficients are no longer interpretable the way I’d like them to be: for example, the posterior on the intercept (and on the random effects) will retain uncertainty even in the infinite-data limit.

It feels like it should be possible to write the model without random effects at all, using e.g. mean-centered predictors, and then just add some negative correlation to the prior so that the implied prior-predictive distribution is the same for the left-out ethnicity (/religion) as for the others. Does anybody have any good tricks for this situation?
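For reference, the random-effects workaround mentioned above would look roughly like this in brms. This is only a sketch: d is a hypothetical data frame with these columns, and if I remember right the constant() prior needs a reasonably recent brms version.

```r
library(brms)

# Fixing the group-level sds at 1 gives each category effect an iid
# N(0, 1) prior, so the prior predictive is symmetric across categories.
priors <- c(
  set_prior("constant(1)", class = "sd", group = "ethnicity"),
  set_prior("constant(1)", class = "sd", group = "religion")
)

fit <- brm(
  response ~ (1 | ethnicity) + (1 | religion),
  data  = d,
  prior = priors
)
```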


One option is to use effects coding rather than dummy coding for the categorical predictors. For a binary predictor, effects coding means using -1/1 instead of the 0/1 of dummy coding. For a categorical predictor with more levels, dummy coding expands to one column of zeros and ones for every category except the reference category. Replace the zeros with -1’s and you have effects coding.

I’m reasonably certain that Agresti mentions this solution somewhere in his book “Categorical Data Analysis”.

Not mentioned by Agresti, to my knowledge, is a knock-on benefit of effects coding over dummy coding: you get the benefits of @andrewgelman’s recommendation to standardize continuous covariates by dividing by two standard deviations while just doing the “usual” thing of dividing by one standard deviation.
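To unpack that: with roughly balanced groups, a -1/1 coded predictor already has standard deviation about 1, while a 0/1 dummy has standard deviation about 0.5, which is why dummy coding needs the two-sd rescaling to be comparable. A quick simulated check (the numbers here are illustrative, not from any real data):

```r
set.seed(1)
x_dummy <- rbinom(1e5, 1, 0.5)  # 0/1 dummy coding, balanced groups
x_eff   <- 2 * x_dummy - 1      # -1/1 effects coding of the same variable

sd(x_dummy)  # ~0.5: needs the "divide by two sds" convention to compare
sd(x_eff)    # ~1.0: already on the one-sd scale of a standardized covariate
```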


Hi, yes, I discussed that point in this post from 2010: https://statmodeling.stat.columbia.edu/2010/04/12/a_question_abou_9/


Cool, thanks, I hadn’t come across that before. I’ll check out the book!

But just to be clear, using independent & identical priors for a variable coded this way would still imply an asymmetric prior predictive. E.g. if the coding were:

g  e1  e2  e3
1:  1   0   0
2:  0   1   0
3:  0   0   1
4: -1  -1  -1

with a N(0,1) prior on the regression coefficients for e1, e2, e3, the prior predictive when g=4 will have three times the variance it has when g=1,2,3, right?
(Whereas if the code for g=4 were (-1/√3, -1/√3, -1/√3), then I guess things would be balanced, but the parameters would have a weird interpretation…)
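One way to check that claim: with iid N(0,1) coefficients, the coefficients’ contribution to the prior-predictive variance is the squared row norm of the coding matrix.

```r
# Squared row norms = prior-predictive variance contribution per category
X <- rbind(c( 1,  0,  0),
           c( 0,  1,  0),
           c( 0,  0,  1),
           c(-1, -1, -1))
rowSums(X^2)  # 1 1 1 3: g = 4 gets three times the variance of the others
```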

Replace all the zeros with -1’s
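That is, every row of the coding matrix becomes all ±1’s, so every category gets the same squared row norm and hence the same prior-predictive variance:

```r
# All-±1 coding: replace every 0 in the sum-coded matrix with -1
X <- rbind(c( 1, -1, -1),
           c(-1,  1, -1),
           c(-1, -1,  1),
           c(-1, -1, -1))
rowSums(X^2)  # 3 3 3 3: symmetric across all four categories
```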


😮 Genius! Exactly the kind of neat trick I was hoping for, thanks.

(I was confused because the first Google hit for effects coding has it this way.)


Yikes, maybe I’ve been using the term “effects coding” wrong for quite a while…
