Suppose I have a model with multiple categorical predictors, e.g. response ~ ethnicity + religion
What’s the recommendation for setting up a weakly informative prior in this situation?
It wouldn’t make sense to use independent priors (e.g. Normal(0,1)) on all coefficients here, because then the prior predictive is asymmetric (the left-out category has much less variance!)… So instead I’m usually inclined to just rewrite the model as something like: response ~ (1 | ethnicity) + (1 | religion)
…and set the σ prior to a constant (see the brms sketch after this list). However:
(a) This feels inefficient, since it’s forcing Stan to infer something I don’t care about – the μ for a population of unobserved ethnicities (/religions).
(b) The coefficients are no longer interpretable as I’d like them to be – for example, the posterior on the intercept (and the random effects) will continue to have uncertainty even in the infinite data limit.
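For reference, a minimal brms sketch of the version described above, assuming brms’s constant() prior to pin the group-level sd (the data frame d and its columns are stand-ins):

```r
library(brms)

# d is a hypothetical data frame with columns response, ethnicity, religion.
# constant(1) fixes the group-level sd, so only the group effects are inferred.
fit <- brm(
  response ~ (1 | ethnicity) + (1 | religion),
  data  = d,
  prior = prior(constant(1), class = sd)
)
```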
It feels like it should be possible to write the model without random effects at all, using e.g. mean-centered predictors, and then just add some negative correlation to the prior so that the implied prior-predictive distribution is the same for the left-out ethnicity (/religion) as for the others. Does anybody have any good tricks for this situation?
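For what it’s worth, one concrete version of that negative-correlation trick (a sketch, not an established recipe): draw K iid normals and center each draw. This induces pairwise correlation −1/(K−1) among the effects and gives every level, including the left-out one, the same prior-predictive variance:

```r
# Sum-to-zero effects with a symmetric prior predictive (simulation check).
K <- 4; sigma <- 1; n_sims <- 1e5
raw   <- matrix(rnorm(n_sims * K, 0, sigma), ncol = K)
alpha <- raw - rowMeans(raw)    # center each draw: effects sum to zero
round(cor(alpha), 2)            # off-diagonals ~ -1/(K-1) = -0.33
round(apply(alpha, 2, var), 2)  # equal variance across all K levels
```

Equivalently, drop the K-th effect and put a multivariate normal prior with covariance σ²(I − J/K) on the remaining K−1 coefficients, where J is the all-ones matrix.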
One option is to use effects coding rather than dummy coding for the categorical predictors. For a binary predictor, effects coding means using -1/1 instead of 0/1 (which is dummy coding). For a categorical predictor, dummy coding expands to a column of zeros and ones for every category except the reference category, whose rows are all zeros. Replace the reference category’s rows of zeros with -1’s and you have effects coding.
I’m reasonably certain that Agresti mentions this solution somewhere in his book “Categorical data analysis”.
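In R this is the built-in sum-to-zero contrast, contr.sum (a quick illustration):

```r
g <- factor(1:4)
contrasts(g) <- contr.sum(4)  # sum-to-zero ("effects") coding
model.matrix(~ g)
#   (Intercept) g1 g2 g3
# 1           1  1  0  0
# 2           1  0  1  0
# 3           1  0  0  1
# 4           1 -1 -1 -1
```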
Cool thanks, haven’t come across that before – I’ll check out the book!
But just to be clear, using independent & identical priors for a variable coded this way would still imply an asymmetric prior predictive. E.g. if the coding were:
g    e1  e2  e3
1:    1   0   0
2:    0   1   0
3:    0   0   1
4:   -1  -1  -1
with a N(0,1) prior on the regression coefficients for e1, e2, e3, then the prior predictive when g=4 will have 3x the variance of g=1,2,3, right?
(Whereas if the code for g=4 were (-1/√3, -1/√3, -1/√3) then I guess things would be balanced, but the parameters would have a weird interpretation…)
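A quick simulation bears out both points (a sketch):

```r
beta <- matrix(rnorm(3 * 1e5), nrow = 3)  # iid N(0,1) coefficients
X <- rbind(diag(3), c(-1, -1, -1))        # effects coding, g = 1..4
round(apply(X %*% beta, 1, var), 2)       # ~ 1, 1, 1, 3
X2 <- rbind(diag(3), rep(-1/sqrt(3), 3))  # rescaled row for g = 4
round(apply(X2 %*% beta, 1, var), 2)      # ~ 1, 1, 1, 1
```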