I’m working with a hierarchical multinomial logit (a.k.a. “categorical_logit” in Stan) model and I’m running into an issue that could be relevant to any hierarchical model involving categorical regressors. I have categorical predictor variables, encoded with effects coding, and I want to ensure that the levels are treated symmetrically.
This isn’t difficult for a single-level model: Section 1.7 of the Stan User’s Guide, “Parameterizing centered vectors”, describes various approaches. Another approach is to use a prior on the coefficients for a K-level categorical variable of the form
\beta_{1:(K-1)} \sim \mathrm{multinormal}(\mathbf{0}, \Sigma) \\
\Sigma_{i,i} = \sigma^2, \quad 1 \leq i < K \\
\Sigma_{i,j} = -\sigma^2 / (K - 1), \quad i\neq j, 1\leq i,j < K \\
\beta_{K} \triangleq -\sum_{i=1}^{K-1}\beta_i
which ensures that permuting the K levels leaves the joint prior on (\beta_1, \ldots, \beta_K) unchanged. We’ll call a matrix having the same form as 𝛴 above an effects covariance matrix.
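As a quick numerical check of this symmetry claim, here is a sketch (numpy, with an arbitrary choice of K and σ² for illustration) that builds the effects covariance matrix, extends it to the implied covariance of the full K-vector including \beta_K, and confirms that the result has the same exchangeable structure:

```python
import numpy as np

def effects_cov(K, sigma2=1.0):
    """(K-1)x(K-1) effects covariance matrix: sigma^2 on the
    diagonal, -sigma^2/(K-1) off the diagonal."""
    S = np.full((K - 1, K - 1), -sigma2 / (K - 1))
    np.fill_diagonal(S, sigma2)
    return S

K, sigma2 = 4, 2.5
S = effects_cov(K, sigma2)

# Linear map taking beta_{1:K-1} to the full K-vector with
# beta_K = -sum(beta_{1:K-1}):
A = np.vstack([np.eye(K - 1), -np.ones(K - 1)])
full = A @ S @ A.T  # covariance of (beta_1, ..., beta_K)

# The full K x K covariance is itself an "effects covariance matrix":
# constant diagonal, constant off-diagonal -- so any permutation of
# the K levels leaves the joint prior unchanged.
assert np.allclose(np.diag(full), sigma2)
assert np.allclose(full[~np.eye(K, dtype=bool)], -sigma2 / (K - 1))
```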
The problem is ensuring symmetrical treatment when you have a hierarchical model in which the full covariance matrix 𝛴 is a model parameter to be inferred:
\Sigma \sim \mbox{?} \\
\beta_r \sim \mathrm{multinormal}(\mathbf{0}, \Sigma), \quad 1 \leq r \leq R \\
y_{r,n} \sim \mathrm{categorical\_logit}(X_{r,n} \beta_r), \quad 1 \leq r \leq R,\ 1 \leq n \leq N \\
X_{r,n} = \mbox{an } M \times P \mbox{ design matrix} \\
M = \mbox{number of possible outcomes} \\
P = \mbox{number of predictors}
What sort of prior do we use for 𝛴? One possibility is an inverse Wishart prior with a block diagonal scale matrix, with the blocks for (the encodings of) categorical variables being effects covariance matrices. This does work, but there are known shortcomings of using Wishart priors for covariance matrices.
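To make the inverse Wishart option concrete, here is a sketch (scipy, with hypothetical predictor dimensions) that assembles a block-diagonal scale matrix from effects covariance blocks and checks that the prior mean of 𝛴 equals the scale matrix for a suitably small degrees-of-freedom choice:

```python
import numpy as np
from scipy.linalg import block_diag
from scipy.stats import invwishart

def effects_cov(K, sigma2=1.0):
    """(K-1)x(K-1) effects covariance matrix."""
    S = np.full((K - 1, K - 1), -sigma2 / (K - 1))
    np.fill_diagonal(S, sigma2)
    return S

# Hypothetical setup: one 3-level and one 4-level effects-coded
# predictor, giving a (3-1) + (4-1) = 5-dimensional coefficient vector.
scale = block_diag(effects_cov(3), effects_cov(4))
dim = scale.shape[0]

# For an inverse Wishart, E[Sigma] = scale / (df - dim - 1) when
# df > dim + 1, so df = dim + 2 puts the prior mean at the scale matrix.
df = dim + 2
mean = invwishart.mean(df=df, scale=scale)
assert np.allclose(mean, scale)

# A draw of Sigma from this prior: symmetric positive definite, with
# the effects structure holding in expectation blockwise.
Sigma = invwishart.rvs(df=df, scale=scale, random_state=0)
```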
With Stan it is typical to decompose a covariance matrix into a vector of variances and a correlation matrix, and to use an LKJ prior on the correlation matrix. I have tried, and failed, to find a way of using the LKJ correlation prior that ensures symmetric treatment of the levels of a categorical predictor.
Here is one (failed) attempt. To keep things simple, let’s just consider a single effects-coded categorical predictor variable:
V = \mbox{some effects covariance matrix} \\
L = \mbox{lower Cholesky factor of } V \\
R \sim \mathrm{lkj\_corr}(\eta) \\
\Sigma = L R L'
The above guarantees that
E[\Sigma] = V
since E[R] = I under the LKJ prior,
but it does not treat the levels symmetrically. Because L is lower triangular and R_{1,1} = 1, you always have
\Sigma_{1,1} = L_{1,1}^2 = V_{1,1}
but the other diagonal elements of 𝛴 vary, with the variance increasing the further down the diagonal you go.
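The asymmetry is easy to demonstrate numerically. The sketch below substitutes an arbitrary random correlation matrix for an LKJ draw (the structural point holds for any correlation matrix R, so the exact distribution doesn't matter for this demonstration): \Sigma_{1,1} is pinned to V_{1,1} on every draw, while the later diagonal entries vary:

```python
import numpy as np

def effects_cov(K, sigma2=1.0):
    """(K-1)x(K-1) effects covariance matrix."""
    S = np.full((K - 1, K - 1), -sigma2 / (K - 1))
    np.fill_diagonal(S, sigma2)
    return S

def random_corr(dim, rng):
    """A random correlation matrix (normalized Wishart draw); stands
    in for an LKJ draw, since the point below holds for any R."""
    A = rng.standard_normal((dim + 2, dim))
    W = A.T @ A
    d = np.sqrt(np.diag(W))
    return W / np.outer(d, d)

K, sigma2 = 4, 2.0
V = effects_cov(K, sigma2)
L = np.linalg.cholesky(V)

rng = np.random.default_rng(1)
diags = np.array([np.diag(L @ random_corr(K - 1, rng) @ L.T)
                  for _ in range(2000)])

# Sigma_{1,1} = L_{1,1}^2 * R_{1,1} = V_{1,1} exactly, for every draw...
assert np.allclose(diags[:, 0], V[0, 0])
# ...while the remaining diagonal entries are random.
assert np.all(diags.std(axis=0)[1:] > 0)
```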
And that’s where I am stuck. I expected that somewhere in the research literature someone would have addressed this issue already, but I have been unable to find anything. Does anyone have any good ideas to offer?