Symmetric effects prior for a hierarchical model

I’m working with a hierarchical multinomial logit (a.k.a. “categorical_logit” in Stan) model and I’m running into an issue that could be relevant to any hierarchical model involving categorical regressors. I have categorical predictor variables, encoded with effects coding, and I want to ensure that the levels are treated symmetrically.

This isn’t difficult for a single-level model. The “Parameterizing centered vectors” section (1.7) of the Stan User’s Guide describes various approaches. Another approach is to use a prior on the coefficients for a K-level categorical variable of the form

\beta_{1:(K-1)} \sim \mathrm{multinormal}(\mathbf{0}, \Sigma) \\
\Sigma_{i,i} = \sigma^2, \quad 1 \leq i < K \\
\Sigma_{i,j} = -\sigma^2 / (K - 1), \quad i \neq j,\ 1 \leq i, j < K \\
\beta_{K} \triangleq -\sum_{i=1}^{K-1}\beta_i
which ensures that permuting the levels leaves their joint prior unchanged. We’ll call a matrix having the same form as 𝛴 above an effects covariance matrix.
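To make the symmetry claim concrete, here is a quick numpy check (an illustration I’m adding, not part of the original derivation): build the effects covariance matrix, extend it to all K levels via \beta_K = -\sum_i \beta_i, and confirm that every level gets the same variance and every pair of levels the same covariance.

```python
# Numerically verify that the effects covariance matrix treats all K levels
# symmetrically once beta_K is defined as minus the sum of the others.
import numpy as np

K, sigma2 = 4, 2.5

# (K-1) x (K-1) effects covariance matrix from the post.
S = np.full((K - 1, K - 1), -sigma2 / (K - 1))
np.fill_diagonal(S, sigma2)

# Extend to the singular K x K covariance: A maps beta_{1:K-1} to beta_{1:K}.
A = np.vstack([np.eye(K - 1), -np.ones(K - 1)])
S_full = A @ S @ A.T

assert np.allclose(np.diag(S_full), sigma2)           # equal variances, incl. level K
off = S_full[~np.eye(K, dtype=bool)]
assert np.allclose(off, -sigma2 / (K - 1))            # equal covariances for every pair
print("all", K, "levels are exchangeable under this prior")
```

Since any permutation of the levels just permutes the rows and columns of a matrix whose diagonal and off-diagonal entries are each constant, the joint prior is unchanged.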

The problem is ensuring symmetrical treatment when you have a hierarchical model in which the full covariance matrix 𝛴 is a model parameter to be inferred:

\Sigma \sim \mbox{?} \\
\beta_r \sim \mathrm{multinormal}(\mathbf{0}, \Sigma), \quad 1 \leq r \leq R \\
y_{r,n} \sim \mathrm{categorical\_logit}(X_{r,n} \beta_r), \quad 1 \leq r \leq R,\ 1 \leq n \leq N \\
X_{r,n} = \mbox{an } M \times P \mbox{ design matrix} \\
M = \mbox{number of possible outcomes} \\
P = \mbox{number of predictors}

What sort of prior do we use for 𝛴? One possibility is an inverse Wishart prior with a block diagonal scale matrix, with the blocks for (the encodings of) categorical variables being effects covariance matrices. This does work, but there are known shortcomings of using Wishart priors for covariance matrices.
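For concreteness, here is a minimal numpy sketch of the block-diagonal scale matrix (the helper names are mine, not from any library): one effects-covariance block per categorical predictor, plus an identity block for any continuous predictors.

```python
# Sketch: assemble an inverse-Wishart scale matrix with effects-covariance
# blocks for the categorical predictors. Helper names are hypothetical.
import numpy as np

def effects_cov(K, sigma2=1.0):
    """(K-1) x (K-1) effects covariance matrix for a K-level predictor."""
    S = np.full((K - 1, K - 1), -sigma2 / (K - 1))
    np.fill_diagonal(S, sigma2)
    return S

def block_diag(*blocks):
    """Place square blocks along the diagonal of one larger matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

# e.g. a 3-level predictor, a 4-level predictor, and two continuous predictors
scale = block_diag(effects_cov(3), effects_cov(4), np.eye(2))
print(scale.shape)  # (7, 7)
```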

With Stan it is typical to decompose a covariance matrix into a vector of variances and a correlation matrix, and to use an LKJ correlation prior on the correlation matrix. I have tried, and failed, to find any way of using the LKJ correlation prior in a way that would ensure symmetric treatment of the levels of a categorical predictor.

Here is one (failed) attempt. To keep things simple, let’s just consider a single effects-coded categorical predictor variable:

V = \mbox{some effects covariance matrix} \\
L = \mbox{lower Cholesky factor of } V \\
R \sim \mathrm{lkj\_corr}(\eta) \\
\Sigma = L R L'

The above guarantees that

E[\Sigma] = V

but it does not treat the levels symmetrically. You always have

\Sigma_{1,1} = V_{1,1}

but the other diagonal elements of 𝛴 vary, with increasing variance the further down the diagonal you go.
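A quick numpy check (added for illustration; `random_corr` is an ad-hoc correlation-matrix generator, not an LKJ sampler) makes the asymmetry concrete: the first diagonal entry of \Sigma = L R L' is pinned to V_{1,1} for any correlation matrix R, while the rest of the diagonal moves with R.

```python
# Illustrate the asymmetry of Sigma = L R L': Sigma[0,0] is always V[0,0]
# because L is lower triangular and R[0,0] = 1, but the other diagonal
# entries depend on R.
import numpy as np

rng = np.random.default_rng(0)
K, sigma2 = 4, 1.0
V = np.full((K - 1, K - 1), -sigma2 / (K - 1))
np.fill_diagonal(V, sigma2)
L = np.linalg.cholesky(V)

def random_corr(d):
    """Quick-and-dirty random correlation matrix (not an LKJ draw)."""
    A = rng.standard_normal((d, d + 2))
    C = A @ A.T
    s = np.sqrt(np.diag(C))
    return C / np.outer(s, s)

for _ in range(3):
    R = random_corr(K - 1)
    Sigma = L @ R @ L.T
    assert np.isclose(Sigma[0, 0], V[0, 0])   # level 1's variance never moves
    print(np.round(np.diag(Sigma), 3))        # remaining diagonal entries vary
```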

And that’s where I am stuck. I expected that somewhere in the research literature someone would have addressed this issue already, but I have been unable to find anything. Does anyone have any good ideas to offer?

Are you interested primarily in figuring out a prior for your \Sigma, or would you be equally interested in other options for ensuring symmetric treatment of the levels of a categorical predictor?

Symmetric treatment is the most important issue for me, while still having a hierarchical model in which the correlations are learned.

I’m probably being dense here, but could you elaborate on:

  • how/why does a centered vector provide a solution to the problem in the non-hierarchical case?
  • why in your “another approach” do you appear to define K coefficients (rather than K - 1 coefficients) for a K-level categorical variable?

For both of these questions, what happens with a 3-level categorical variable, effects-coded with [1, 0, -1] for the three categories in one column of the design matrix, and [0, 1, -1] in the second column? Or do you mean something different when you say “effects coding”? Sorry if I’m just missing something obvious.

The centered vector is only part of the solution in the non-hierarchical case; it’s mainly there to give us identifiability. The covariance matrix for the non-hierarchical case is derived as follows.

The negative correlations guarantee that, if we define

\beta_K = -\sum_{i=1}^{K-1}\beta_i,

then

V[\beta_K] = V[\beta_j] \quad \forall\, 1 \leq j < K \\
C[\beta_K, \beta_j] = C[\beta_k, \beta_j] \quad \forall\, 1 \leq j, k < K,\ j \neq k.

On the second bullet point: No, 𝛴, V, and R are all (K-1) x (K-1) matrices. They can be extended to singular K x K matrices using the definition of beta[K].

I don’t know if this fits inside the solution space that you’re interested in or not (maybe it’s already obvious to you), but in general I would approach the problem of a symmetric prior on the levels of a categorical variable by dummy-coding the categorical variable, but replacing all of the 0’s in the dummy coding with -1’s. Now, as long as the prior over those coefficients is symmetric, the prior on the outcome will be symmetric as well. For example, the prior on the coefficients could be \mathcal{N}(0, \Sigma) with an LKJ prior on \Sigma.

I’m not sure what you mean. Are you talking about, e.g. for K=3, using these encodings?

\mathbf{x}_1 = \mbox{level 1 encoding} = (1, -1) \\ \mathbf{x}_2 = \mbox{level 2 encoding} = (-1, 1) \\ \mathbf{x}_3 = \mbox{level 3 encoding} = (-1, -1)

Those encodings don’t sum to zero, and lead to prior correlations of the effects

\alpha_k = \mathbf{\beta}' \mathbf{x}_k

that depend on the pair of effects chosen. However, you got me thinking in terms of encodings instead of covariance matrices. The symmetries I want are

\sum_{k=1}^K \alpha_k = 0 \\ V[\alpha_k] = \mbox{constant},\quad \forall 1\leq k\leq K \\ C[\alpha_j, \alpha_k] = \mbox{constant}, \quad\forall 1\leq j < k\leq K

which leads to the following requirements on the level encodings:

\sum_{k=1}^K \mathbf{x}_k = \mathbf{0} \\ || \mathbf{x}_k ||^2 = \mbox{constant}, \quad 1\leq k\leq K \\ \mathbf{x}'_j \mathbf{x}_k = \mbox{constant}, \quad 1\leq j < k \leq K

Geometrically, this means that we want to find K vectors in (K-1)-dimensional space such that

  1. their mean is the origin;
  2. they all have the same length; and
  3. the angle between any two of these vectors is the same.

Requirements (2) and (3) are just the definition of a regular (K-1)-simplex. The Wikipedia article on the simplex describes how to construct a regular (K-1)-simplex, for arbitrary K, centered at the origin with all vertices at a distance of 1 from the origin. It is similar to what you describe:

\mathbf{x}_k = \left(\frac{K}{K-1}\right)^{1/2} \mathbf{e}_k - \frac{K^{1/2}-1}{(K-1)^{3/2}} \mathbf{1}, \quad 1\leq k < K \\ \mathbf{x}_K = -\left(\frac{1}{K-1}\right)^{1/2} \mathbf{1}.

Any permutation of the levels corresponds to some combination of rotations and/or reflections of the encoding vectors, and the LKJ correlation prior is invariant under these transformations (they leave the determinant unchanged), so I think that solves my problem. Thanks!


An update for anyone following this thread: the solution I gave in the previous message still doesn’t quite work. The means and variances are invariant under permutation of the levels, but the fourth moments are not. Sigh.

I’ll just mention, since this is still open, that there might be a hack-y solution. In particular, if we take a dummy coding (such that each level is coded by exactly one 1 in a unique column and a bunch of 0s), then it’s trivial to get a prior, but the problem is that the design matrix is now rank-deficient assuming that it also includes an intercept. To eliminate a degree of freedom while preserving the symmetry of the prior, you could add a soft sum-to-zero constraint on the K coefficients: target += std_normal_lpdf(beta | 0, .001). Now we can use any reasonable prior for \Sigma, with the (enormous) caveat that the prior we place on \Sigma won’t actually be the prior on the realized covariance structure, which will emerge via the interaction of \Sigma and the sum-to-zero constraint. This is just a hack to short-circuit the need to actually work out by analysis what the realized prior on the actual covariances is. But I think it should work.