Sum to zero constraints and multi-level models - best practice/example code

Hello and happy Friday,

I’ve recently seen some discussion on sum-to-zero constraints and have been wondering if they might help the multi-level model that I’m working on (I’m testing it on simulated data and it sometimes fails to recover the ground-truth fixed effects, but it looks like it makes up for it in the random effects). My understanding is that this is one of the identifiability issues that are discussed here:

I’m currently experimenting with adding the soft constraints to my random effects:

  for (kk in 1:(4*K)) {
    sum(u[kk]) ~ normal(0, 0.001);
  }

I was wondering if anybody can recommend any good up-to-date information on this topic. I can see that it comes up now and then in the Stan forums, but it doesn’t seem to be discussed in the Stan multi-level tutorials that I’ve read.

I also see that there is a new(?) sum_to_zero_vector constrained parameter type. Should I be looking into this instead?

Sorry for all the questions… and I hope I haven’t missed an obvious thread where this is all answered.

Thanks

@mitzimorris just wrote a case study for spatial models—it compares and contrasts no constraints, soft centering with a prior as you are suggesting, our old sum-to-zero approach, and our new built-in type:

https://mitzimorris.github.io/sum_to_zero_vector/

I think the bottom line is that you just want to use our built-in type, but I may be misremembering what @mitzimorris told me—hopefully she’ll jump in and clarify if I’m mischaracterizing the results.


this is now an official Stan Case Study:


Super helpful.

Would you consider this approach to be an appropriate default whenever one is coding up a multi-level model? My impression is that these constraints, hard or soft, aren’t widely used (at least in my field, judging by the code people share).

yes!

this came up recently for an IRT model discussed on Andrew Gelman’s blog: https://statmodeling.stat.columbia.edu/2025/04/14/no-vehicles-in-the-park-a-multilevel-model-computing-saga/

Thanks.

I think the missing part of the puzzle is how to integrate this with a variance-covariance matrix.

For example, in my model I have K conditions (usually 2, sometimes 3), and 4 features. Therefore I end up with 4*K parameters (per person).

At present, I define matrix[4*K, L] z_u:


parameters {

  array[K] real b_a; // weights for class A compared to B  
  array[K] real b_s; // stick-switch rates 
  array[K] real<lower = 0> rho_delta; // distance tuning
  array[K] real rho_psi; // direction tuning

  // random effect variances: 
  // 4*K as we have four fixed effect parameters x K conditions
  vector<lower=0>[4*K] sigma_u;
  cholesky_factor_corr[4*K] L_u;
  // random effect matrix
  matrix[4*K,L] z_u; 
  
}

transformed parameters {

  // this transforms the random effects so that they have the correlation
  // structure specified by the Cholesky factor L_u above
  matrix[4*K, L] u = diag_pre_multiply(sigma_u, L_u) * z_u;

  // create empty arrays for everything
  array[K] vector[L] u_a, u_stick, u_delta, u_psi;
  // calculate
  for (kk in 1:K) {
    u_a[kk]     = to_vector(b_a[kk]       + u[4*(kk-1)+1]);
    u_stick[kk] = to_vector(b_s[kk]       + u[4*(kk-1)+2]);
    u_delta[kk] = to_vector(rho_delta[kk] + u[4*(kk-1)+3]);
    u_psi[kk]   = to_vector(rho_psi[kk]   + u[4*(kk-1)+4]);
  }
}

I suppose I could try defining array[L] sum_to_zero_vector[4*K] z_u; but I’m not sure how easy it will be to convert z_u to a matrix for the multiplication in transformed parameters {}.

Perhaps the best way forward is to simply remove L_u from the model and skip over estimating all those correlations.

I feel like I am missing something?

I couldn’t follow the terminology in the following without seeing the data block and likelihood.

  // random effect variances: 
  // 4*K as we have four fixed effect parameters x K conditions
  vector<lower=0>[4*K] sigma_u;
  cholesky_factor_corr[4*K] L_u;
  // random effect matrix
  matrix[4*K,L] z_u; 

What is a “fixed effect parameter”?

If you have a varying intercept and varying slope model, you can set up the intercepts to sum to zero to eliminate an additive non-identifiability with an intercept. Then you can just let the slopes vary normally. Just build up the vectors separately, then combine them to apply a multivariate prior.
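
For example, something along these lines (just a sketch; J and the names z_int / z_slope are placeholders, and I’m glossing over the marginal-scale adjustment for the sum-to-zero transform here):

parameters {
  sum_to_zero_vector[J] z_int;    // varying intercepts, constrained to sum to zero
  vector[J] z_slope;              // varying slopes, left unconstrained
  vector<lower=0>[2] sigma_u;     // random-effect scales
  cholesky_factor_corr[2] L_u;    // correlation between intercepts and slopes
}
transformed parameters {
  // stack the raw vectors as rows, then apply the correlated prior;
  // the first row still sums to zero because the Cholesky factor is lower triangular
  matrix[2, J] u = diag_pre_multiply(sigma_u, L_u) * append_row(z_int', z_slope');
}
model {
  z_int ~ std_normal();
  z_slope ~ std_normal();
  // ... priors on sigma_u and L_u, plus the likelihood
}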

Sorry for the imprecise language… the terminology around all this stuff has never been my strength.

I suppose I am asking how to write a model in which I account for the fact that the varying intercepts may be correlated with the varying slopes. Is there a way to concatenate two (or more) sum_to_zero vectors into a matrix?

For example, how would I integrate the sum_to_zero constraints with the following:

parameters {
  vector[2] beta;                   // fixed-effects parameters
  real<lower=0> sigma_e;            // residual std
  vector<lower=0>[2] sigma_u;       // random effects standard deviations

  // declare L_u to be the Cholesky factor of a 2x2 correlation matrix
  cholesky_factor_corr[2] L_u;

  matrix[2,J] z_u;                  // random effect matrix
}

transformed parameters {
  // this transforms the random effects so that they have the correlation
  // structure specified by the Cholesky factor L_u above
  matrix[2,J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u;

}

Thanks again and sorry if I am being slow and missing something!

I’m not entirely sure I understood correctly what you want; if not, I’m sorry for the lengthy reply…

The way I tend to think about the zero sum constraints in hierarchical models is using a decomposition of the normal distribution:

Let’s say we have an iid vector variable (most of the time, coefficients in a regression model for a categorical predictor) with a diagonal normal distribution:

y \sim N(0, \sigma^2I)

Then we can decompose this into the sample mean \bar{y} = \frac{1}{N} e^Ty, where e = (1, 1, 1...)^T and deviations from that sample mean \delta = y - \bar{y}e. We can show that \bar{y} \sim N(0, \frac{\sigma^2}{N}) and \delta \sim N(0, \sigma^2(I - \frac{1}{N}ee^T)). We can use sum_to_zero_vector to efficiently sample from \delta.

In a regression we can use this decomposition to absorb the variance of \bar{y} into the intercept, or some other variable. To keep it simple, let’s say we have

\text{intercept} \sim N(0, 10^2)\\ y \sim N(0, \sigma^2I)\\ \text{data}_i \sim N(\text{intercept} + y_{j_i}, \sigma_l^2)

(Maybe I shouldn’t have used y as a variable name, but I’m not changing it now…)

We can rewrite this as

\text{intercept} \sim N(0, 10^2)\\ \bar{y} \sim N(0, \frac{\sigma^2}{N})\\ \delta \sim N(0, \sigma^2(I - \frac{1}{N}ee^T))\\ \text{data}_i \sim N(\text{intercept} + \bar{y} + \delta_{j_i}, \sigma_l^2)

This shows the overparametrization very nicely: the likelihood only depends on \text{intercept} + \bar{y}. But since both are scalar normal distributions, we can combine them into one variable:

\text{intercept_plus_y_bar} \sim N(0, 10^2 + \frac{\sigma^2}{N})\\ \delta \sim N(0, \sigma^2(I - \frac{1}{N}ee^T))\\ \text{data}_i \sim N(\text{intercept_plus_y_bar} + \delta_{j_i}, \sigma_l^2).

Or in Stan, something like:

parameters {
    real intercept_plus_y_bar;
    sum_to_zero_vector[N] delta;
    real<lower=0> sigma;
}
transformed parameters {
    real intercept_sigma = 10;
    real intercept_plus_y_bar_sigma = sqrt(intercept_sigma^2 + sigma^2 / N);
}
model {
    intercept_plus_y_bar ~ normal(0, intercept_plus_y_bar_sigma);
    delta ~ normal(0, sigma);
    // Likelihood based on delta and intercept_plus_y_bar
    ...
}

Edit: I just noticed a problem in the generalization, which means the following isn’t correct…
In the general case \bar{y} and \delta are not independent, which means we can’t sample using this decomposition.

Edit 2: I think the decomposition

y = et + \eta

where

t \sim N(0, (e^T\Sigma^{-1}e)^{-1}) \\ \eta \sim N\left(0, \Sigma - \frac{\Sigma e e^T \Sigma}{e^T \Sigma^{-1} e}\right)

might work, but I’m not 100% sure yet…


With that view, we can generalize this to y with a multivariate normal distribution:

y \sim N(0, \Sigma)

Again, we define

\bar{y} = \frac{1}{N} e^Ty\\ \delta = y - \bar{y}e

and get

\bar{y} \sim N(0, \frac{1}{N^2}e^T\Sigma e)\\ \delta \sim N(0, (I - \frac{1}{N}ee^T)\Sigma(I-\frac{1}{N}ee^T))

Unfortunately I don’t know how we could use the sum_to_zero_vector to simplify things this time (but maybe there is a way?), so we’ll have to do it manually:

Given a Cholesky factorization of \Sigma=LL^T, we can sample \delta as follows (a rough code sketch follows the steps below):

  1. Construct a Householder vector:

    • u = \frac{e}{\sqrt{N}} + e_1 = (1+\sqrt{N}, 1, 1, ..., 1)/\sqrt{N}
    • \beta = \frac{2}{u^Tu} (normalization factor)
  2. Compute the Cholesky decomposition:

    • \Sigma = LL^T where L is lower triangular

Sampling Procedure:

  1. Generate z \sim N(0, I_{N-1}) (standard normal in \mathbb{R}^{N-1})

  2. Extend z to an N-vector with a leading zero:

    • \tilde{z} = (0, z_1, z_2, ..., z_{N-1})
  3. Apply the Householder transformation:

    • w = \tilde{z} - \beta \cdot u \cdot (u^T\tilde{z})
    • This effectively projects onto the subspace orthogonal to e
  4. Apply the Cholesky factor:

    • \delta = Lw
  5. To obtain the full sample y:

    • Generate \eta \sim N(0, 1) (standard normal scalar)
    • Compute \bar{y} = \eta \sqrt{\frac{1}{N^2}e^T \Sigma e}
    • Compute y = \bar{y}e + \delta
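
In Stan, a literal (and untested) transcription of these steps, modulo the caveats in the edits above, could go in generated quantities, roughly like the following, assuming Sigma and its Cholesky factor L are already defined (all names here are just illustrative):

generated quantities {
  vector[N] y_sim;
  {
    // step 1: Householder vector u = e / sqrt(N) + e_1 and beta = 2 / (u'u)
    vector[N] u_h = rep_vector(inv_sqrt(N), N);
    u_h[1] += 1;
    real beta = 2 / dot_self(u_h);
    // sampling steps 1-2: z ~ N(0, I_{N-1}), extended with a leading zero
    vector[N] z_tilde = append_row(0, to_vector(normal_rng(rep_vector(0, N - 1), 1)));
    // step 3: the Householder reflection maps z_tilde into the subspace orthogonal to e
    vector[N] w = z_tilde - beta * u_h * dot_product(u_h, z_tilde);
    // step 4: apply the Cholesky factor of Sigma
    vector[N] delta = L * w;
    // step 5: add back the sample mean, y_bar ~ N(0, e' Sigma e / N^2)
    real y_bar = normal_rng(0, sqrt(quad_form(Sigma, rep_vector(1.0 / N, N))));
    y_sim = y_bar + delta;
  }
}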

I hope I didn’t completely miss the mark with your question :-)

Yes. You provide the non-centered version of this, which is usually faster. Specifically:

parameters {
  sum_to_zero_vector[J] z_u1, z_u2;
  ...

transformed parameters {
  // append_col(z_u1, z_u2) is J x 2, so transpose it to get the 2 x J matrix u
  matrix[2, J] u = diag_pre_multiply(sigma_u, L_u) * append_col(z_u1, z_u2)';

The only tricky part is that you want to give the sum_to_zero_vector values a standard normal marginal distribution, which you can do with a slight adjustment to the standard unit scale.

transformed data {
  // use a name that doesn't clash with the sigma_u parameter in the model
  real<lower=0> z_scale = sqrt(J / (J - 1.0));  // the ".0" is important to cast to real
  ...
model {
  z_u1 ~ normal(0, z_scale);
  z_u2 ~ normal(0, z_scale);