Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain


You have to be careful about integer divisions like 1/N in Stan, as they’ll round down to zero.

The multiplication is going to be quadratic in transformed parameters the way it’s written, but then you can’t multiply a vector by a matrix, so maybe x was meant to be a row vector?

I’m not good enough at linear algebra to understand what it does, so thanks for the references.


I’m just not as familiar with that approach. Some quick googling suggested something like that might work[1]. Also, some suggestions about the SVD as well [2]. However I think the variance scaling may be off?

> mean(apply(K,1,var))
[1] 1.109827

It actually looks like once the scaling is correct, the QR approach may be the same as the cholesky approach (and possibly more stable), up to permutation and reflection since:

-(qr.Q(qr(M)) * sqrt(N/(N-1)))[c((N-1):1,N),c((N-1):1)]
is the same as

@bob_carpenter - thanks for the tips - I was just trying to translating the references into stan. I edited the code for future users who come to this thread. You are right that as it is written it is quadratic, but the cholesky factor, at least for the examples I looked at has a structure such that the first i-2 columns of row i are the same as the first i-2 columns of row i-1, so it could be done in \mathcal O(n)

@mitzimorris and @stemangiola, would you be able to share your sample dataset? I’d love to try these out against the benchmarks you have already shared.

[1] https://arxiv.org/pdf/math-ph/0609050.pdf
[2] https://stats.stackexchange.com/questions/159313/generating-samples-from-singular-gaussian-distribution


your program needs a declaration for N and x either in the data or transformed data block. what is x and what’s the connection between x and x_raw?


x_raw is parameter vector on the unconstrained space with N-1 degrees of freedom, and x is the quantity of interest with a hard constraint sum(x) = 0 and var(x[i]) = 1


That R list discussion got off to a vague start with

Anyone knows how to generate a vector of Normal distributed values (for example N(0,0.5)), but with a sum-to-zero constraint??

Did they want the marginals to be normal(0, 0.5)? Someone should have clarified. You wind up with correlation by definition.

When you write the following Stan program:

parameters {
  vector[3] mu_prefix;
transformed parameters {
  vector[4] mu = append_row(mu_prefix, -sum(mu_prefix));
model {
  mu ~ normal(0, 1);

you make the distribution on the constrained (sum to zero) space proportional to this product of four normal densities, but it’s really only over the first three variables (the fourth is defined by the first three). The result isn’t marginally standard normal, but a bit more constrained:

              mean se_mean   sd
mu[1]         0.03    0.02 0.86
mu[2]         0.01    0.02 0.89
mu[3]        -0.03    0.02 0.87
mu[4]         0.00    0.01 0.86

If you just put the distribution on the first three components, then they’re marginally standard normal and the fourth has a much wider distribution:

              mean se_mean   sd
mu[1]        -0.02    0.02 1.00
mu[2]         0.00    0.02 0.98
mu[3]         0.00    0.02 1.01
mu[4]         0.02    0.03 1.75

To get everything to be marginally standard normal, you’d have to work out what the scale needs to be in the sampling statement to get the marginals to work out. Real statisticians know how to solve these things, but I’d get stuck after laying out the integrals.

I went through this kind of reasoning before with ordering, where you get the same thing by taking an ordered variable and giving it a normal distribution as you do by taking independent normals and sorting.


I was mostly referencing this post from that thread, which described the singular covariance matrix that fulfills that constraint.

Following @andre.pfeuffer’s suggestion, which is much more stable than mine, here is a worked example. I also stand corrected about the scaling on the variances, it depends if you care about the marginal variance or the variance of the parameters.

So in your example, where the last parameter is just the negative sum of the previous parameters, it does not have the right variance. But if you pick the right transformation, you can have N-1 free parameters, put the prior on those, and end up with N parameters with the right distribution.

       mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
x[1]   0.00    0.02 1.01 -2.02 -0.68  0.00 0.68  1.99  4000    1
x[2]  -0.02    0.02 1.00 -1.96 -0.70  0.00 0.66  1.93  4000    1
x[3]   0.00    0.02 0.98 -1.90 -0.66 -0.01 0.65  1.91  4000    1
x[4]   0.00    0.02 0.99 -1.97 -0.66  0.00 0.66  1.96  4000    1
x[5]   0.01    0.02 0.99 -1.91 -0.65  0.02 0.66  1.96  4000    1
x[6]   0.00    0.02 1.01 -1.99 -0.66 -0.01 0.66  2.03  4000    1
x[7]  -0.01    0.02 1.01 -2.01 -0.69 -0.02 0.67  1.98  4000    1
x[8]   0.02    0.02 1.00 -1.95 -0.63  0.05 0.70  1.92  4000    1
x[9]   0.00    0.02 1.00 -1.96 -0.66 -0.01 0.66  1.94  4000    1
sum_x  0.00    0.00 0.00  0.00  0.00  0.00 0.00  0.00  4000    1
var_x  1.12    0.01 0.56  0.32  0.70  1.03 1.42  2.44  1987    1
transformed data{
  int N = 9;
  matrix [N,N] A = diag_matrix(rep_vector(1,N));
  matrix [N,N-1] A_qr;
  for(i in 1:N-1)
    A[N,i] = -1;
  A[N,N] = 0;
  A_qr = qr_Q(A)[,1:(N-1)];

parameters {
  vector [N-1] x_raw; 
transformed parameters{
   vector [N] x =  A_qr * x_raw;
model {
  x_raw ~ normal(0,1/sqrt(1-inv(N)));

generated quantities{
  real sum_x = sum(x);
  real var_x = variance(x);


Cool. I didn’t know covariance matrices were flexible enough for that kind of constraint, but I guess I should learn how to think about matrices as bundles of equations.

Your implementation does give me what I need, the scaling. When I apply that to the constrained implementation in Stan,

transformed data {
  int<lower = 2> N = 4;
parameters {
  vector[N - 1] mu_prefix;
transformed parameters {
  vector[N] mu = append_row(mu_prefix, -sum(mu_prefix));
model {
  mu ~ normal(0, inv(sqrt(1 - inv(N))));

I get the correct marginals:

             mean se_mean   sd
mu[1]        -0.04    0.02 1.03
mu[2]         0.02    0.02 1.02
mu[3]         0.04    0.02 1.01
mu[4]        -0.02    0.02 1.01

With more iterations the mean and sd are correct to more decimal places, as expected.

The direct constrained version might be more efficient, because it doesn’t require the matrix multiply A_qr * x_raw, which is a quadratic operation with lots of autodiff to do for the N x N Jacobian. On the other hand, it’s less of a normal sampling and reconstruct approach, which should mix very well.


That matrix multiply is costly, but looking at the QR matrix, I’m fairly certain that it can be done in O(n) operations. I haven’t worked out the math formally, so the following is just based on empirical observation of what works. But you can swap the transformed data and transformed params blocks from above with:

transformed data{
  int N = 9;
  vector [N] Q_r_diag;
  vector [N] Q_r_last;
  for(i in 1:N){
    Q_r_diag[i] = -sqrt((N-i)/(N-i+1.0));
    Q_r_last[i] = inv_sqrt((N-i) * (N-i+1));
transformed parameters{
   vector [N] x;
   real x_aux = 0;
   for(i in 1:N-1){
     x[i] = x_aux + x_raw[i] * Q_r_diag[i];
     x_aux = x_aux + x_raw[i] * Q_r_last[i];
   x[N] = x_aux;

And get the same result.

           mean se_mean   sd
    x[1]   0.02    0.02 1.01
    x[2]  -0.01    0.02 0.97
    x[3]  -0.01    0.02 0.98
    x[4]  -0.01    0.02 1.02
    x[5]   0.00    0.02 1.02
    x[6]   0.01    0.02 1.01
    x[7]  -0.01    0.02 0.98
    x[8]   0.00    0.02 1.01
    x[9]   0.01    0.02 1.01
    sum_x  0.00    0.00 0.00
    var_x  1.13    0.01 0.56

But I don’t have any good intuition if this or the fixing the last element approach would work better on actual problems…


Wow. That’s super cool. My linear algebra isn’t up to much more than appreciating the results in a limited way, but I can seriously get behind this computational approach, as it’s clearly only \mathcal{O}(N). The constant overhead’s a bit worse than the sum-to-zero constraint, but I bet the geometry is a lot better.

I think you can vectorize that loop by noticing that the nth value of x_aux is just cumulative_sum(x_raw .* Q_r_last)[n],

transformed parameters {
  vector[N] x;
  x = x_raw .* Q_r_diag;
  x[N] = 0;
  x[2:N] += cumulative_sum(x_raw[1:N-1] .* Q_r_last[1:N-1])

It’s a bit obscure and neither cumulative-sum nor elementwise product have efficient built-in analytic derivatives.

  vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    return Q_r;

  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    x[N] = x_aux;
    return x;

To whom to use this in the future.
No response required.


Trying to use this right know :-) - Thanx for the effort and sharing. If I understand it correctly, the code should be used as:

transformed data {
  vector[2*N] Q_r = Q_sum_to_zero_QR(N);
  real x_raw_sigma = inv_sqrt(1 - inv(N));

parameters {
   vector[N - 1]  x_raw;

transformed parameters {
  vector[N] x = sum_to_zero_QR(x_raw, Q_r);

model {
  x_raw ~ normal(0, x_raw_sigma);

where the non-obvious fact is that x_{raw} \sim N \left( 0, \frac{1}{\sqrt{1 - \frac{1}{N}}} \right)


Yes, you are right about the usage of the functions.
Thanks, very good catch about the sd correction of x_raw. Should the correction moved into sum_to_zero_QR?
To draw x_raw ~ normal(0, 1)?
BTW, the main credits should go to @aaronjg.


Hi all. I have implemented Bob’s direct constraint method in a model that I’m hoping to have a quick turnaround on writing into a manuscript. I’m wondering now how to cite it; i.e. is the scale correction published somewhere, or is there a more formal way to credit Aaron and Bob beyond an acknowledgement? I just want to make sure credit is given where it’s due!

I’m a new member on the Stan forums, though I’ve been silently lurking and reading them for quite a while now. So, thank you all!


Definitely not me—I still do linear algebra on my finters. I certainly haven’t published anything. I’m pretty sure @bgoodri is behind the QR decompositions for conditioning in both the Stan manual and in rstanarm.


I don’t think there is a more formal way to cite this? Maybe just cite this page if you can cite URLs, or as “personal communication” if you need to cite it. I wouldn’t be offended if you use this uncited though. I haven’t done any testing on this implementation beyond what’s in the thread and would be happy to see it used in a more formal way!


Ok, thanks. I’m not using the QR version, so it’s really just the scale adjustment and the general idea of putting the same prior on the summed parameters as on the raw ones that I felt I should have to justify. (I would love to try the QR stuff, but as a complete beginner and with a ticking clock on my dissertation, a model that appears to ‘work’ with raw parameters that are easier to interpret is a compromise that I’ve decided to make for the moment).

I do see now that you’ve put these recommendations into the manual v2.18.0, so I’m sure I could cite that, too.

In case you’re curious, my model is just a logistic regression with some slight complications that incorporate phylogenetic covariance while allowing for various on-the-fly transformations of the phylogenetic branch lengths. The hard sum-to-zero constraint was producing clearly nonsensical estimates for the ‘reference level’ when I only had the prior on the raw parameters.

It’s odd to me, though, if this is somewhat of a novel problem - how did linear modelers using software like BUGS, JAGS, and MCMCglmm solve this? Always choosing between ‘treatment contrasts’ or soft constraints?


In that case, I take no responsibility :)

This may suggest that there is something off on your model implementation. I’ve normally noticed hard vs. soft sum to zero constraints affect model run time, but do not radically alter the results if both are converging properly.

I’m not sure in this particular case, but there are many models that are actually pathological but can appear to work fine with Gibbs sampling because they are never actually run until convergence. With Stan, the pathologies can become apparent much more quickly.


Exactly—this was one of the main motivations in building Stan—to provide more flexibility and be able to fit a wider class of models (though it turns out there are some they can fit that we can’t, like missing count data).


Oh, I forgot to add that Yuling has been working on this. See this Discourse topic on soft contraints. I still haven’t digested it all yet, so I’m afraid I can’t summarize.


But you posted the scale correction 1/sqrt(1-inv(N))), right?

Indeed, I had coded the model without adding the negative sum to the vector that had the prior - so only the N-1 free parameters were constrained with the proper scale (as pointed out by Bob). With a vector of length 100+, I believe such a construction made it so that the effective prior on the ‘reference level’ had a scale >100 times greater than the prior on all other levels? (Pondering your correction factor helped me logic that out). However, that was the method suggested in the manual (2.17.0) section 9.7, ‘K-1 degrees of freedom’.

Although manual 2.18.0 does additionally include your method for symmetrical and standard-scaled marginals, I wonder if it’d be good to also explicitly state that the ‘different posterior mode’ of the other method (placing the prior on beta_raw instead of beta) isn’t necessarily subtle and can be /extremely/ asymmetrical, especially when the vector is long. Pardon me if this isn’t really the place to make those suggestions (not sure who actually does the manual…).