Test: soft vs. hard sum-to-zero constraint + choosing the right prior for the soft constraint

I realize this isn’t really establishing much here - we’re comparing different models and different data

maybe @anon75146577 has some insight on when a hard sum-to-zero constraint is just fine, when a soft constraint would work better, and if so, at what scale?

Thanks for running all this. @betanalpha may also have some insight—he was also recommending the soft centering.

1 Like

It’s all going to come down to the geometry. Hard constraints can induce awkward geometries if the data inform the unconstrained parameters in a nonlinear way. At the same time, a soft constraint might allow too much slack, letting the posterior stretch into unwelcome neighborhoods. This is particularly troublesome in high-dimensional spaces, where a soft constraint has to fight against concentration of measure.

1 Like

could you elaborate on this? examples? rules of thumb? anything that would help users like me who don’t have a very good sense of geometry.

1 Like

It’s hard to give specifics because these problems can manifest in so many different ways. Incidentally, this is why the diagnostics are so tricky – it’s easy to spot problems, but because the problems could have been caused by a myriad of pathologies, there’s no way to isolate the cause.

Whenever I’m worried about the geometry of a model I’m not familiar with, I take a thorough look at the unconstrained space. Running with diagnostic_file=<file_name> will generate <file_name> in the local directory containing comma-separated values of the unconstrained parameters, their momenta, and their gradients for each sample. Pairs plots of the unconstrained parameters give you a sense of what geometry the sampler is struggling with, and correlations with the gradients can help you identify regions that are hard to explore and may be causing divergences.
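As a concrete example of that workflow, here is a sketch in R; the file name and parameter columns below are placeholders, so check the header of the diagnostic file your interface actually writes:

# Read a CmdStan-style diagnostic CSV (comment lines start with '#') and
# look at pairs plots of a few unconstrained parameters.
draws <- read.csv("diag.csv", comment.char = "#")   # hypothetical file name
pars  <- c("x_raw.1", "x_raw.2", "x_raw.3")          # hypothetical column names
pairs(draws[, pars], pch = ".")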

3 Likes

@stemangiola @mitzimorris

There might be another way to do this, based on Fraser (CJM, 1951).
Something like:

transformed data{
  matrix [N,N] A = rep_matrix(-inv(N-1), N, N);  // Fraser's covariance: -1/(N-1) off the diagonal
  for(i in 1:N)
    A[i,i] = 1;
  matrix [N,N] A_sqrt = cholesky_decompose(A);
//or
//matrix [N,N] A_sqrt = eigenvectors_sym(A) * diag_matrix(sqrt(eigenvalues_sym(A))) * inverse(eigenvectors_sym(A));

}
parameters {
  vector [N-1] x_raw;
}
transformed parameters{
   x = A_sqrt * append_row(x_raw, 0); // the structure of the Cholesky factor means this could actually be done in O(n) time
}
model {
  x_raw ~ normal(0, 1/sqrt(1-inv(N)));
}
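For intuition, here is a quick R sketch (mine, not from the paper) of what the matrix A above is: ones on the diagonal and -1/(N-1) off the diagonal is just the centering projection I - 11'/N rescaled by N/(N-1), i.e. a singular covariance that forces sum(x) = 0 while keeping the marginal variances equal.

# Check that Fraser's A is a scaled centering projection (small N for illustration).
N <- 5
A <- matrix(-1 / (N - 1), N, N)
diag(A) <- 1
P <- diag(N) - matrix(1 / N, N, N)       # projection onto the sum-to-zero subspace
max(abs(A - N / (N - 1) * P))            # ~ 0 up to floating point
eigen(A, symmetric = TRUE)$values        # N/(N-1) repeated N-1 times, then 0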

Some more references:
https://cms.math.ca/openaccess/cjm/v3/cjm1951v03.0363-0366.pdf
http://r.789695.n4.nabble.com/A-vector-of-normal-distributed-values-with-a-sum-to-zero-constraint-td4687954.html

1 Like

I didn’t understand the Fraser approach on a quick read.
What speaks against using a QR approach?

N <- 10
M <- diag(N)
M <- M - M[, N]   # identity with the last row replaced by (-1, ..., -1, 0)
# draw iid normal(0, 1/sqrt(1 - 1/N)) and rotate into the sum-to-zero subspace
K <- matrix(rnorm((N - 1) * 100000, 0, 1 / sqrt(1 - 1 / N)), 100000, N - 1) %*% t(qr.Q(qr(M))[, 1:(N - 1)])
var(K)       # covariance of the N components
sum(K[1, ])  # ~ 0

You have to be careful about integer divisions like 1/N in Stan, as they’ll round down to zero.

The multiplication is going to be quadratic in transformed parameters the way it’s written, but then you can’t multiply a vector by a matrix, so maybe x was meant to be a row vector?

I’m not good enough at linear algebra to understand what it does, so thanks for the references.

I’m just not as familiar with that approach. Some quick googling suggested something like that might work [1], and there were suggestions about using the SVD as well [2]. However, I think the variance scaling may be off:

> mean(apply(K,1,var))
[1] 1.109827

It actually looks like, once the scaling is correct, the QR approach may be the same as the Cholesky approach (and possibly more stable), up to permutation and reflection, since:

-(qr.Q(qr(M)) * sqrt(N/(N-1)))[c((N-1):1,N),c((N-1):1)]
is the same as
t(chol(A))[,1:(N-1)]

@bob_carpenter - thanks for the tips - I was just trying to translate the references into Stan. I edited the code for future users who come to this thread. You are right that, as written, it is quadratic, but the Cholesky factor, at least for the examples I looked at, has a structure such that the first i-2 columns of row i are the same as the first i-2 columns of row i-1, so it could be done in \mathcal O(n) time.

@mitzimorris and @stemangiola, would you be able to share your sample dataset? I’d love to try these out against the benchmarks you have already shared.

[1] https://arxiv.org/pdf/math-ph/0609050.pdf
[2] https://stats.stackexchange.com/questions/159313/generating-samples-from-singular-gaussian-distribution

Your program needs a declaration for N and x either in the data or transformed data block. What is x, and what’s the connection between x and x_raw?

x_raw is a parameter vector on the unconstrained space with N-1 degrees of freedom, and x is the quantity of interest, with a hard constraint sum(x) = 0 and var(x[i]) = 1.

That R list discussion got off to a vague start with

Anyone knows how to generate a vector of Normal distributed values (for example N(0,0.5)), but with a sum-to-zero constraint??

Did they want the marginals to be normal(0, 0.5)? Someone should have clarified. You wind up with correlation by definition.

When you write the following Stan program:

parameters {
  vector[3] mu_prefix;
}
transformed parameters {
  vector[4] mu = append_row(mu_prefix, -sum(mu_prefix));
}
model {
  mu ~ normal(0, 1);
}

you make the distribution on the constrained (sum to zero) space proportional to this product of four normal densities, but it’s really only over the first three variables (the fourth is defined by the first three). The result isn’t marginally standard normal, but a bit more constrained:

              mean se_mean   sd
mu[1]         0.03    0.02 0.86
mu[2]         0.01    0.02 0.89
mu[3]        -0.03    0.02 0.87
mu[4]         0.00    0.01 0.86

If you just put the distribution on the first three components, then they’re marginally standard normal and the fourth has a much wider distribution:

              mean se_mean   sd
mu[1]        -0.02    0.02 1.00
mu[2]         0.00    0.02 0.98
mu[3]         0.00    0.02 1.01
mu[4]         0.02    0.03 1.75

To get everything to be marginally standard normal, you’d have to work out what the scale in the sampling statement needs to be. Real statisticians know how to solve these things, but I’d get stuck after laying out the integrals.
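(For what it’s worth, here is a sketch of how the scale can be worked out, using the Sherman–Morrison identity rather than brute-force integration; it is consistent with the empirical results further down the thread.)

With \mu_N = -\sum_{i=1}^{N-1} \mu_i, putting \textrm{normal}(0, \sigma) on all N components gives, up to a constant,

\log p(\mu_{1:N-1}) = -\frac{1}{2\sigma^2}\left( \sum_{i=1}^{N-1} \mu_i^2 + \Big( \sum_{i=1}^{N-1} \mu_i \Big)^2 \right),

i.e. a multivariate normal with precision \frac{1}{\sigma^2}\left(I + \mathbf{1}\mathbf{1}^\top\right) and, by Sherman–Morrison, covariance \sigma^2\left(I - \frac{\mathbf{1}\mathbf{1}^\top}{N}\right). Every marginal variance, including that of \mu_N, is therefore \sigma^2 (1 - 1/N), so \sigma = 1/\sqrt{1 - 1/N} makes all N marginals standard normal.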

I went through this kind of reasoning before with ordering, where you get the same thing by taking an ordered variable and giving it a normal distribution as you do by taking independent normals and sorting.

I was mostly referencing this post from that thread, which described the singular covariance matrix that fulfills that constraint.

Following @andre.pfeuffer’s suggestion, which is much more stable than mine, here is a worked example. I also stand corrected about the scaling of the variances; it depends on whether you care about the marginal variance or the variance of the parameters.

So in your example, where the last parameter is just the negative sum of the previous parameters, it does not have the right variance. But if you pick the right transformation, you can have N-1 free parameters, put the prior on those, and end up with N parameters with the right distribution.

       mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
x[1]   0.00    0.02 1.01 -2.02 -0.68  0.00 0.68  1.99  4000    1
x[2]  -0.02    0.02 1.00 -1.96 -0.70  0.00 0.66  1.93  4000    1
x[3]   0.00    0.02 0.98 -1.90 -0.66 -0.01 0.65  1.91  4000    1
x[4]   0.00    0.02 0.99 -1.97 -0.66  0.00 0.66  1.96  4000    1
x[5]   0.01    0.02 0.99 -1.91 -0.65  0.02 0.66  1.96  4000    1
x[6]   0.00    0.02 1.01 -1.99 -0.66 -0.01 0.66  2.03  4000    1
x[7]  -0.01    0.02 1.01 -2.01 -0.69 -0.02 0.67  1.98  4000    1
x[8]   0.02    0.02 1.00 -1.95 -0.63  0.05 0.70  1.92  4000    1
x[9]   0.00    0.02 1.00 -1.96 -0.66 -0.01 0.66  1.94  4000    1
sum_x  0.00    0.00 0.00  0.00  0.00  0.00 0.00  0.00  4000    1
var_x  1.12    0.01 0.56  0.32  0.70  1.03 1.42  2.44  1987    1
transformed data{
  int N = 9;
  matrix [N,N] A = diag_matrix(rep_vector(1,N));
  matrix [N,N-1] A_qr;
  for(i in 1:N-1)
    A[N,i] = -1;
  A[N,N] = 0;
  A_qr = qr_Q(A)[,1:(N-1)];

}
parameters {
  vector [N-1] x_raw; 
}
transformed parameters{
   vector [N] x =  A_qr * x_raw;
}
model {
  x_raw ~ normal(0,1/sqrt(1-inv(N)));
}

generated quantities{
  real sum_x = sum(x);
  real var_x = variance(x);
}

5 Likes

Cool. I didn’t know covariance matrices were flexible enough for that kind of constraint, but I guess I should learn how to think about matrices as bundles of equations.

Your implementation does give me what I need, the scaling. When I apply that to the constrained implementation in Stan,

transformed data {
  int<lower = 2> N = 4;
}
parameters {
  vector[N - 1] mu_prefix;
}
transformed parameters {
  vector[N] mu = append_row(mu_prefix, -sum(mu_prefix));
}
model {
  mu ~ normal(0, inv(sqrt(1 - inv(N))));
}

I get the correct marginals:

             mean se_mean   sd
mu[1]        -0.04    0.02 1.03
mu[2]         0.02    0.02 1.02
mu[3]         0.04    0.02 1.01
mu[4]        -0.02    0.02 1.01

With more iterations the mean and sd are correct to more decimal places, as expected.

The direct constrained version might be more efficient, because it doesn’t require the matrix multiply A_qr * x_raw, which is a quadratic operation with lots of autodiff to do for the N x N Jacobian. On the other hand, it’s less of a sample-normals-and-reconstruct approach, which is the kind of approach that should mix very well.

2 Likes

That matrix multiply is costly, but looking at the QR matrix, I’m fairly certain that it can be done in O(n) operations. I haven’t worked out the math formally, so the following is just based on empirical observation of what works. But you can swap the transformed data and transformed params blocks from above with:

transformed data{
  int N = 9;
  vector [N] Q_r_diag;
  vector [N] Q_r_last;
  for(i in 1:N){
    Q_r_diag[i] = -sqrt((N-i)/(N-i+1.0));
    Q_r_last[i] = inv_sqrt((N-i) * (N-i+1));
  }
  
}
transformed parameters{
   vector [N] x;
   real x_aux = 0;
     
   for(i in 1:N-1){
     x[i] = x_aux + x_raw[i] * Q_r_diag[i];
     x_aux = x_aux + x_raw[i] * Q_r_last[i];
   }
   x[N] = x_aux;
}

And get the same result.

           mean se_mean   sd
    x[1]   0.02    0.02 1.01
    x[2]  -0.01    0.02 0.97
    x[3]  -0.01    0.02 0.98
    x[4]  -0.01    0.02 1.02
    x[5]   0.00    0.02 1.02
    x[6]   0.01    0.02 1.01
    x[7]  -0.01    0.02 0.98
    x[8]   0.00    0.02 1.01
    x[9]   0.01    0.02 1.01
    sum_x  0.00    0.00 0.00
    var_x  1.13    0.01 0.56
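Since the closed form above is described as empirical, here is a small R check (just a sketch; it rebuilds the implied N x (N-1) matrix from Q_r_diag and Q_r_last rather than calling qr) that its columns are orthonormal and each sums to zero, which is all that is needed for sum(x) = 0 and unit marginal variances after the 1/sqrt(1 - 1/N) scaling.

# Column j has Q_r_diag[j] on the diagonal and Q_r_last[j] below it
# (zeros above), matching the Stan loop above.
N <- 9
i <- 1:(N - 1)
Q_r_diag <- -sqrt((N - i) / (N - i + 1))
Q_r_last <- 1 / sqrt((N - i) * (N - i + 1))
Q <- matrix(0, N, N - 1)
for (j in 1:(N - 1)) {
  Q[j, j] <- Q_r_diag[j]
  Q[(j + 1):N, j] <- Q_r_last[j]
}
max(abs(crossprod(Q) - diag(N - 1)))  # ~ 0: columns are orthonormal
max(abs(colSums(Q)))                  # ~ 0: every column sums to zero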

But I don’t have any good intuition about whether this or the fix-the-last-element approach would work better on actual problems…

2 Likes

Wow. That’s super cool. My linear algebra isn’t up to much more than appreciating the results in a limited way, but I can seriously get behind this computational approach, as it’s clearly only \mathcal{O}(N). The constant overhead’s a bit worse than the direct sum-to-zero constraint, but I bet the geometry is a lot better.

I think you can vectorize that loop by noticing that the nth value of x_aux is just cumulative_sum(x_raw .* Q_r_last[1:N - 1])[n]:

transformed parameters {
  vector[N] x;
  x[1:N - 1] = x_raw .* Q_r_diag[1:N - 1];
  x[N] = 0;
  x[2:N] += cumulative_sum(x_raw .* Q_r_last[1:N - 1]);
}

It’s a bit obscure, and neither cumulative_sum nor the elementwise product has efficient built-in analytic derivatives.

1 Like
  vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    return Q_r;
  }

  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }

For whoever wants to use this in the future.
No response required.

4 Likes

Trying to use this right now :-) - thanks for the effort and for sharing. If I understand it correctly, the code should be used as:

transformed data {
  vector[2*N] Q_r = Q_sum_to_zero_QR(N);
  real x_raw_sigma = inv_sqrt(1 - inv(N));
}

parameters {
   vector[N - 1]  x_raw;
}

transformed parameters {
  vector[N] x = sum_to_zero_QR(x_raw, Q_r);
}

model {
  x_raw ~ normal(0, x_raw_sigma);
}

where the non-obvious fact is that x_{raw} \sim N \left( 0, \frac{1}{\sqrt{1 - \frac{1}{N}}} \right)
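A pure-R way to convince yourself of that scale, essentially the same check @andre.pfeuffer posted earlier with the scale made explicit (just a sketch; M is the same matrix as A in the Stan transformed data block above):

# Map iid normal(0, 1/sqrt(1 - 1/N)) draws through an orthonormal basis of the
# sum-to-zero subspace and check that the marginal sds come out near 1.
N <- 6
M <- diag(N); M[N, ] <- -1; M[N, N] <- 0   # identity with last row (-1, ..., -1, 0)
Q <- qr.Q(qr(M))[, 1:(N - 1)]              # orthonormal basis of the sum-to-zero subspace
s <- 1 / sqrt(1 - 1 / N)
x_raw <- matrix(rnorm(1e5 * (N - 1), 0, s), ncol = N - 1)
x <- x_raw %*% t(Q)
round(apply(x, 2, sd), 2)                  # all ~ 1
max(abs(rowSums(x)))                       # ~ 0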

3 Likes

Yes, you are right about the usage of the functions.
Thanks, very good catch about the sd correction of x_raw. Should the correction be moved into sum_to_zero_QR, so that one would draw x_raw ~ normal(0, 1)?
BTW, the main credit should go to @aaronjg.

1 Like

Hi all. I have implemented Bob’s direct constraint method in a model that I’m hoping to have a quick turnaround on writing into a manuscript. I’m wondering now how to cite it; i.e. is the scale correction published somewhere, or is there a more formal way to credit Aaron and Bob beyond an acknowledgement? I just want to make sure credit is given where it’s due!

I’m a new member on the Stan forums, though I’ve been silently lurking and reading them for quite a while now. So, thank you all!

1 Like