Test: Soft vs Hard sum-to-zero constrain + choosing the right prior for soft constrain

#21

You have to be careful about integer divisions like 1/N in Stan, as they’ll round down to zero.

The multiplication is going to be quadratic in transformed parameters the way it’s written, but then you can’t multiply a vector by a matrix, so maybe x was meant to be a row vector?

I’m not good enough at linear algebra to understand what it does, so thanks for the references.

#22

I’m just not as familiar with that approach. Some quick googling suggested something like that might work[1]. Also, some suggestions about the SVD as well [2]. However I think the variance scaling may be off?

> mean(apply(K,1,var))
[1] 1.109827


It actually looks like once the scaling is correct, the QR approach may be the same as the cholesky approach (and possibly more stable), up to permutation and reflection since:

-(qr.Q(qr(M)) * sqrt(N/(N-1)))[c((N-1):1,N),c((N-1):1)]
is the same as
t(chol(A))[,1:(N-1)]

@bob_carpenter - thanks for the tips - I was just trying to translating the references into stan. I edited the code for future users who come to this thread. You are right that as it is written it is quadratic, but the cholesky factor, at least for the examples I looked at has a structure such that the first i-2 columns of row i are the same as the first i-2 columns of row i-1, so it could be done in \mathcal O(n)

@mitzimorris and @stemangiola, would you be able to share your sample dataset? I’d love to try these out against the benchmarks you have already shared.

#23

your program needs a declaration for N and x either in the data or transformed data block. what is x and what’s the connection between x and x_raw?

#24

x_raw is parameter vector on the unconstrained space with N-1 degrees of freedom, and x is the quantity of interest with a hard constraint sum(x) = 0 and var(x[i]) = 1

#25

Anyone knows how to generate a vector of Normal distributed values (for example N(0,0.5)), but with a sum-to-zero constraint??

Did they want the marginals to be normal(0, 0.5)? Someone should have clarified. You wind up with correlation by definition.

When you write the following Stan program:

parameters {
vector[3] mu_prefix;
}
transformed parameters {
vector[4] mu = append_row(mu_prefix, -sum(mu_prefix));
}
model {
mu ~ normal(0, 1);
}


you make the distribution on the constrained (sum to zero) space proportional to this product of four normal densities, but it’s really only over the first three variables (the fourth is defined by the first three). The result isn’t marginally standard normal, but a bit more constrained:

              mean se_mean   sd
mu[1]         0.03    0.02 0.86
mu[2]         0.01    0.02 0.89
mu[3]        -0.03    0.02 0.87
mu[4]         0.00    0.01 0.86


If you just put the distribution on the first three components, then they’re marginally standard normal and the fourth has a much wider distribution:

              mean se_mean   sd
mu[1]        -0.02    0.02 1.00
mu[2]         0.00    0.02 0.98
mu[3]         0.00    0.02 1.01
mu[4]         0.02    0.03 1.75


To get everything to be marginally standard normal, you’d have to work out what the scale needs to be in the sampling statement to get the marginals to work out. Real statisticians know how to solve these things, but I’d get stuck after laying out the integrals.

I went through this kind of reasoning before with ordering, where you get the same thing by taking an ordered variable and giving it a normal distribution as you do by taking independent normals and sorting.

#26

I was mostly referencing this post from that thread, which described the singular covariance matrix that fulfills that constraint.

Following @andre.pfeuffer’s suggestion, which is much more stable than mine, here is a worked example. I also stand corrected about the scaling on the variances, it depends if you care about the marginal variance or the variance of the parameters.

So in your example, where the last parameter is just the negative sum of the previous parameters, it does not have the right variance. But if you pick the right transformation, you can have N-1 free parameters, put the prior on those, and end up with N parameters with the right distribution.

       mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
x[1]   0.00    0.02 1.01 -2.02 -0.68  0.00 0.68  1.99  4000    1
x[2]  -0.02    0.02 1.00 -1.96 -0.70  0.00 0.66  1.93  4000    1
x[3]   0.00    0.02 0.98 -1.90 -0.66 -0.01 0.65  1.91  4000    1
x[4]   0.00    0.02 0.99 -1.97 -0.66  0.00 0.66  1.96  4000    1
x[5]   0.01    0.02 0.99 -1.91 -0.65  0.02 0.66  1.96  4000    1
x[6]   0.00    0.02 1.01 -1.99 -0.66 -0.01 0.66  2.03  4000    1
x[7]  -0.01    0.02 1.01 -2.01 -0.69 -0.02 0.67  1.98  4000    1
x[8]   0.02    0.02 1.00 -1.95 -0.63  0.05 0.70  1.92  4000    1
x[9]   0.00    0.02 1.00 -1.96 -0.66 -0.01 0.66  1.94  4000    1
sum_x  0.00    0.00 0.00  0.00  0.00  0.00 0.00  0.00  4000    1
var_x  1.12    0.01 0.56  0.32  0.70  1.03 1.42  2.44  1987    1

transformed data{
int N = 9;
matrix [N,N] A = diag_matrix(rep_vector(1,N));
matrix [N,N-1] A_qr;
for(i in 1:N-1)
A[N,i] = -1;
A[N,N] = 0;
A_qr = qr_Q(A)[,1:(N-1)];

}
parameters {
vector [N-1] x_raw;
}
transformed parameters{
vector [N] x =  A_qr * x_raw;
}
model {
x_raw ~ normal(0,1/sqrt(1-inv(N)));
}

generated quantities{
real sum_x = sum(x);
real var_x = variance(x);
}



#27

Cool. I didn’t know covariance matrices were flexible enough for that kind of constraint, but I guess I should learn how to think about matrices as bundles of equations.

Your implementation does give me what I need, the scaling. When I apply that to the constrained implementation in Stan,

transformed data {
int<lower = 2> N = 4;
}
parameters {
vector[N - 1] mu_prefix;
}
transformed parameters {
vector[N] mu = append_row(mu_prefix, -sum(mu_prefix));
}
model {
mu ~ normal(0, inv(sqrt(1 - inv(N))));
}


I get the correct marginals:

             mean se_mean   sd
mu[1]        -0.04    0.02 1.03
mu[2]         0.02    0.02 1.02
mu[3]         0.04    0.02 1.01
mu[4]        -0.02    0.02 1.01


With more iterations the mean and sd are correct to more decimal places, as expected.

The direct constrained version might be more efficient, because it doesn’t require the matrix multiply A_qr * x_raw, which is a quadratic operation with lots of autodiff to do for the N x N Jacobian. On the other hand, it’s less of a normal sampling and reconstruct approach, which should mix very well.

#28

That matrix multiply is costly, but looking at the QR matrix, I’m fairly certain that it can be done in O(n) operations. I haven’t worked out the math formally, so the following is just based on empirical observation of what works. But you can swap the transformed data and transformed params blocks from above with:

transformed data{
int N = 9;
vector [N] Q_r_diag;
vector [N] Q_r_last;
for(i in 1:N){
Q_r_diag[i] = -sqrt((N-i)/(N-i+1.0));
Q_r_last[i] = inv_sqrt((N-i) * (N-i+1));
}

}
transformed parameters{
vector [N] x;
real x_aux = 0;

for(i in 1:N-1){
x[i] = x_aux + x_raw[i] * Q_r_diag[i];
x_aux = x_aux + x_raw[i] * Q_r_last[i];
}
x[N] = x_aux;
}


And get the same result.

           mean se_mean   sd
x[1]   0.02    0.02 1.01
x[2]  -0.01    0.02 0.97
x[3]  -0.01    0.02 0.98
x[4]  -0.01    0.02 1.02
x[5]   0.00    0.02 1.02
x[6]   0.01    0.02 1.01
x[7]  -0.01    0.02 0.98
x[8]   0.00    0.02 1.01
x[9]   0.01    0.02 1.01
sum_x  0.00    0.00 0.00
var_x  1.13    0.01 0.56


But I don’t have any good intuition if this or the fixing the last element approach would work better on actual problems…

#30

Wow. That’s super cool. My linear algebra isn’t up to much more than appreciating the results in a limited way, but I can seriously get behind this computational approach, as it’s clearly only \mathcal{O}(N). The constant overhead’s a bit worse than the sum-to-zero constraint, but I bet the geometry is a lot better.

I think you can vectorize that loop by noticing that the nth value of x_aux is just cumulative_sum(x_raw .* Q_r_last)[n],

transformed parameters {
vector[N] x;
x = x_raw .* Q_r_diag;
x[N] = 0;
x[2:N] += cumulative_sum(x_raw[1:N-1] .* Q_r_last[1:N-1])
}


It’s a bit obscure and neither cumulative-sum nor elementwise product have efficient built-in analytic derivatives.