Test: Soft vs Hard sum-to-zero constraint + choosing the right prior for soft constraint

Cool. I didn’t know covariance matrices were flexible enough for that kind of constraint, but I guess I should learn how to think about matrices as bundles of equations.

Your implementation does give me what I need, namely the scaling. When I apply that to the constrained implementation in Stan,

transformed data {
  int<lower = 2> N = 4;
}
parameters {
  vector[N - 1] mu_prefix;
}
transformed parameters {
  // the last element is the negative sum of the others, so mu sums to zero exactly
  vector[N] mu = append_row(mu_prefix, -sum(mu_prefix));
}
model {
  // scale correction so that each mu[n] is marginally standard normal
  mu ~ normal(0, inv(sqrt(1 - inv(N))));
}

I get the correct marginals:

             mean se_mean   sd
mu[1]        -0.04    0.02 1.03
mu[2]         0.02    0.02 1.02
mu[3]         0.04    0.02 1.01
mu[4]        -0.02    0.02 1.01

With more iterations the mean and sd are correct to more decimal places, as expected.
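
(If it's useful, those summaries can be monitored directly with a small generated quantities block; sum_mu and var_mu are just names I'm using here, analogous to the sum_x and var_x reported further down.)

generated quantities {
  real sum_mu = sum(mu);       // exactly zero by construction, up to floating point
  real var_mu = variance(mu);  // sample variance across the N elements
}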

The direct constrained version might be more efficient, because it doesn’t require the matrix multiply A_qr * x_raw, which is a quadratic operation with lots of autodiff work for the N x N Jacobian. On the other hand, it’s less of a sample-independent-normals-and-reconstruct approach, which is the style that should mix very well.
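
For readers coming to this cold, the QR-based version being compared against looks roughly like the sketch below; it's assembled from the user's-guide construction that appears later in this thread (get_A_qr), with the same scale correction as above, so treat it as a sketch rather than a canonical listing.

transformed data {
  int N = 4;
  matrix[N, N] A = diag_matrix(rep_vector(1, N));
  matrix[N, N - 1] A_qr;
  for (i in 1:N - 1) A[N, i] = -1;
  A[N, N] = 0;
  A_qr = qr_Q(A)[ , 1:(N - 1)];
}
parameters {
  vector[N - 1] x_raw;
}
transformed parameters {
  vector[N] x = A_qr * x_raw;  // the O(N^2) matrix-vector multiply discussed above
}
model {
  x_raw ~ normal(0, inv_sqrt(1 - inv(N)));
}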

2 Likes

That matrix multiply is costly, but looking at the QR matrix, I’m fairly certain it can be done in O(n) operations. I haven’t worked out the math formally, so the following is just based on empirical observation of what works, but you can swap the transformed data and transformed parameters blocks from above with:

transformed data {
  int N = 9;
  vector[N] Q_r_diag;
  vector[N] Q_r_last;
  for (i in 1:N) {
    Q_r_diag[i] = -sqrt((N - i) / (N - i + 1.0));
    Q_r_last[i] = inv_sqrt((N - i) * (N - i + 1));
  }
}
transformed parameters {
  // x_raw is still declared as vector[N - 1] in the parameters block
  vector[N] x;
  real x_aux = 0;

  for (i in 1:N - 1) {
    x[i] = x_aux + x_raw[i] * Q_r_diag[i];
    x_aux = x_aux + x_raw[i] * Q_r_last[i];
  }
  x[N] = x_aux;
}

And get the same result.

           mean se_mean   sd
    x[1]   0.02    0.02 1.01
    x[2]  -0.01    0.02 0.97
    x[3]  -0.01    0.02 0.98
    x[4]  -0.01    0.02 1.02
    x[5]   0.00    0.02 1.02
    x[6]   0.01    0.02 1.01
    x[7]  -0.01    0.02 0.98
    x[8]   0.00    0.02 1.01
    x[9]   0.01    0.02 1.01
    sum_x  0.00    0.00 0.00
    var_x  1.13    0.01 0.56

But I don’t have any good intuition about whether this or the fix-the-last-element approach would work better on actual problems…

2 Likes

Wow. That’s super cool. My linear algebra isn’t up to much more than appreciating the results in a limited way, but I can seriously get behind this computational approach, as it’s clearly only \mathcal{O}(N). The constant overhead’s a bit worse than the sum-to-zero constraint, but I bet the geometry is a lot better.

I think you can vectorize that loop by noticing that the nth value of x_aux is just cumulative_sum(x_raw .* Q_r_last[1:N - 1])[n]:

transformed parameters {
  vector[N] x;
  x[1:N - 1] = x_raw .* Q_r_diag[1:N - 1];
  x[N] = 0;
  x[2:N] += cumulative_sum(x_raw .* Q_r_last[1:N - 1]);
}

It’s a bit obscure, and neither cumulative_sum nor the elementwise product has efficient built-in analytic derivatives.

1 Like
  vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    return Q_r;
  }

  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }

For whoever wants to use this in the future.
No response required.

4 Likes

Trying to use this right now :-) - thanks for the effort and sharing. If I understand it correctly, the code should be used as:

transformed data {
  // N is assumed to be declared in the data block
  vector[2 * N] Q_r = Q_sum_to_zero_QR(N);
  real x_raw_sigma = inv_sqrt(1 - inv(N));
}

parameters {
  vector[N - 1] x_raw;
}

transformed parameters {
  vector[N] x = sum_to_zero_QR(x_raw, Q_r);
}

model {
  x_raw ~ normal(0, x_raw_sigma);
}

where the non-obvious fact is that x_{raw} \sim N \left( 0, \frac{1}{\sqrt{1 - \frac{1}{N}}} \right), with the second argument being the standard deviation.

3 Likes

Yes, you are right about the usage of the functions.
Thanks, very good catch about the sd correction for x_raw. Should the correction be moved into sum_to_zero_QR, so that one can simply draw x_raw ~ normal(0, 1)? (Something like the sketch below.)
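
If it were moved, one way that could look is a thin wrapper around the existing function; this is just a sketch (untested), and sum_to_zero_QR_scaled is a made-up name:

  vector sum_to_zero_QR_scaled(vector x_raw, vector Q_r) {
    // rescale inside the transform so the caller can put a plain
    // standard-normal prior on x_raw and still get unit-scale marginals on x
    int N = num_elements(x_raw) + 1;
    return sum_to_zero_QR(x_raw * inv_sqrt(1 - inv(N)), Q_r);
  }
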
BTW, the main credits should go to @aaronjg.

1 Like

Hi all. I have implemented Bob’s direct constraint method in a model that I’m hoping to have a quick turnaround on writing into a manuscript. I’m wondering now how to cite it; i.e. is the scale correction published somewhere, or is there a more formal way to credit Aaron and Bob beyond an acknowledgement? I just want to make sure credit is given where it’s due!

I’m a new member on the Stan forums, though I’ve been silently lurking and reading them for quite a while now. So, thank you all!

1 Like

Definitely not me; I still do linear algebra on my fingers. I certainly haven’t published anything. I’m pretty sure @bgoodri is behind the QR decompositions for conditioning in both the Stan manual and in rstanarm.

I don’t think there is a more formal way to cite this? Maybe just cite this page if you can cite URLs, or as “personal communication” if you need to cite it. I wouldn’t be offended if you use this uncited though. I haven’t done any testing on this implementation beyond what’s in the thread and would be happy to see it used in a more formal way!

Ok, thanks. I’m not using the QR version, so it’s really just the scale adjustment, and the general idea of putting the same prior on the summed parameter as on the raw ones, that I felt I needed to justify. (I would love to try the QR stuff, but as a complete beginner with a ticking clock on my dissertation, a model that appears to ‘work’ with raw parameters that are easier to interpret is a compromise I’ve decided to make for the moment.)

I do see now that you’ve put these recommendations into the manual v2.18.0, so I’m sure I could cite that, too.

In case you’re curious, my model is just a logistic regression with some slight complications that incorporate phylogenetic covariance while allowing for various on-the-fly transformations of the phylogenetic branch lengths. The hard sum-to-zero constraint was producing clearly nonsensical estimates for the ‘reference level’ when I only had the prior on the raw parameters.

It’s odd to me, though, if this is a somewhat novel problem: how did linear modelers using software like BUGS, JAGS, and MCMCglmm solve this? By always choosing between ‘treatment contrasts’ and soft constraints? (See the sketch below for what I mean by the latter.)
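
By a soft constraint I mean something along these lines, where beta and N stand for whatever coefficient vector and length the model actually uses, and the tolerance scale (0.001 * N here) is a modeling choice rather than anything canonical:

model {
  beta ~ normal(0, 1);
  // soft sum-to-zero: penalize deviations of the sum from zero instead of
  // eliminating a degree of freedom; the scale controls how tight the pull is
  sum(beta) ~ normal(0, 0.001 * N);
}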

In that case, I take no responsibility :)

This may suggest that there is something off in your model implementation. I’ve normally noticed that hard vs. soft sum-to-zero constraints affect model run time, but they do not radically alter the results if both are converging properly.

I’m not sure in this particular case, but there are many models that are actually pathological but can appear to work fine with Gibbs sampling because they are never actually run until convergence. With Stan, the pathologies can become apparent much more quickly.

1 Like

Exactly—this was one of the main motivations in building Stan—to provide more flexibility and be able to fit a wider class of models (though it turns out there are some they can fit that we can’t, like missing count data).

1 Like

Oh, I forgot to add that Yuling has been working on this. See this Discourse topic on soft constraints. I haven’t digested it all yet, so I’m afraid I can’t summarize.

1 Like

But you posted the scale correction 1 / sqrt(1 - inv(N)), right?

Indeed, I had coded the model without adding the negative sum to the vector that had the prior, so only the N - 1 free parameters were constrained with the proper scale (as pointed out by Bob). With a vector of length 100+, I believe such a construction made it so that the effective prior on the ‘reference level’ had a scale roughly 10 times (and a variance more than 100 times) greater than the prior on all other levels? (Pondering your correction factor helped me reason that out.) However, that was the method suggested in the manual (2.17.0) section 9.7, ‘K-1 degrees of freedom’.
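
To spell out that arithmetic: if \beta_1, \ldots, \beta_{K-1} \sim N(0, 1) independently and the reference level is \beta_K = -\sum_{k=1}^{K-1} \beta_k, then \mathrm{sd}(\beta_K) = \sqrt{K - 1}, so with K around 100 the implied prior scale on the reference level is about 10, and its variance about 100, while every other level has scale 1.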

Although the 2.18.0 manual does additionally include your method for symmetric, standard-scaled marginals, I wonder if it’d be good to also state explicitly that the ‘different posterior mode’ of the other method (placing the prior on beta_raw instead of beta) isn’t necessarily subtle and can be extremely asymmetric, especially when the vector is long. Pardon me if this isn’t really the place to make those suggestions (not sure who actually maintains the manual…).

Are you referring to this post?

That was in the context of the QR decomposition, so I’m not exactly sure how you are using it without the QR decomposition.

I’m not sure I totally follow what is going on here. It sounds like you are actually exploring different models, rather than different parameterizations of the same model, and that would be why you are getting different results, rather than landing on different posterior modes of the same model.

I’m essentially using the parameterization written by Bob in this post:

Which directly followed the one you’re referring to. I somehow missed this post:

Where you do cite a paper for the method (so I could have answered my first question on my own, I guess…)

Anyway, yes, I realize that your QR parameterization and Bob’s parameterization are both different models when compared to the model originally written in the manual, which suggested placing the prior only on beta_raw.

I’m sorry I haven’t been clear; I didn’t mean to hijack this thread.

Thanks for this. The links here are broken; I checked the main reference https://www.cambridge.org/core/services/aop-cambridge-core/content/view/859F60F5AD4B81B03F65FB083E646BC5/S0008414X00031114a.pdf/normal-samples-with-linear-constraints-and-given-variances.pdf and it’s still really unclear to me. I know this thread is very old and this is asking for a lot, but if there’s an off chance you remember something about what you were trying to do in the code here, that would be really helpful!

Is it normal that the implementation here and the one in the Stan User’s Guide produce different results? Are they both correct but propose different mappings? Am I overlooking anything?

functions{

 // https://discourse.mc-stan.org/t/test-soft-vs-hard-sum-to-zero-constrain-choosing-the-right-prior-for-soft-constrain/3884/31?u=stemangiola
 vector Q_sum_to_zero_QR(int N) {
    vector [2*N] Q_r;

    for(i in 1:N) {
      Q_r[i] = -sqrt((N-i)/(N-i+1.0));
      Q_r[i+N] = inv_sqrt((N-i) * (N-i+1));
    }
    return Q_r;
  }

  vector sum_to_zero_QR_discourse(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector [N] x;
    real x_aux = 0;

    for(i in 1:N-1){
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i+N];
    }
    x[N] = x_aux;
    return x;
  }

 // https://mc-stan.org/docs/2_18/stan-users-guide/parameterizing-centered-vectors.html
  matrix get_A_qr(int K) {
    matrix[K, K] A = diag_matrix(rep_vector(1, K));

    for (i in 1:K - 1) A[K, i] = -1;
    A[K, K] = 0;

    return qr_Q(A)[ , 1:(K - 1)];
  }
  
  vector sum_to_zero_QR_guide(vector x_raw, matrix A_qr) {
    
    return A_qr * x_raw;
  }


}

transformed data {
  int K = 10;
  vector[K - 1] x_raw = [-1.89557863, 1.00724850, -0.02504404, -1.81977322, 0.29557215, 1.30002674, 0.54844423, 1.60936596, -0.32222849]'; // runif(n = 9, min = -2, max = 2)
  vector[K * 2] Q_r_1 = Q_sum_to_zero_QR(K);
  matrix[K, K - 1] A_qr = get_A_qr(K);

  print("1 ----", sum_to_zero_QR_discourse(x_raw, Q_r_1));
  print("2 ----", sum_to_zero_QR_guide(x_raw, A_qr));
}
mod$sample(init = 1, fixed_param = TRUE)
Running MCMC with 4 sequential chains...

Chain 1 1 ----[1.7983,-1.14945,-0.0576796,1.60033,-0.635069,-1.47407,-0.495558,-1.17631,1.0226,0.566902] 
Chain 1 2 ----[-1.82099,0.753007,-0.0983261,-2.11121,-0.159768,0.974602,0.357323,1.55129,-0.305693,0.859761] 
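
For what it’s worth, both printed vectors sum to zero up to rounding, consistent with the two implementations being different (but individually valid) orthogonal mappings onto the sum-to-zero subspace. That check could be made explicit with a couple of extra print statements; these are hypothetical additions to the transformed data block above:

  print("sum 1 ----", sum(sum_to_zero_QR_discourse(x_raw, Q_r_1)));
  print("sum 2 ----", sum(sum_to_zero_QR_guide(x_raw, A_qr)));
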
2 Likes

Could someone please explain, or provide an accessible reference for, why inv_sqrt(1 - inv(N)) is the ‘right’ scale to use on the free parameters in this situation to give a marginal standard normal distribution on the quantities of interest?

It is because the simplified code was meant to replace parts of the code mentioned at:

Where the scaling was used. The link in the post is dead, but @mjhajharia provided an updated reference outlining the general approach: Fraser 1951, Normal samples with linear constraints and given variances

For the problem at hand, using Fraser’s notation, we have n-dimensional normal (x_1, \ldots, x_n) with k constraints, expressed as:

\sum_{j=1}^n a_{p,j}x_j =0, (p = n - k +1, \ldots, n)

In our case k = 1 and a_{n,1} = a_{n, 2} = \ldots = a_{n,n} = 1

Then you can follow Fraser’s method to find an (n-1)-dimensional covariance matrix T = ||\tau_{r,s}||, (r, s = 1, \ldots, n - 1) and an orthogonal matrix B = ||b_{i,j}||, (i, j = 1, \ldots, n).

Then if y_1, \ldots, y_{n - 1} \sim MVN(0, T) we have x_i = \sum_{r = 1}^{n - 1} b_{r,i} y_r satisfying the constraint and with the correct marginal distribution.

I didn’t do the math to double check, but it appears that the code a) uses a QR decomposition of B to get a more efficient computation for the transformation from y to x and b) takes T to be a diagonal matrix with \tau_{i,i} = \frac{1}{1 - \frac{1}{n}}, i.e., each y_i has standard deviation \frac{1}{\sqrt{1 - \frac{1}{n}}}.
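
A quick way to see where that scale comes from (my own check, not from the paper): the transform multiplies y by a matrix Q with orthonormal columns spanning the sum-to-zero subspace, so QQ^T = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T. If y \sim N(0, \sigma^2 I_{n-1}), then x = Qy has \mathrm{Var}(x_i) = \sigma^2\left(1 - \frac{1}{n}\right), which equals 1 exactly when \sigma = \frac{1}{\sqrt{1 - \frac{1}{n}}}.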

Does that clarify?

2 Likes