Specifying normal priors on sum-to-zero vectors

Hello,

I have been working on some multivariate hierarchical models involving sum-to-zero vectors. Specifically, I have a vector of means (of centered log ratios) that sums to zero. I currently give it a normal prior using the following code (I edited the code to adjust the standard deviation):


parameters {
  sum_to_zero_vector[3] beta;
}
model {
  beta ~ normal(0, sqrt(3 * inv(3 - 1)) * 1);  // scale sqrt(N / (N - 1)) * sd gives each element marginal sd 1
} 

What I learned was that Stan can handle the inherent negative correlation internally.

Alternatively, I can specify a multivariate normal prior with a specific variance-covariance matrix on the first two elements and calculate the last element from them.

transformed data {
  matrix[2, 2] VCoV = rep_matrix(-0.5, 2, 2);  // target: [1, -0.5;
  VCoV = add_diag(VCoV, 1.5);                  //         -0.5, 1]
}

parameters {
  vector[2] beta_raw;
}
transformed parameters {
  vector[3] beta;
  beta[1:2] = beta_raw;
  beta[3] = 0.0 - sum(beta_raw);
}
model {
  beta_raw ~ multi_normal(rep_vector(0.0, 2), VCoV);
}

My derivation is that this specific variance-covariance matrix makes beta[3] have the same variance as beta[1] and beta[2].
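Spelling out the variance bookkeeping: with $\operatorname{Var}(\beta_1) = \operatorname{Var}(\beta_2) = 1$ and $\operatorname{Cov}(\beta_1, \beta_2) = -0.5$,

$$
\operatorname{Var}(\beta_3) = \operatorname{Var}(-\beta_1 - \beta_2) = 1 + 1 + 2 \times (-0.5) = 1.
$$

More generally, an exchangeable zero-sum vector with unit marginal variances must satisfy $0 = \operatorname{Var}\!\left(\sum_i \beta_i\right) = N + N(N - 1)\operatorname{Cov}(\beta_i, \beta_j)$, so $\operatorname{Cov}(\beta_i, \beta_j) = -1/(N - 1)$, which is $-0.5$ for $N = 3$ and matches the off-diagonal of VCoV.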


What I would like to ask is: are these two methods of specifying a normal prior on the sum-to-zero vector equivalent?

Hi @Yingnan_Gao, welcome to the Stan forums! If you’ve done the math and beta has the same pushforward distribution under both parameterizations, then they are equivalent (I had an LLM check it and it did say they are the same). The nice thing about sum-to-zero is that the log-det-Jacobian is zero (the constraining transform is linear, so there is no varying Jacobian term to account for). I do note that mathematical equivalence doesn’t always translate into numerical equivalence.

A few comments on why you might notice computational differences, however. Your problem is low-dimensional, which means you probably won’t notice much of a timing difference, though you might still see some numerical artifacts pop up in warmup. That multivariate normal is a computational party-pooper. Any time you can reduce from an MVN to a univariate normal, you avoid forming the covariance matrix and avoid inverting it for the log-density, which is what the first parameterization does. In the second, you could precalculate the Cholesky factor in transformed data and call multi_normal_cholesky, or precalculate the inverse and call multi_normal_prec; both will be faster. The first because the Cholesky factor handles the covariance inverse through efficient triangular solves, and the second because no inverse needs to be computed at each iteration (precomputing in the data or transformed data blocks is a one-off calculation). But you still have to form a matrix and do matrix math for the log-density and derivatives when calling the multi_normal_* densities.
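For example, a minimal sketch of your second parameterization with the Cholesky factor precomputed once in transformed data:

transformed data {
  matrix[2, 2] VCoV = add_diag(rep_matrix(-0.5, 2, 2), 1.5);  // [1, -0.5; -0.5, 1]
  matrix[2, 2] L_VCoV = cholesky_decompose(VCoV);             // one-off cost
}
parameters {
  vector[2] beta_raw;
}
model {
  beta_raw ~ multi_normal_cholesky(rep_vector(0.0, 2), L_VCoV);
}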

Thanks @spinkney! I wrote the second parameterization that way to emphasize that I was using a multivariate normal distribution. What I really use in my code is what I believe to be a non-centered parameterization of the multivariate normal prior:

transformed data {
  matrix[2, 2] VCoV = rep_matrix(-0.5, 2, 2);  // [1, -0.5;
  VCoV = add_diag(VCoV, 1.5);                  //  -0.5, 1]
  matrix[2, 2] L = cholesky_decompose(VCoV);
}

parameters {
  vector[2] beta_raw;
}
transformed parameters {
  vector[3] beta;
  beta[1:2] = L * beta_raw;
  beta[3] = 0.0 - sum(beta[1:2]);
}
model {
  beta_raw ~ normal(0.0, 1.0);
}

This parameterization avoids calling the multivariate normal density functions altogether.

Another related modeling issue that troubles me: I would like to model the residual correlation among the read counts of cell groups. Because standard discrete distributions in Stan do not have a correlation structure, I introduced latent sum-to-zero residuals on the centered log-ratio transformed composition as parameters, and then connected the composition to the read counts via the multinomial distribution (instead of solving the compound distribution). Intuitively, the model should be something like:

data {
  array[3] int y; // I keep only one observation here for simplicity
}
parameters {
  cholesky_factor_corr[3] L;
  vector[3] alpha;
  sum_to_zero_vector[3] beta;
  sum_to_zero_vector[3] intermediate_u; // the latent residuals
}
transformed parameters {
  vector[3] sigma = exp(alpha);
  vector[3] theta = softmax(beta + intermediate_u); // renamed from 'pi', which shadows the built-in pi()
}
model {
  // Hyper-priors
  L ~ lkj_corr_cholesky(2);
  alpha ~ normal(0.0, 1.0);
  // Priors
  beta ~ normal(0.0, 1.0);
  intermediate_u ~ multi_normal_cholesky(rep_vector(0.0, 3), diag_pre_multiply(sigma, L));

  y ~ multinomial(theta);
}

Of course, such a model yields severe divergent transitions across chains on my actual example data (not shown here). I identified the cause: I model the correlation explicitly as a parameter, but the ‘correlation matrix’ to which L corresponds is singular (positive semi-definite rather than positive definite, with rank N - 1) due to the sum-to-zero constraint on the residuals.
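To make the singularity explicit: the constraint $\mathbf{1}^\top u = 0$ holds exactly, so $\mathbf{1}^\top \Sigma \mathbf{1} = \operatorname{Var}(\mathbf{1}^\top u) = 0$, which means $\mathbf{1}$ is a null vector of $\Sigma$; no positive definite $\Sigma$ (and hence no full-rank correlation matrix) is compatible with sum-to-zero residuals.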

A more numerically stable version of the model reduces the dimensions of alpha and L to 2 and models the sum-to-zero constraint explicitly, similar to what I posted above.

data {
  array[3] int y; // I keep only one observation here for simplicity
}
parameters {
  cholesky_factor_corr[2] L;
  vector[2] alpha;
  sum_to_zero_vector[3] beta;
  vector[2] intermediate_u_raw; // the latent residuals
}
transformed parameters {
  vector[2] sigma = exp(alpha);
  vector[3] intermediate_u;
  intermediate_u[1:2] = diag_pre_multiply(sigma, L) * intermediate_u_raw;
  intermediate_u[3] = 0.0 - sum(intermediate_u[1:2]);
  vector[3] theta = softmax(beta + intermediate_u);
}
model {
  // Hyper-priors
  L ~ lkj_corr_cholesky(2);
  alpha ~ normal(0.0, 1.0);
  // Priors
  beta ~ normal(0.0, 1.0);
  intermediate_u_raw ~ normal(0.0, 1.0);
  y ~ multinomial(theta);
}
generated quantities {
  // sd implied for the constrained 3rd residual: Var(u3) = sigma1^2 + sigma2^2 + 2 sigma1 sigma2 rho,
  // where rho = L[2, 1] for a 2x2 correlation Cholesky factor
  real sigma_3 = sqrt(square(sigma[1]) + square(sigma[2]) + 2 * sigma[1] * sigma[2] * L[2, 1]);
}

But by doing this, I seem to lose the ability to directly impose a prior on the (log-transformed) standard deviation of the 3rd latent residual, and I would like to keep that ability. Additionally, I did observe inflated posterior variance on sigma_3, which means the ordering of the elements in the input data matters.
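The closest workaround I can think of (just a sketch, and I am not sure it is principled) is to compute the implied sigma_3 in transformed parameters and add a soft prior term on it; note this reweights the joint prior on alpha and L rather than acting as an independent prior, and no Jacobian adjustment is attempted:

transformed parameters {
  // ... as above, plus the implied sd of the constrained 3rd residual
  real sigma_3 = sqrt(square(sigma[1]) + square(sigma[2])
                      + 2 * sigma[1] * sigma[2] * L[2, 1]);
}
model {
  // soft prior on a derived quantity (deliberately no Jacobian adjustment)
  target += normal_lpdf(log(sigma_3) | 0.0, 1.0);
}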

To summarize, I have a vector alpha with a constraint much more complicated than the sum-to-zero one, and I would like to impose a prior (hopefully normal) on the full vector. Is there any way to get around it?

In the case you described I would have suggested the MVN version; however, it isn’t working well for you.

Try using the built-in sum_to_zero_constrain, which might be better behaved:

data {
  array[3] int y; // I keep only one observation here for simplicity
}
parameters {
  cholesky_factor_corr[2] L;
  vector[2] alpha;
  sum_to_zero_vector[3] beta;
  vector[2] intermediate_u_raw; // the latent residuals
}
transformed parameters {
  vector[2] sigma = exp(alpha);
  // sum_to_zero_constrain maps the 2 free values to a 3-vector summing to zero
  vector[3] intermediate_u
      = sum_to_zero_constrain(diag_pre_multiply(sigma, L) * intermediate_u_raw);
  vector[3] theta = softmax(beta + intermediate_u);
}
model {
  // Hyper-priors
  L ~ lkj_corr_cholesky(2);
  alpha ~ normal(0.0, 1.0);
  // Priors
  beta ~ normal(0.0, 1.0);
  intermediate_u_raw ~ normal(0.0, 1.0);
  y ~ multinomial(theta);
}
generated quantities {
  // placeholder: the implied sd of the 3rd element now depends on the
  // sum_to_zero_constrain transform, not the simple -(u1 + u2) formula
  real sigma_3 = calculate_the_singular_VCoV_and_return_constrained_sigma(sigma, L);
}

That’s right. In 2D, the multivariate densities in Cholesky form are pretty efficient because it’s very easy to compute the determinant and quadratic form, so you might not see much difference from using the direct coding.

transformed data {
  // zero and L were left implicit in the original; defined here so the block runs
  vector[2] zero = rep_vector(0, 2);
  cholesky_factor_cov[2] L = cholesky_decompose([[1, -0.5], [-0.5, 1]]);
}
parameters {
  vector[2] beta_prefix;
}
transformed parameters {
  vector[3] beta = append_row(beta_prefix, -sum(beta_prefix));
}
model {
  beta_prefix ~ multi_normal_cholesky(zero, L);
}
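For reference, the pieces being computed in 2D are tiny: with $\Sigma = L L^\top$,

$$
\log \operatorname{MVN}(x \mid 0, L L^\top) = -\tfrac{1}{2} \lVert L^{-1} x \rVert^2 - \textstyle\sum_i \log L_{ii} - \tfrac{d}{2} \log 2\pi,
$$

where $L^{-1} x$ is a single triangular solve and the log-determinant is just the sum of the log diagonal.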

I’d recommend coding this as follows with type hints and explicit construction.

transformed data {
  cov_matrix[2] Sigma = [[1, -0.5], [-0.5, 1]];
  cholesky_factor_cov[2] L_Sigma = cholesky_decompose(Sigma);
}
parameters {
  vector[2] beta_std; 
}
transformed parameters {
  vector[3] beta;
  beta[1:2] = L_Sigma * beta_std;  // implies beta[1:2] ~ multi_normal(0, Sigma);
  beta[3] = -(beta[1] + beta[2]);  // faster than slicing a vector;   0 - x is just -x
}
model {
  beta_std ~ std_normal();  // standard normal is built in
}

You could code this as follows:

transformed data {
  // zero2 and L defined here for completeness
  vector[2] zero2 = rep_vector(0, 2);
  cholesky_factor_cov[2] L = cholesky_decompose([[1, -0.5], [-0.5, 1]]);
}
parameters {
  sum_to_zero_vector[3] beta;
}
model {
  beta[1:2] ~ multi_normal_cholesky(zero2, L);
}

It’s not quite the same model because of the constraint, but it’s very close.