How to sample a (singular) covariance matrix under compositional constraint

Hello Stan Community,

I model the composition of n groups as a sum-to-zero vector X (after a centered-log-ratio transformation), so that it follows a multivariate normal distribution with mean 0 and a singular variance-covariance matrix parameter Sigma. I am interested in this singular variance-covariance matrix and would like to estimate it from data.

The straightforward model below does not work because Sigma should be singular and positive semi-definite, while Stan's cholesky_factor_corr produces full-rank, positive-definite correlation matrices (and hence variance-covariance matrices). I got pathological divergences when I attempted to run this model on my example data.

data{
  int n;
  sum_to_zero_vector[n] X;
}
parameters{
  vector[n] log_sigma;
  cholesky_factor_corr[n] Sigma_cholesky;
}
model{
  X ~ multi_normal_cholesky(rep_vector(0.0,n), diag_pre_multiply(exp(log_sigma),Sigma_cholesky));
  // Priors
  log_sigma ~ std_normal();
  Sigma_cholesky ~ lkj_corr_cholesky(2);
}

An intuitive solution to this problem is to omit the nth group and specify the (n-1) x (n-1) sub-matrix of Sigma via an n-1 vector of log standard deviations and an (n-1) x (n-1) Cholesky factor:

parameters{
  vector[n-1] log_sigma;
  cholesky_factor_corr[n-1] Sigma_cholesky;
}
model{
  X[1:(n-1)] ~ multi_normal_cholesky(rep_vector(0.0,n-1), diag_pre_multiply(exp(log_sigma),Sigma_cholesky));
  // Priors
  log_sigma ~ std_normal();
  Sigma_cholesky ~ lkj_corr_cholesky(2);
}

The pathological divergences disappeared when I ran this improved model on the example data.

However, when I calculated the nth component of the variances from the estimated log_sigma and Sigma_cholesky, I observed inflated variation in the posterior, similar to the pathology of a naive implementation of a sum-to-zero vector (i.e., the last component collects the uncertainty of the previous n-1 components). In addition, when I changed the order of the groups in the data (so that a different group becomes the 'nth' group), I got very different posterior distributions for a group's variance depending on whether that group is the 'nth' group.

For a sum-to-zero vector itself, this sampling pathology is solved by the built-in sum_to_zero_vector type, which uses a Gram-Schmidt-style transform. Alternatively, an explicit MVN can be used, as discussed here.

Because the sum_to_zero_vector type behaves well under the compositional constraint, I believe there should be a way to tailor the behavior of the variance and correlation matrices under that constraint as well.

So I have two questions:

Question 1: Is there any known way to implement a covariance matrix that is well-behaved under compositional constraints?

Question 2: Is my attempted solution a viable path?

As an alternative to sampling the variances and the correlation matrix separately, I can sample a full-rank ((n-1) x (n-1) in our case) variance-covariance matrix from a Wishart distribution in Stan (though this is not as preferred or efficient as LKJ). It seems that when I set the scale matrix S of the Wishart distribution to an (n-1) x (n-1) correlation matrix with all off-diagonal elements equal to -1/(n-1), the prior density remains the same regardless of my choice of the 'nth' group. I have yet to implement this model, try it on my example data, and check its performance.
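To preview why this might work, here is a numerical sketch in Python (my own check, not Stan; it uses the 3-group singular matrix that appears later in this thread): the Wishart log-density kernel of the sub-matrix obtained by dropping any single group comes out identical, so the prior does not depend on which group plays the role of the 'nth' group.

```python
import numpy as np

def wishart_logkernel(W, S, nu):
    """Unnormalized Wishart(nu, S) log-density of a p x p matrix W."""
    p = W.shape[0]
    return (0.5 * (nu - p - 1) * np.log(np.linalg.det(W))
            - 0.5 * np.trace(np.linalg.inv(S) @ W))

# Example singular sum-to-zero covariance (rows and columns sum to 0).
Sigma = np.array([[ 1.0,  0.0, -1.0],
                  [ 0.0,  1.0, -1.0],
                  [-1.0, -1.0,  2.0]])
n = Sigma.shape[0]

# (n-1) x (n-1) scale matrix: 1 on the diagonal, -1/(n-1) off it.
p = n - 1
S = np.full((p, p), -1.0 / p)
np.fill_diagonal(S, 1.0)

# Log-kernel of each sub-matrix obtained by dropping one group.
lks = []
for i in range(n):
    keep = [j for j in range(n) if j != i]
    W = Sigma[np.ix_(keep, keep)]
    lks.append(wishart_logkernel(W, S, nu=n + 1))

print(lks)  # the three values coincide
```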

Any help will be much appreciated!


In transformed data do

vector[n-1] X_unc = sum_to_zero_vector_unconstrain(X);

Then try your 2nd model on this X_unc again

Hi spinkney,

I was unable to find the function sum_to_zero_vector_unconstrain in the exposed Stan functions. The Stan Math library has a function sum_to_zero_constrain(), but there seems to be no ‘unconstrain’ counterpart for it. Is this some function that I should build myself following Stan’s Gram-Schmidt process implementation?


It’s sum_to_zero_unconstrain


Thanks spinkney! I found the function and ran a toy example with data that looks like:

$K
[1] 3

$n
[1] 3

$X
          [,1]       [,2]       [,3]
[1,] 0.3209956  0.3367796 -0.6577751
[2,] 0.3770418 -0.8157524  0.4387106
[3,] 0.4952398 -1.2892500  0.7940102

The data X is generated by the singular variance-covariance matrix:

    [,1] [,2] [,3]
[1,]  1    0   -1
[2,]  0    1   -1
[3,] -1   -1    2
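
As a quick sanity check (in Python, outside the Stan workflow), this matrix is positive semi-definite with the all-ones vector in its null space, as a sum-to-zero covariance must be:

```python
import numpy as np

Sigma = np.array([[ 1.0,  0.0, -1.0],
                  [ 0.0,  1.0, -1.0],
                  [-1.0, -1.0,  2.0]])

evals = np.linalg.eigvalsh(Sigma)   # sorted ascending
print(evals)                        # one eigenvalue is 0, the rest positive
print(Sigma @ np.ones(3))           # the zero vector: 1 is in the null space
```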

The code using the unconstrained vectors is:

functions{

  tuple(vector,matrix) get_sigma_and_Omega_of_singular_VCoV(matrix VCoV){
    int cV = cols(VCoV);
    vector[cV] one_vec = rep_vector(1.0,cV);
    matrix[cV+1,cV+1] singular_VCoV;
    vector[cV+1] singular_sigma;
    singular_VCoV[1:cV,1:cV] = VCoV;
    singular_VCoV[1:cV,1+cV] = -1.0*VCoV*one_vec;
    singular_VCoV[1+cV,1:cV]= to_row_vector(singular_VCoV[1:cV,1+cV]);
    singular_VCoV[1+cV,1+cV] = one_vec'*VCoV*one_vec;
    singular_sigma = sqrt(diagonal(singular_VCoV));
    return (singular_sigma,diag_post_multiply(diag_pre_multiply(1.0./singular_sigma,singular_VCoV),1.0./singular_sigma));
  }
  
}

data{
  int<lower=1> K; // Number of observations
  int<lower=1> n; // Number of groups

  matrix[K, n] X; // Observations
}

transformed data{
  vector[n-1] mean_zero_1 = rep_vector(0.0,n-1);
  matrix[K,n-1] X_unc;
  for(k in 1:K) X_unc[k] = to_row_vector(sum_to_zero_unconstrain(to_vector(X[k])));
}

parameters{
  vector[n-1] log_sigma;
  cholesky_factor_corr[n-1] Sigma_cholesky;
}

model{
  for(k in 1:K) X_unc[k] ~ multi_normal_cholesky(mean_zero_1, diag_pre_multiply(exp(log_sigma),Sigma_cholesky));
  // Priors
  log_sigma ~ std_normal();
  Sigma_cholesky ~ lkj_corr_cholesky(2);
}

generated quantities {
  vector[n] full_sigma;
  matrix[n,n] full_R;
  (full_sigma,full_R) = get_sigma_and_Omega_of_singular_VCoV(
    multiply_lower_tri_self_transpose(diag_pre_multiply(exp(log_sigma),Sigma_cholesky))
    );
}
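
For reference, the block-completion inside get_sigma_and_Omega_of_singular_VCoV can be checked outside Stan. Below is my own Python transcription (the helper name is mine); feeding it the 2 x 2 identity recovers exactly the singular matrix that generated the toy data above.

```python
import numpy as np

def complete_singular_vcov(V):
    """Extend an (n-1) x (n-1) covariance V to the full n x n singular
    sum-to-zero covariance, mirroring the Stan function's block formula."""
    ones = np.ones(V.shape[0])
    last_col = -V @ ones        # covariances with the dropped group
    last_var = ones @ V @ ones  # variance of the dropped group
    return np.block([[V, last_col[:, None]],
                     [last_col[None, :], np.array([[last_var]])]])

# The 2 x 2 identity recovers the singular matrix from the toy example.
print(complete_singular_vcov(np.eye(2)))
```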


When I examine the summary of the generated quantity full_sigma, I have:

# A tibble: 3 × 10
  variable       mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 full_sigma[1] 0.649  0.567 0.345 0.237 0.302  1.23 1.00     1447.    1015.
2 full_sigma[2] 1.40   1.25  0.618 0.470 0.740  2.61 0.999    1441.    1112.
3 full_sigma[3] 1.48   1.34  0.660 0.519 0.734  2.73 1.00     1261.    1174.

If I switch the second and the third columns of the data X, I get noticeably different outcomes (even for the first element, which was not switched):

# A tibble: 3 × 10
  variable       mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 full_sigma[1]  1.10  0.954 0.555 0.344 0.570  2.15  1.00     979.     889.
2 full_sigma[2]  1.00  0.874 0.511 0.352 0.492  1.88  1.00     884.     681.
3 full_sigma[3]  1.26  1.10  0.694 0.506 0.530  2.52  1.00     777.     875.

I guess the correlation matrix I estimated here is not the (sub-)correlation matrix of the constrained X? Should I design a transformation of the estimated correlation matrix based on the Gram-Schmidt process?


I also implemented the Wishart distribution as the prior for the VCoV matrix with the following code:

functions{

  tuple(vector,matrix) get_sigma_and_Omega_of_singular_VCoV(matrix VCoV){
    int cV = cols(VCoV);
    vector[cV] one_vec = rep_vector(1.0,cV);
    matrix[cV+1,cV+1] singular_VCoV;
    vector[cV+1] singular_sigma;
    singular_VCoV[1:cV,1:cV] = VCoV;
    singular_VCoV[1:cV,1+cV] = -1.0*VCoV*one_vec;
    singular_VCoV[1+cV,1:cV]= to_row_vector(singular_VCoV[1:cV,1+cV]);
    singular_VCoV[1+cV,1+cV] = one_vec'*VCoV*one_vec;
    singular_sigma = sqrt(diagonal(singular_VCoV));
    return (singular_sigma,diag_post_multiply(diag_pre_multiply(1.0./singular_sigma,singular_VCoV),1.0./singular_sigma));
  }
  
  matrix equal_variance_sum_to_zero_VCoV(int M){
    matrix[M,M] eqvL = rep_matrix(-1.0,M,M);
    eqvL = add_diag(eqvL,1.0+M);
    eqvL = eqvL/M;
    return eqvL;
  }
  
}

data{
  int<lower=1> K; // Number of observations
  int<lower=1> n; // Number of groups

  matrix[K, n] X; // Observations
}

transformed data{
  vector[n-1] mean_zero_1 = rep_vector(0.0,n-1);
  cov_matrix[n-1] S = equal_variance_sum_to_zero_VCoV(n-1);
}

parameters{
  real log_sigma;
  cov_matrix[n-1] Sigma;
}

model{
  for(k in 1:K) X[k, 1:(n-1)] ~ multi_normal(mean_zero_1, exp(log_sigma*2)*Sigma);
  // Priors
  log_sigma ~ std_normal();
  Sigma ~ wishart(n+1,S);
}

generated quantities {
  vector[n] full_sigma;
  matrix[n,n] full_R;
  (full_sigma,full_R) = get_sigma_and_Omega_of_singular_VCoV(
    exp(log_sigma*2)*Sigma
    );
}


The generated quantities appear to be unaffected by the ordering of the data. With the original data, I get:

# A tibble: 3 × 10
  variable       mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 full_sigma[1] 0.762  0.653 0.445 0.294 0.330  1.56 1.000    1506.    1240.
2 full_sigma[2] 0.883  0.784 0.421 0.299 0.445  1.66 1.00     1269.    1233.
3 full_sigma[3] 1.05   0.945 0.444 0.337 0.562  1.90 1.00     1538.    1477.

If I switched the second and the third column of the data X, I got:

# A tibble: 3 × 10
  variable       mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>         <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 full_sigma[1] 0.743  0.636 0.429 0.276 0.329  1.51  1.00    1703.    1549.
2 full_sigma[2] 1.04   0.947 0.442 0.333 0.558  1.88  1.00    1370.    1374.
3 full_sigma[3] 0.881  0.780 0.434 0.298 0.447  1.71  1.00    1370.    1195.

The two summary tables have highly similar mean estimates after I correct the ordering of the variables.

The only thing is, I would like the correlation structure to be compatible with a general linear design, in the sense that for two VCoV or correlation matrices W1 and W2, an operation like aW1+bW2 makes sense for any real numbers a and b. I currently define such an operation on the transformed Cholesky factors of correlation matrices, but I don't have a solution for the VCoV yet.


Sorry for my ignorance (I wasn't trained as a statistician), but can you say what a "compositional constraint" is? An example would help.

Hi Bob,

The 'compositional constraint' is the sum-to-zero constraint on the random vector (the random composition). This constraint takes different forms for the different central moments of the vector. For the mean, it has the same sum-to-zero form (the means of the elements must also sum to zero). For the singular variance-covariance matrix, however, it has the form

\Sigma = \begin{bmatrix} \Sigma_{-n} & -\Sigma_{-n} \vec{1} \\ -\vec{1}^T \Sigma_{-n} & \vec{1}^T \Sigma_{-n} \vec{1} \end{bmatrix}

In the equation above, \Sigma_{-n} is the variance-covariance matrix without the nth column and row, and \vec{1} is a column vector of size n-1 whose elements are all 1.

So far, the sum_to_zero_vector handles the constraint on the composition itself and on its mean very well in my applications, but I have been running into the trouble mentioned above for the variance-covariances.

You need to make sure you're using the same semi-orthogonal matrix for both the constraining and unconstraining transforms. Unfortunately, Stan doesn't expose the semi-orthogonal matrix, so we'd have to find the right permutation matrix to form it and then apply it to the estimated singular covariance matrix. Easier is to just roll your own:

library(cmdstanr)

helmert_basis <- function(N) {
    n <- N + 1
    H <- matrix(0, n, N) # first N columns
    for (i in 1:N) { # i = 1,…,N
        H[1:i, i] <- 1 / sqrt(i * (i + 1))
        H[i + 1, i] <- -i / sqrt(i * (i + 1))
    }
    H
}

H <- helmert_basis(2)

n <- 3
n_samples <- 1000 # any number of simulated observations

R_singular <- matrix(c(1, 0, -1,
                       0, 1, -1,
                       -1, -1, 2), nrow = 3, byrow = TRUE)

# chol() returns the upper-triangular factor, so transpose to get the
# lower-triangular factor needed for sampling
L <- t(chol(t(H) %*% R_singular %*% H))
x_unc <- matrix(0, nrow = n_samples, ncol = n - 1)
x <- matrix(0, nrow = n_samples, ncol = n)
for (i in 1:n_samples) {
  x_unc[i, ] <- L %*% rnorm(n - 1)
  x[i, ] <- t(H %*% x_unc[i, ])
}

mod <- cmdstan_model("zs_singular_cov.stan", force_recompile = TRUE)

mod_out <- mod$sample(
  data = list(K = n_samples,
              n = 3,
              X = x,
              H = helmert_basis(2)),
  parallel_chains = 4
)

matrix(mod_out$summary("full_R")$mean, 3, 3)
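
As a quick check (a Python transcription of helmert_basis, assumed equivalent to the R version above), the basis is semi-orthogonal: its columns are orthonormal and each lies in the sum-to-zero subspace.

```python
import numpy as np

def helmert_basis(N):
    # Python transcription of the R helmert_basis() above.
    n = N + 1
    H = np.zeros((n, N))
    for i in range(1, N + 1):
        H[:i, i - 1] = 1.0 / np.sqrt(i * (i + 1))
        H[i, i - 1] = -i / np.sqrt(i * (i + 1))
    return H

H = helmert_basis(2)
print(np.round(H.T @ H, 12))        # identity: columns are orthonormal
print(np.round(H.sum(axis=0), 12))  # zeros: each column sums to zero
```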

data{
  int<lower=1> K; // Number of observations
  int<lower=1> n; // Number of groups

  matrix[K, n] X; // Observations
  matrix[n, n - 1] H;
}

transformed data{
  vector[n-1] mean_zero_1 = rep_vector(0.0,n-1);
  matrix[K,n-1] X_unc;
  for(k in 1:K)
    X_unc[k] = to_row_vector(H' * to_vector(X[k]));
}

parameters{
  vector[n-1] log_sigma;
  cholesky_factor_corr[n-1] Sigma_cholesky;
}

model{
  for(k in 1:K) X_unc[k] ~ multi_normal_cholesky(mean_zero_1, diag_pre_multiply(exp(log_sigma),Sigma_cholesky));
  // Priors
  log_sigma ~ std_normal();
  Sigma_cholesky ~ lkj_corr_cholesky(2);
}

generated quantities {
  matrix[n, n] full_R = H * multiply_lower_tri_self_transpose(diag_pre_multiply(exp(log_sigma), Sigma_cholesky)) * H';
}

With output

> mod_out$summary("full_R")
# A tibble: 9 × 10
  variable       mean  median     sd    mad      q5     q95  rhat ess_bulk ess_tail
  <chr>         <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 full_R[1,1]  1.08    1.08   0.0482 0.0467  1.01    1.16    1.00    3973.    2867.
2 full_R[2,1]  0.0224  0.0224 0.0339 0.0332 -0.0333  0.0776  1.00    4134.    2635.
3 full_R[3,1] -1.10   -1.10   0.0596 0.0599 -1.20   -1.01    1.00    3995.    3035.
4 full_R[1,2]  0.0224  0.0224 0.0339 0.0332 -0.0333  0.0776  1.00    4134.    2635.
5 full_R[2,2]  1.01    1.00   0.0446 0.0457  0.936   1.08    1.00    4386.    3223.
6 full_R[3,2] -1.03   -1.03   0.0567 0.0570 -1.12   -0.937   1.00    4356.    3159.
7 full_R[1,3] -1.10   -1.10   0.0596 0.0599 -1.20   -1.01    1.00    3995.    3035.
8 full_R[2,3] -1.03   -1.03   0.0567 0.0570 -1.12   -0.937   1.00    4356.    3159.
9 full_R[3,3]  2.13    2.13   0.0968 0.0971  1.98    2.29    1.00    4238.    2499.

Thanks @spinkney ! I implemented your solution and it worked well in reducing the sensitivity to the ordering of the data!

I did a comparison between the three methods mentioned in this thread and a method that I have been experimenting with. A brief description of the solutions:

- The naive solution simply specifies the log standard deviations and the correlation matrix without the last taxon/group.
- The Wishart solution uses Stan's Wishart distribution, which restricts the variance-covariance matrix to a 'location' (the scale matrix S) and a 'scale' (nu, which controls how fast the probability density drops away from the center).
- The Helmert solution transforms the constrained X to an unconstrained space with one fewer dimension through the canonical Helmert matrix.
- The average solution, which I have been experimenting with, averages the prior log-likelihood across all possible ways of dropping one taxon/group out.

I fitted the same data with shuffled orderings of the taxa/groups: one in the original 1-2-3-4 order and the other in 4-1-2-3 order. The naive solution drops taxon 3 and taxon 4, respectively, and yielded different mean estimates for the standard deviations (see the graph below). This deviation decreased as the number of observations increased (K=3 to K=20) and the effect of the priors decreased (as the data inform the posterior more and more). From this observation, I hypothesize that the deviation in the estimates comes from inaccurately imposed priors that are sensitive to the ordering of the data.

To check my hypothesis, in the average solution I captured the prior log-likelihoods of the normal prior on the log standard deviations and of the LKJ prior on the correlation sub-matrix (not its Cholesky factor, to avoid the Jacobian term in the formula), as well as the log-determinants of the correlation sub-matrix and of the variance-covariance sub-matrix. (Note: the index represents the row/column that is dropped from the full singular matrix.)

  `ll_norm[1]` `ll_norm[2]` `ll_norm[3]` `ll_norm[4]`
         <dbl>        <dbl>        <dbl>        <dbl>
1        -3.11        -3.09        -3.15        -2.89
2        -2.82        -2.94        -2.93        -2.88
3        -2.84        -2.85        -2.85        -2.82

  `ll_lkj_corr[1]` `ll_lkj_corr[2]` `ll_lkj_corr[3]` `ll_lkj_corr[4]`
             <dbl>            <dbl>            <dbl>            <dbl>
1            -1.66            -1.51            -2.02           -0.816
2            -3.19            -2.23            -2.36           -2.91 
3            -1.81            -1.71            -1.73           -1.98 

  `log_det_corr[1]` `log_det_corr[2]` `log_det_corr[3]` `log_det_corr[4]`
              <dbl>             <dbl>             <dbl>             <dbl>
1             -1.04            -0.895             -1.40            -0.200
2             -2.58            -1.61              -1.74            -2.29 
3             -1.19            -1.10              -1.11            -1.37 

  `log_det_VCoV[1]` `log_det_VCoV[2]` `log_det_VCoV[3]` `log_det_VCoV[4]`
              <dbl>             <dbl>             <dbl>             <dbl>
1              1.48              1.48              1.48              1.48
2             -3.41             -3.41             -3.41             -3.41
3             -2.58             -2.58             -2.58             -2.58

It is evident that the plain normal and LKJ priors depend on the ordering of the data (as different rows/columns are dropped). The LKJ prior's dependence stems from the dependence of the correlation sub-matrices' determinants on the ordering of the data. The variance-covariance sub-matrices' determinants, however, are invariant (which is consistent with the constraints on the singular VCoV matrix).
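
This determinant invariance is not a coincidence of this dataset: a symmetric matrix whose rows and columns sum to zero has all of its principal cofactors equal (the same fact that underlies the matrix-tree theorem). A quick Python check with a randomly generated singular sum-to-zero covariance (my own construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Build a random singular sum-to-zero covariance by centering a random
# PSD matrix: P projects onto the sum-to-zero subspace.
A = rng.normal(size=(n, n))
P = np.eye(n) - np.ones((n, n)) / n
Sigma = P @ (A @ A.T) @ P

# Determinant of the sub-matrix obtained by dropping each row/column.
dets = [np.linalg.det(np.delete(np.delete(Sigma, i, axis=0), i, axis=1))
        for i in range(n)]
print(dets)  # all four principal minors are equal
```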

From the above observation, it is clear why the Wishart solution works with a scale matrix S of the form

\bar{\sigma}^2 \cdot \begin{bmatrix} 1 & -\frac{1}{n-1} & -\frac{1}{n-1} & \dots \\ -\frac{1}{n-1} & 1 & -\frac{1}{n-1} & \dots \\ -\frac{1}{n-1} & -\frac{1}{n-1} & 1 & \dots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}

because the two data-dependent components of the density function are invariant to the ordering of the data: the determinant of the VCoV sub-matrix is invariant to ordering, and the trace of its product with the inverted S (\mathrm{tr}(S^{-1}W)) is proportional to the sum of all n (not just n-1) variances. As the graph above shows, the Wishart solution yielded highly similar mean estimates between the two orderings of the data.
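
The trace claim can be verified numerically as well (a Python sketch on the same 3-group example; with this S, the inverse is proportional to I + J, so tr(S^{-1}W) picks up the full trace of the singular matrix no matter which group is dropped):

```python
import numpy as np

Sigma = np.array([[ 1.0,  0.0, -1.0],
                  [ 0.0,  1.0, -1.0],
                  [-1.0, -1.0,  2.0]])
n = Sigma.shape[0]
p = n - 1
S = np.full((p, p), -1.0 / p)
np.fill_diagonal(S, 1.0)
S_inv = np.linalg.inv(S)

traces = []
for i in range(n):
    keep = [j for j in range(n) if j != i]
    traces.append(np.trace(S_inv @ Sigma[np.ix_(keep, keep)]))
print(traces)  # identical for every dropped group
```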

I haven't worked out the math for the Helmert solution, but I do notice that the canonical Helmert matrix reduces an S matrix of the form above to a diagonal matrix through the quadratic form, which I read as 'centering to zero' by analogy. Nevertheless, the Helmert solution does yield highly similar mean estimates, though the differences seem to be slightly larger than those of the Wishart solution (which could be just randomness).

My more or less failed attempt was this: if the prior likelihoods' dependence on the ordering of the data is the root of the problem, I can average the log-likelihood across all possible scenarios, and the result should then be insensitive to the ordering. However, even after I adjusted the likelihood to reflect the average (the numerical differences I captured in the generated quantities are on the order of 10^-8), I still observe an obvious difference in the mean estimates. Did I miss something here that makes the average solution infeasible, or did I just implement the average likelihood wrongly in my code?

The Stan and R code to reproduce the comparison can be found in the attachment.

Singular_VCoV_prior_sensitivity_to_ordering_of_data.R (5.0 KB)

singular_VCoV_naive_solution.stan (1.3 KB)

singular_VCoV_Wishart_solution.stan (1.2 KB)

singular_VCoV_average_solution.stan (2.7 KB)

singular_VCoV_Helmert_solution.stan (1.4 KB)


Hey, I don’t quite understand all the plots as they look fairly similar. Can you explain in more detail how the “Helmert” solution helps?

Hi spinkney, sure, I should describe the plot more clearly:
The issue I had with the naive solution (first row) is that the mean estimate (the dots in the plot) and the posterior distribution (error bars showing the 90% credible intervals) of the standard deviations for the 3rd and 4th groups (T3 and T4) changed when I supplied the same data with the columns in T1-2-3-4 (blue) versus T4-1-2-3 (red) order. The difference was large when the number of observations was 3 and decreased as the number of observations increased. This decrease with the number of observations, together with the changing determinant of the (n-1)-size correlation matrix, led me to think that the posterior distributions of the standard deviations change because the prior likelihood in my naive model for the same singular variance-covariance matrix changes with the ordering of the data.

The Helmert solution (bottom row) you provided yielded highly similar posterior distributions for the standard deviations of T3 and T4, even when the number of observations was small. I haven't worked out how the determinant of the transformed correlation matrix changes yet, but my guess is that it is not sensitive to the ordering of the data, and as a result the posterior distributions are not sensitive either.

For the Wishart solution, I believe the prior likelihood will be invariant to the ordering of the data, as illustrated in my previous post. The Wishart solution did yield almost identical posterior distributions for standard deviations of T3 and T4.

The average solution is supposed to average (the arithmetic mean of the log-likelihood) over every choice of group as the last column of the data. I thought it should yield the same prior likelihood for the same singular VCoV regardless of the ordering of the data. However, I still see the same persistent deviation as in my naive solution.


I captured the likelihood of the data and the LKJ likelihood of the correlation matrix in the generated quantities and compared them with their values had the data been ordered differently.

For example, when the input data were ordered T1-T2-T3-T4 column-wise, I calculated what the prior and data likelihoods would be if the same singular VCoV were given in the ILR space (i.e., before transformation by the Helmert matrix) but the data were ordered T2-T3-T4-T1, T3-T4-T1-T2, or T4-T1-T2-T3. The same Helmert matrix (without reordering) was used. In the tables below, the index in brackets indicates the group occupying the first column.

  ll_lkj[1] ll_lkj[2] ll_lkj[3] ll_lkj[4]
1     -0.79      -2.3     -2.30     -1.69
2     -0.85      -1.6     -2.11     -1.56
3     -1.11      -2.1     -1.87     -1.01

  ll_data[1] ll_data[2] ll_data[3] ll_data[4]
1     -12.9      -12.9      -12.9      -12.9
2     -11.2      -11.2      -11.2      -11.2
3     -11.4      -11.4      -11.4      -11.4

It can be seen that the prior likelihood of the Helmert solution is actually sensitive to the ordering of the data. The likelihood of the data, however, is invariant, even though the transformed data themselves change in ways beyond mere reordering when the input data are reordered. Yet the final outcome shows that the Helmert solution yields much more stable posterior distributions than my average solution, which supposedly gives the same prior likelihood for every possible order of the data. So I am confused: what is the real cause of the changing posterior when I reorder the data? Is there some hidden Jacobian structure or something else that I did not take into account?

By the way, I tested more ways of reordering the data, and I found that although the estimates from the Helmert solution are much more stable than those from my naive and average solutions, they seem to be less stable than those from the Wishart solution. So maybe the prior likelihood contributes to the stability of the estimates, but there are other, more important components that I haven't been looking at?

To interpret the graph: the mean estimates are marked as dots or triangles (triangles when the taxon is in the last column of the input); the color represents the ordering, with orders concatenated into a ring-like structure (so 1-2-3-4 has the same color as 2-3-4-1, 3-4-1-2, or 4-1-2-3). The fluctuation of the dots/triangles shows how sensitive the estimates are to the ordering of the data. The estimates from the Helmert solution look less like a flat line than the Wishart estimates do.

Thanks for the explanation. I also now see the y-axis label off to the left of the chart.

Can you try the wishart_cholesky lpdf as well? I'm curious whether it also shows the ordering issue, because the Cholesky factorization is sensitive to the ordering of the data. That is, if we permute the rows and columns, there will be a different Cholesky factor for the same correlation matrix. This is unavoidable.
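
To see this concretely, here is a small Python example (the matrix is arbitrary, my own choice): after permuting rows and columns, the diagonal of the Cholesky factor is not just a reordering of the original diagonal.

```python
import numpy as np

# An arbitrary positive-definite matrix (not from the thread).
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 2.0]])
L = np.linalg.cholesky(A)

# Permute rows and columns, then factor again.
perm = [2, 0, 1]
Lp = np.linalg.cholesky(A[np.ix_(perm, perm)])

print(np.diag(L))   # diagonal of the original factor
print(np.diag(Lp))  # not a reordering of the diagonal above
```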

Hi spinkney, I tried wishart_cholesky() but I still got stable posteriors for the standard deviations:

The captured prior likelihoods for wishart_cholesky() are:

  ll_wishart[1] ll_wishart[2] ll_wishart[3] ll_wishart[4]
1          0.75          0.77           1.1           1.1
2          1.92          1.21           1.3           1.3
3          1.79          1.77           2.1           2.1

The diagonal elements of the Cholesky factor of the VCoV do not simply permute when I permute the VCoV, and the determinant of the Jacobian matrix depends on the first n-1 diagonal elements (at least this is the case for the LKJ distribution). This explains why ll_wishart[1], [2], and [3] differ from each other while [3] and [4] are equal (the determinant of the Jacobian matrix does not change between those two). The index of the likelihood represents the column and row removed from the singular VCoV.

The codes (without customized functions) are pasted below:

data{
  int<lower=1> K; // Number of observations
  int<lower=1> n; // Number of groups
  matrix[K, n] X; // Observations
}

transformed data{
  vector[n-1] mean_zero_1 = rep_vector(0.0,n-1);
  matrix[n-1,n-1] S = equal_variance_sum_to_zero_VCoV(n-1);
  matrix[n-1,n-1] L_S = cholesky_decompose(S);
  matrix[n-1,n] transform_X = sum_to_zero_VCoV_transform_matrix(n-1);
}

parameters{
  real log_sigma;
  //cov_matrix[n-1] Sigma;
  cholesky_factor_cov[n-1] L_Sigma;
}

model{
  //for(k in 1:K) X[k,1:(n-1)] ~ multi_normal(mean_zero_1, Sigma);
  for(k in 1:K) X[k,1:(n-1)] ~ multi_normal_cholesky(mean_zero_1, L_Sigma);
  // Priors
  log_sigma ~ std_normal();
  //Sigma ~ wishart(n+1,exp(log_sigma*2)*S);
  L_Sigma ~ wishart_cholesky(n+1,exp(log_sigma)*L_S);
}

generated quantities {
  cov_matrix[n-1] Sigma = multiply_lower_tri_self_transpose(L_Sigma);
  vector[n] full_sigma;
  matrix[n,n] full_R;
  (full_sigma,full_R) = get_std_and_corr_of_VCoV(quad_form(Sigma,transform_X));

  vector[n] ll_wishart;
  vector[n] ll_data = rep_vector(0.0,n);
  for(i in 1:n){
    //ll_wishart[i] = wishart_lpdf(remove_row_col(quad_form(Sigma,transform_X), i)|n+1,exp(log_sigma*2)*S);
    ll_wishart[i] = wishart_cholesky_lpdf(cholesky_decompose(remove_row_col(quad_form(Sigma,transform_X), i))|n+1,exp(log_sigma)*L_S);
    for(k in 1:K){
      //ll_data[i] += multi_normal_lpdf(remove_element(X[k]',i)|mean_zero_1, remove_row_col(quad_form(Sigma,transform_X), i));
      ll_data[i] += multi_normal_cholesky_lpdf(remove_element(X[k]',i)|mean_zero_1, cholesky_decompose(remove_row_col(quad_form(Sigma,transform_X), i)));
    }
  }
}

So it is not the changing value of the prior likelihood that causes the different posterior estimates across orderings of the data, but something else about the priors? I am really puzzled now.