Generalizing lkj_corr_cholesky via a Reduced-Parameter Onion

Most of the models I estimate have parameters with a multivariate normal prior. The correlation matrix in the prior can be estimated via lkj_corr_cholesky. But as the dimension of the correlation matrix increases (mine are often 100-200), the correlations generated from lkj_corr_cholesky also tend to get smaller (even with eta = 1), and appear to underestimate the correlations in my data.

I coded the Onion method for generating LKJ correlations based on this post: https://discourse.mc-stan.org/t/lkj-corr-lpdf-correlation-matrix-is-not-positive-definite/31995/7. The results I got from it are nearly identical to lkj_corr_cholesky. But rather than coding the full Cholesky factor via the Onion, I tried to estimate a reduced version. This reduced version fixes the off-diagonal elements to 0 for columns above a threshold (the columns marked _F below). For example, we can reduce this 7-dimensional corr_cholesky to 3 estimated columns, with fixed 0’s in columns 4-6:

Col1    Col2    Col3    Col4_F  Col5_F  Col6_F  Col7
 1.0    0       0       0       0       0       0
 0.3    0.9539  0       0       0       0       0
 0.4    0.8     0.4472  0       0       0       0
-0.4   -0.3    -0.5     0.7071  0       0       0
 0.3    0.7     0.1     0       0.6403  0       0
 0.3    0.5     0.2     0       0       0.7874  0
 0.6    0.4     0.3     0       0       0       0.6245
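
As a quick sanity check, here is a minimal R sketch using the rounded values from the table above: each row has (approximately) unit length, and L L' is a valid correlation matrix.

L <- rbind(
  c( 1.0,  0,      0,      0,      0,      0,      0),
  c( 0.3,  0.9539, 0,      0,      0,      0,      0),
  c( 0.4,  0.8,    0.4472, 0,      0,      0,      0),
  c(-0.4, -0.3,   -0.5,    0.7071, 0,      0,      0),
  c( 0.3,  0.7,    0.1,    0,      0.6403, 0,      0),
  c( 0.3,  0.5,    0.2,    0,      0,      0.7874, 0),
  c( 0.6,  0.4,    0.3,    0,      0,      0,      0.6245))
R <- tcrossprod(L)       # R = L L'
range(diag(R))           # ~1, up to the rounding in the table
min(eigen(R)$values)     # > 0, so R is a valid correlation matrix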

The beta priors on the R2 (row sum-of-squares) elements are held constant for rows 4-7, whereas in the full Onion they change from row to row as the number of free elements grows. The resulting correlation matrices span a wider range of values. A correlation matrix of dimension 100, when “reduced” to just 10 columns, gives the same distribution of correlations as a 10-dimensional lkj_corr_cholesky matrix.

So far this appears to work well. Interestingly, despite having fewer parameters, estimation time is not reduced, and is in fact slightly longer than for the full estimation. I am guessing this is because the beta priors in the reduced version force Stan to explore a wider range of correlations.

I’ve attached my code below. Please criticize if you think this is problematic, or test it yourself. If P_red = P, then one recovers the full standard Onion = lkj_corr_cholesky.

functions {
  array[] vector beta_shapes(int P, int P_red, real eta) {
    // R2 = sum of squares of each row's non-diagonal entries; gets beta(shape1, shape2) priors
    real alpha = eta + (P_red - 2) / 2.0;
    array[2] vector[P - 1] shape;
    
    shape[1][1] = alpha;
    shape[2][1] = alpha;
    int k = 1;
    for (i in 2 : (P - 1)) { // P-1 corresponds to final Pth diagonal  
      if (k < (P_red - 1)) {
        k += 1;
        alpha = alpha - 0.5;
      }
      shape[1][i] = k / 2.0;
      shape[2][i] = alpha;
    }
    return shape;
  }
  
  matrix lkj_onion(int P, int P_red, row_vector l, vector R2) {
    matrix[P, P] L = rep_matrix(0, P, P); // cholesky_factor corr matrix
    {
      int start = 1;
      int end = 2;
      
      L[1, 1] = 1.0;
      // Row 2
      L[2, 1] = 2.0 * R2[1] - 1.0;
      L[2, 2] = sqrt(1.0 - square(L[2, 1]));
      // Rows 3 to P
      int row_n = 2; // number of free entries in the current row
      for (krow in 3 : P) {
        int r_i = krow-1;  // R2 index: one less than row 
        row_vector[row_n] l_row = l[start:end]; // l is long vector, get the subset for this row
        real scale = sqrt(R2[r_i] / dot_self(l_row)); // sum_sq(nondiag) = R2
        L[krow, 1:row_n] = l_row * scale; 
        L[krow, krow] = sqrt(1.0 - R2[r_i]); // diagonal so sum_sq(all) = 1
        if (krow <= P_red) row_n += 1; // increase row_n until we hit P_red
        start = end + 1;
        end = start + row_n - 1;
      }
    }
    return L;
  }
}
data {
  int<lower=2> P; // dim of corr mat (the construction assumes P >= 2)
  int<lower=0, upper=P> P_red; // non-diag cols 1:P_red are estimated, cols > P_red fixed at 0
  real<lower=0> eta; // lkj param
}
transformed data {
  array[2] vector[P - 1] diag_beta_shapes = beta_shapes(P, P_red, eta);
  int n_offdiag = P_red*(P - P_red) + choose(P_red, 2); // off-diagonal parameters given reduced  
}
parameters {
  row_vector[n_offdiag - 1] l; // off diagonal elements (except [2,1]) put into one vector
  vector<lower=0, upper=1>[P - 1] R2; // SumSquares of non-diagonal, 1st row has none, estimate 2:P  
}
transformed parameters {
  cholesky_factor_corr[P] L = lkj_onion(P, P_red, l, R2); // Make lower chol(LKJ)
}
model {
  target += std_normal_lpdf(l);
  target += beta_lpdf(R2 | diag_beta_shapes[1], diag_beta_shapes[2]);
}
generated quantities {
  matrix[P, P] cor_LKJOnion = multiply_lower_tri_self_transpose(L);
}
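
To try it out, a minimal cmdstanr driver might look like the sketch below; the file name reduced_onion.stan and the draw counts are just illustrative.

library(cmdstanr)

# Assumes the Stan program above is saved as reduced_onion.stan.
mod <- cmdstan_model("reduced_onion.stan")
fit <- mod$sample(data = list(P = 100, P_red = 10, eta = 1),
                  chains = 4, iter_sampling = 1000)

# Prior draws of one correlation from the reduced-onion prior; compare
# against draws from lkj_corr_rng(10, 1) to check the claim above.
onion_12 <- as.numeric(fit$draws("cor_LKJOnion[1,2]", format = "draws_matrix"))
hist(onion_12, xlim = c(-1, 1))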

I think the constraint you are imposing with the structural zeros in your Cholesky factor is that the elements corresponding to those columns are modeled as conditionally independent of one another, conditional on the previous elements. If that is an accurate or reasonable simplification to make, it seems desirable to do so.
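
Here is a quick numerical check of that reading, a minimal R sketch reusing the 7 x 7 example factor from the first post: the conditional correlation matrix of elements 4-7 given elements 1-3 comes out as the identity.

# Same reduced factor as in the 7 x 7 example above.
L <- rbind(c( 1.0,  0,      0,      0,      0,      0,      0),
           c( 0.3,  0.9539, 0,      0,      0,      0,      0),
           c( 0.4,  0.8,    0.4472, 0,      0,      0,      0),
           c(-0.4, -0.3,   -0.5,    0.7071, 0,      0,      0),
           c( 0.3,  0.7,    0.1,    0,      0.6403, 0,      0),
           c( 0.3,  0.5,    0.2,    0,      0,      0.7874, 0),
           c( 0.6,  0.4,    0.3,    0,      0,      0,      0.6245))
R <- tcrossprod(L)
# Conditional covariance of elements 4:7 given 1:3 (Schur complement).
S <- R[4:7, 4:7] - R[4:7, 1:3] %*% solve(R[1:3, 1:3], R[1:3, 4:7])
round(cov2cor(S), 4)  # identity: conditionally independent given 1:3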

It’s expected that the LKJ prior with parameter \eta = 1 puts more prior mass on smaller correlations as the dimension of the correlation matrix grows. The LKJ(1) prior is uniform over all possible correlation matrices (under the natural volume measure on n \times n matrices), and valid correlation matrices with large correlations become increasingly rare as the dimension increases. If this behavior is undesirable and you would like to put more prior mass on correlation matrices with large correlations, you can reduce the value of \eta in the LKJ distribution. If your main concern is developing a prior that treats large correlations as likely, rather than enforcing conditional independence between some elements of the outcome, then I would suggest experimenting with \eta < 1 in the LKJ prior rather than fixing some elements of the Cholesky factor to zero.

Thanks jsocolar for reading the post and for your thoughtful response. With a high-dimensional correlation matrix (like P = 100), \eta is almost irrelevant: \eta < 1 is nearly identical to \eta = 1 or even 10. At least that is what I see in my testing, based on histograms of random draws from the model below. Someone please correct me if I’m wrong.

data {
  int<lower=2> P;
  real<lower=0> eta;
}
generated quantities {
  matrix[P, P] lkj_corr = lkj_corr_rng(P, eta);
}

Your thoughts on conditional independence make sense, but I have not been able to observe that effect empirically. For the reduced P, I am using 10 + sqrt(P - 10), so > 10. If I used a reduced P of, say, 4, I expect I would be more likely to observe conditional independence. But again, I am curious about others’ thoughts. I am still testing this and will update if I find otherwise.


You’re totally right, and I’m a bit surprised. In case it’s useful to anyone, here’s a self-contained script to play around with at the end of this message.

In particular, I’m surprised at how, even for modest dimension, the LKJ prior refuses to put much mass on quite high correlations even for vanishingly small \eta, and at how this becomes even more stubborn as the dimensionality increases.

Perhaps enforcing some kind of external structure is the best/only way to ensure some very high correlations here, either through some latent factors, some kind of block structure, or specific priors on certain elements, as appropriate.

Note, I could be missing something, but I still think your strategy above amounts to enforcing a structure of conditional independence, where the last k elements of the vector are conditionally independent of one another given the values of the first n - k elements. Note that this doesn’t mean the last k elements will be uncorrelated with one another, because any of them can depend on the first n - k elements.

library(cmdstanr)
file <- write_stan_file("
functions {
  matrix lkj_corr_exposed_rng(int m, real eta) {
    return lkj_corr_rng(m, eta);
  }
}
")
mod <- cmdstan_model(file, compile_standalone = TRUE)
mod$expose_functions(global = TRUE)

element12 <- numeric(10000)
for (i in 1:10000) {
  lkj <- lkj_corr_exposed_rng(10, 1)
  element12[i] <- lkj[1, 2]
}
hist(element12, xlim = c(-1, 1))

There’s a geometric reason and there’s another LKJ-specific reason. I’ll try to explain it without AI or pictures, so you might want to consult additional sources. On the geometric reason: a surprising feature of positive definite correlation matrices is that the mass concentrates at diagonal 1s as the dimension increases. There’s just no way to satisfy positive definiteness with even one large-in-magnitude (e.g. -0.8) correlation value. You can get a bit of intuition about this from the Cholesky factor, whose rows must have unit length: if one of those values is large, there’s less “stick” left for the rest of the vector. Now, if we’re using a lower triangular Cholesky factor, go to row N (where N > 30). This row multiplies against every prior row to form the correlation matrix, so even one large element here carries an astonishingly large amount of information that swamps the entire matrix for this one value. Although it’s possible to get large values, the probability of getting them is very, very small in large dimensions. The LKJ is looking at the entire correlation matrix as one entity and places density uniformly across the entire matrix. You must think of the matrix itself almost like a single “value”.
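
To make the “stick” intuition concrete, here is a tiny R sketch: the squared entries of each Cholesky row must sum to 1, so committing one large entry leaves little budget for the rest.

# Remaining "stick" (root of leftover squared length) in a unit-length
# Cholesky row after committing one entry of size r.
r <- c(0.1, 0.5, 0.9)
sqrt(1 - r^2)  # 0.995 0.866 0.436: a large entry shrinks the budget fast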

And there’s the rub: the LKJ works across all correlation matrices and doesn’t “know” your structure. It makes no assumption about structure - like blocks - and is probably inappropriate for many models. Give me a week or so and I’ll show you a model where I put LKJ priors on blocks within a correlation matrix; this was much better for that particular problem because we were agnostic within the blocks but we knew the block structure. Simulations showed much better recovery, because the structure really existed and an LKJ on the entire matrix was like telling the model that it was a plain-jane correlation matrix and to treat the entire matrix as one giant value. It’s possible to do this because each sub-block is PD (though not necessarily a correlation matrix, it can be decomposed into one).

I’ve been meaning to write up a blog or preprint with all these various insights about correlation matrices found or hidden in the literature for probabilistic programmers. I imagine that many different fields have looked at these PD matrices and these things are well-known within fields but lay hidden for your typical applied statistician or data modeler. If anyone wants to help with this, it could be a fun doc to put together.


@spinkney I’m kind of in a similar boat: most of the data that I work with is strongly correlated. So working with LKJ correlations has been a big pain. The two things that I found worked better were 1) using a regression approach (I’ve detailed it elsewhere on the forums), and 2) unwrapping the LKJ correlation to allow more flexibility. Number 2 is kind of in line with some of what I have written on the forums about vine copulas. My hunch was that the LKJ correlation constrained the correlation matrix too much, and that allowing more flexibility in the partial correlations was a better approach.

My only issue with a “block” approach, if I’m understanding you correctly, is that it requires knowing what the blocks are in advance. That works in some cases, but not in others.

Just following up on my prior post: I think one thing that would really help is for the Stan documentation to explain the LKJ prior in more detail. For instance, describing how the partial correlations are distributed would be a big step up (an LLM might give good suggestions on how to make the documentation more detailed). Following on this, I also suggest making available a function in Stan that takes canonical partial correlations and converts them to a Cholesky factor, so that it is easier for users to make their own assumptions on the partials. ChatGPT did the version below, assuming that the user simulates raw unconstrained variables and then translates them to (-1, 1) through a tanh. I’m fairly sure I’ve done one before myself, but it would take longer to find, and you might already have one in your useful Stan functions library.


functions {
  // Build the Cholesky factor of a correlation matrix from
  // unconstrained C-vine / canonical partial-correlation parameters.
  //
  // theta is ordered by vine level:
  // level 1: rho_{1,2}, rho_{1,3}, ..., rho_{1,K}
  // level 2: rho_{2,3;1}, rho_{2,4;1}, ..., rho_{2,K;1}
  // ...
  // level K-1: rho_{K-1,K;1:(K-2)}
  //
  // We transform theta -> z = tanh(theta), then use the standard
  // canonical partial-correlation to Cholesky construction.
  matrix cvine_theta_to_cholesky(vector theta, int K) {
    matrix[K, K] z = rep_matrix(0.0, K, K);
    matrix[K, K] L = rep_matrix(0.0, K, K);
    int pos = 1;

    for (lev in 1:(K - 1)) {
      for (i in (lev + 1):K) {
        z[i, lev] = tanh(theta[pos]);
        pos += 1;
      }
    }

    L[1, 1] = 1.0;
    for (i in 2:K) {
      real prod_term = 1.0;
      for (j in 1:(i - 1)) {
        L[i, j] = z[i, j] * prod_term;
        prod_term *= sqrt(1.0 - square(z[i, j]));
      }
      L[i, i] = prod_term;
    }

    return L;
  }
}
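
One way to sanity-check the construction without compiling Stan is a direct R port of the same loops (a sketch; cvine_to_chol is just an illustrative name): random partial correlations in (-1, 1) should always yield a valid correlation matrix.

# R port of the Cholesky construction above, for testing.
cvine_to_chol <- function(z) {  # z: K x K matrix, lower triangle holds partials
  K <- nrow(z)
  L <- matrix(0, K, K)
  L[1, 1] <- 1
  for (i in 2:K) {
    prod_term <- 1
    for (j in 1:(i - 1)) {
      L[i, j] <- z[i, j] * prod_term
      prod_term <- prod_term * sqrt(1 - z[i, j]^2)
    }
    L[i, i] <- prod_term
  }
  L
}

K <- 5
z <- matrix(0, K, K)
z[lower.tri(z)] <- tanh(rnorm(K * (K - 1) / 2))  # partials in (-1, 1)
R <- tcrossprod(cvine_to_chol(z))
c(min(eigen(R)$values) > 0, max(abs(diag(R) - 1)) < 1e-12)  # TRUE TRUE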

You’re way better at this geometry stuff than me, but I think this is just wrong.

I can define a positive-definite correlation matrix with entries \Sigma_{m, n} = \rho^{|m - n|} and it will be positive-definite as long as \rho \in (-1, 1). So it can have a lot of high positive or negative correlations.
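
A quick R check of this AR(1)-style construction:

# Sigma[m, n] = rho^|m - n| is positive definite for any rho in (-1, 1),
# even when many of the correlations are large.
rho <- 0.9
n <- 50
Sigma <- rho^abs(outer(1:n, 1:n, "-"))
min(eigen(Sigma)$values)      # > 0
max(Sigma[upper.tri(Sigma)])  # 0.9: plenty of large correlations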

I think you also wind up sampling values with high correlations if you set \eta < 1 in the LKJ prior. I think it’s like the beta distribution that way, and this pushes the mass into the corners. I’m not sure what the distribution of correlations looks like when \eta = 1 (uniform). The random graph theorists must know this, as they’re the ones who came up with these vegetation-named models (like vine and onion) in the first place.

Don’t know why I didn’t just try this. I have Stan right here :-)

I defined a model that draws an N \times N correlation matrix with uniform distribution.

data {
  int<lower=0> N;
}
generated quantities {
  matrix[N, N] x = lkj_corr_rng(N, 1.0);
}

Using 1.0 as \eta gives it a uniform distribution.

Taking N = 2, the x[1, 2] entry has 90% central interval of (-0.9, 0.9). Taking N = 10, that goes down to (-0.5, 0.5). Taking N = 100, CmdStanPy takes a minute or so to do the summary, but it goes even further down to (-0.16, 0.16).

If I take \eta = 0.01, then for N = 10, it’s only (-0.55, 0.55). I would’ve expected that to push more mass into the corners. Even at \eta = 10^{-10} it’s still 0.55.

For an equicorrelation matrix, where every correlation is equal to \rho, positive definiteness requires -\frac{1}{n-1} < \rho < 1. In your example you’re not looking at the entire matrix but at one value, and yes, that one value’s range can easily be -1 to 1: if I have all 0s in the matrix everywhere but that one value, then that value in a super large correlation matrix can be anywhere between -1 and 1. My point is that with large correlation matrices it is highly unlikely to have large values when uniformity is expected. Having a bunch of values between -0.3 and 0.3 constrains the entire matrix due to positive definiteness.
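
A quick R check of the equicorrelation bound:

# The equicorrelation matrix (1 - rho) * I + rho * J has eigenvalues
# 1 + (n - 1) * rho (once) and 1 - rho (n - 1 times), so PD needs
# rho > -1 / (n - 1).
n <- 10
eq_cor <- function(rho) (1 - rho) * diag(n) + rho
min(eigen(eq_cor(-1 / (n - 1) + 1e-6))$values)  # just above 0: PD
min(eigen(eq_cor(-1 / (n - 1) - 1e-6))$values)  # below 0: not PD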

A while ago I wrote up this function, which gets the bounds of one unknown correlation value in a correlation matrix given all the other values.

## Written by Sean Pinkney
## October 12, 2020


demo_matrix <- matrix(c(1, 0.5, 0.5, 0,
                        0.5, 1, 0.5, 0.3,
                        0.5, 0.5, 1, -0.1,
                        0, 0.3, -0.1, 1),
                      nrow = 4, ncol = 4)

demo_matrix[2, 1] <- NA
demo_matrix[1, 2] <- NA

cor_swap <- function(mat) {
  if (!isSymmetric(mat)) stop("Input matrix must be symmetric")
  
  container_mat <- mat
  num_na <- which(is.na(mat), arr.ind = TRUE)
  n <- nrow(mat)
  
  container_mat[num_na[1], ] <- mat[n, ]
  container_mat[n, ] <- mat[num_na[1], ]
  
  switch_a <- container_mat[ , num_na[2]]
  switch_b <-  container_mat[, n - 1 ]
  container_mat[ , n - 1] <- switch_a
  container_mat[ , num_na[2]] <- switch_b
  
  switch_a <- container_mat[ , num_na[2, 2] ]
  switch_b <- container_mat[, n ]
  container_mat[, n] <- switch_a
  container_mat[,  num_na[2, 2]] <- switch_b
  
  switch_a <- container_mat[num_na[2, 1], ]
  switch_b <- container_mat[n - 1, ]
  container_mat[n - 1, ] <- switch_a
  container_mat[num_na[2, 1], ] <- switch_b
  return(container_mat)
}

# Cholesky factorization that skips over the missing (n, n - 1) entry
chol_mis <- function(mat){
  n <- nrow(mat)
  L <- diag(0, n)
  
  for (j in 1:n) {
    if (j == 1)
      L[1, 1] <- sqrt(mat[1, 1])
    else
      L[j, j] <- sqrt(mat[j, j] - L[j, 1:(j - 1)] %*% L[j, 1:(j - 1)])
    
    if (j != n) {
      for (i in (j + 1):n) {
        if (i == n && j == n - 1) {
          # skip over the missing entry
        } else {
          L[i, j] <-
            (mat[i, j] - L[i, 1:(j - 1)] %*% L[j, 1:(j - 1)]) / L[j, j]
        }
      }
    }
  }
  return(L)
}

chol2angle <- function(chol_mat, missing = TRUE) {
  n <- nrow(chol_mat)
  mat <- diag(0, n)
  mat[, 1] <- acos(chol_mat[, 1])
  for (i in 3:n) {
    for (j in 2:(i - 1)) {
      # invert angle2cor(): divide by the product of sines of the
      # previously recovered angles in this row
      mat[i, j] <- acos(chol_mat[i, j] / prod(sin(mat[i, 1:(j - 1)])))
    }
  }
  if (missing == TRUE) mat[n, n - 1] <- 0
  return(mat)
}

angle2cor <- function(angle_mat){
  n <- nrow(angle_mat)
  inv_mat <- diag(0, n)
  inv_mat[, 1] <- cos(angle_mat[, 1])
  for (i in 2:n) {
    inv_mat[i, i] <- prod(sin(angle_mat[i, 1:(i - 1)]))
    if (i > 2) {
      for (j in 2:(i - 1)) {
        inv_mat[i, j] <-
          cos(angle_mat[i, j]) * prod(sin(angle_mat[i, 1:(j - 1)]))
      }
    }
  }
  return(inv_mat)
}

get_bounds_of_corr_mat <- function(mat) {
  test_matrix <- cor_swap(mat)
  demo_chol <- chol_mis(test_matrix)
  angle_matrix <- chol2angle(demo_chol)
  n <- nrow(angle_matrix)
  angle_matrix[n, n - 1] <- 0
  ub <- tcrossprod(angle2cor(angle_matrix))[n, n - 1]
  angle_matrix[n, n - 1] <- pi
  lb <- tcrossprod(angle2cor(angle_matrix))[n, n - 1]
  print(mat)
  print("upper and lower bounds of missing value")
  print(c(ub, lb))
}

get_bounds_of_corr_mat(demo_matrix)

Let’s say that I mess with the row 1, column 4 element:

a <- c(-0.8, -0.1, 0, 0.1, 0.8)
for (i in 1:5) {
  change_ele <- a[i]
  demo_matrix[1, 4] <- change_ele
  demo_matrix[4, 1] <- change_ele
  get_bounds_of_corr_mat(demo_matrix)
}
[1] "upper and lower bounds of missing value"
[1]  0.3436571 -0.3600846
[1] "upper and lower bounds of missing value"
[1]  0.9187720 -0.4532005
[1] "upper and lower bounds of missing value"
[1]  0.9532005 -0.4187720
[1] "upper and lower bounds of missing value"
[1]  0.9785421 -0.3752566
[1] "upper and lower bounds of missing value"
[1] 0.6999244 0.3853601

Clearly the elements already in the matrix force bounds on the remaining element in order to stay positive definite.


That’s a really neat function @spinkney. What caught me flat-footed is that I had always assumed that by cranking down \eta in the LKJ prior, we could force the matrix into regions where we expect to see large correlations, sparsely. As you point out, these large correlations would almost necessarily be sparse, but I didn’t realize that there is no possible LKJ prior that would see them as even remotely plausible. For example, there’s nothing I can do with an LKJ prior to have a chance of getting a single correlation above 0.7 in a 50 x 50 correlation matrix.

library(cmdstanr)
file <- write_stan_file("
functions {
  matrix lkj_corr_exposed_rng(int m, real eta) {
    return lkj_corr_rng(m, eta);
  }
}
")
mod <- cmdstan_model(file, compile_standalone = TRUE)
mod$expose_functions(global = TRUE)

element12 <- element_max <- numeric(10000)
for (i in 1:10000) {
  lkj <- lkj_corr_exposed_rng(50, 1e-10)
  element12[i] <- lkj[1, 2]
  element_max[i] <- max(lkj[lower.tri(lkj)])
}
hist(element12, xlim = c(-1, 1))
hist(element_max, xlim = c(-1, 1))

I’m sailing beyond the limits of my expertise here, but is the inability to get a large magnitude for a bivariate correlation in a large correlation matrix related to the piranha problem in any way?

That’s also what I thought would happen when I tried it because that’s what happens with a Dirichlet :-).

You have to be careful in how you frame this. We can have a correlation matrix where all the correlations have high values. (We cannot have matrices with all high negative correlations, though.) For example, you can generate a sequence of vectors y_n \in \mathbb{R}^D for large D where all of the components of y_n are highly correlated with each other. Just generate this way:

import numpy as np

N, D = 1000, 50  # illustrative sizes; any large D works
y = np.empty((N, D))
for n in range(N):
    z = np.random.normal(0, 100)       # shared component
    for d in range(D):
        y[n, d] = z + np.random.normal(0, 0.001)  # tiny idiosyncratic noise
Omega = np.corrcoef(y, rowvar=False)   # D x D sample correlation matrix

Here you’ll get a correlation matrix where almost all the correlations are very close to 1.

I think what is more challenging is to define a random process where the expected correlation matrix is the identity but where correlation matrices with high correlations are not too far out in the tails. At least I think that’s the problem. I’m also sailing beyond my expertise here!