Sparse sampling for a Cholesky correlation matrix (or modelling correlations among some, but not all, parameters)

I’m using Stan for a hierarchical model, and I have a quick question about initializing and sampling a Cholesky correlation matrix when I’m interested in subject-level correlations among some, but not all, parameters.

For example, let’s say I have a model with three fixed effects, but it does not make sense, given knowledge of the system, to model the correlation between the second and third parameters.

In this simple example, I’ve learned that the following pseudo-code does not work, because lkj_corr_cholesky is a distribution over entire Cholesky factors (matrices) and cannot be applied to individual scalar elements:

parameters {
  .... 
  .... 
  cholesky_factor_corr[3] C_f_c;
  .... 
} 

model {
  .... 
  .... 
  C_f_c[2,1] ~ lkj_corr_cholesky(2);
  C_f_c[3,1] ~ lkj_corr_cholesky(2);
  // but crucially not:  C_f_c[3,2] ~ lkj_corr_cholesky(2);
  .... 
} 

It would be very helpful if I could get some suggestions for this kind of sparse sampling of a Cholesky correlation matrix.

The output from lkj_corr_cholesky is not a correlation matrix, but only the Cholesky factor of a correlation matrix. Apart from its first column, the elements of the Cholesky factor are not the correlations themselves, so constraining an element such as (3,2) of the factor does not directly constrain the correlation between your 2nd and 3rd variables.
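To make this concrete: for a 3×3 correlation matrix \mathbf{M} with off-diagonal entries \rho_{21}, \rho_{31}, \rho_{32}, solving \mathbf{M} = \mathbf{L}\mathbf{L}^T gives

\mathbf{L} = \begin{pmatrix} 1 & 0 & 0 \\ \rho_{21} & \sqrt{1-\rho_{21}^2} & 0 \\ \rho_{31} & \frac{\rho_{32}-\rho_{21}\rho_{31}}{\sqrt{1-\rho_{21}^2}} & L_{33} \end{pmatrix},

with L_{33} set so that the last row has unit norm. The first column happens to reproduce \rho_{21} and \rho_{31} directly, but L_{32} mixes all three correlations, so zeroing an element of the factor is not the same as zeroing the corresponding correlation.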

If you know the structure of your correlation matrix already, then you may wish to avoid setting a prior for the matrix as a whole and just set a prior for each of its unknown elements. Maybe do something like the following?

parameters {
   // only the two correlations of interest are free parameters;
   // the (2,3) correlation is fixed at zero by construction below
   real<lower=-1.0, upper=1.0> rho12;
   real<lower=-1.0, upper=1.0> rho13;
}

model {
   // local variables: built on the fly, not saved in the output
   matrix[3,3] my_corr_mat;
   matrix[3,3] my_corr_mat_chol;

   // unit diagonal of a correlation matrix
   for (i in 1:3) {
      my_corr_mat[i,i] = 1.0;
   }

   // upper triangle: the free correlations plus the structural zero
   my_corr_mat[1,2] = rho12;
   my_corr_mat[1,3] = rho13;
   my_corr_mat[2,3] = 0.0;

   // mirror into the lower triangle to keep the matrix symmetric
   my_corr_mat[2,1] = my_corr_mat[1,2];
   my_corr_mat[3,1] = my_corr_mat[1,3];
   my_corr_mat[3,2] = my_corr_mat[2,3];

   my_corr_mat_chol = cholesky_decompose(my_corr_mat);
}
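As written, rho12 and rho13 get implicit uniform(-1, 1) priors, and both matrices are local variables in the model block, so they are recomputed on the fly and not saved in the fitted object; my_corr_mat_chol would typically be consumed in the same block, e.g. by multi_normal_cholesky. Note also that not every (rho12, rho13) pair yields a valid correlation matrix here, a point that comes up again below.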

Thank you so much for your suggestion, jjramsey.

I like your suggestion a lot. Do you think it is possible to do:

cholesky_factor_corr[3] my_corr_mat_chol;

instead of:

matrix[3,3] my_corr_mat_chol;

This might be a trivial point, but I was wondering if cholesky_factor_corr[n] might be lighter-weight than matrix[n,n].

Thank you,

If you want to use cholesky_factor_corr as the type for my_corr_mat_chol, then you’ll have to declare and initialize my_corr_mat_chol in the transformed parameters block instead of the model block. That will also cause RStan, PyStan, etc., to save samples of my_corr_mat_chol in their respective fit objects, which may not be what you want.
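For example (a sketch, with rho12 and rho13 declared as in my earlier post):

transformed parameters {
   // start from the identity so the structural zero at (2,3) is already in place
   matrix[3,3] my_corr_mat = diag_matrix(rep_vector(1.0, 3));
   cholesky_factor_corr[3] my_corr_mat_chol;
   my_corr_mat[1,2] = rho12;
   my_corr_mat[2,1] = rho12;
   my_corr_mat[1,3] = rho13;
   my_corr_mat[3,1] = rho13;
   // Stan validates the cholesky_factor_corr constraint at the end of this block
   my_corr_mat_chol = cholesky_decompose(my_corr_mat);
}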

I don’t believe this is the case.

Thank you for your input.

Just a quick follow-up on this topic: I have been finding that initialising the sparse correlation matrix myself leads to an initialization error:

Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: cholesky_decompose: Matrix m is not positive definite

I’m guessing this is related to the numerical problem with cholesky_decompose that has been pointed out previously (https://discourse.mc-stan.org/t/cholesky-factor-corr-initializing-with-invalid-values/3167/6), but I wonder if there is any way around this.

That’s a possibility, but you should check to make sure that your correlation matrix actually is positive definite, especially since there’s a significant potential for error if you’re constructing that matrix manually. You might want to use a symbolic computation tool (like SymPy or Maxima) to find the inverse or eigenvalues of your correlation matrix to see if it’s reasonable.
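For the 3×3 matrix constructed earlier in this thread, the check can even be done by hand: the leading principal minors are

1, \quad 1 - \rho_{12}^2, \quad \det\mathbf{M} = 1 - \rho_{12}^2 - \rho_{13}^2,

so the matrix is positive definite exactly when \rho_{12}^2 + \rho_{13}^2 < 1. Initial values drawn independently for rho12 and rho13 can land outside that disc, which produces exactly the rejection message above.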

I deal with some of these issues on page 10 of https://www.researchgate.net/publication/324093594_Hierarchical_Bayesian_Continuous_Time_Dynamic_Modeling. Maybe it’s helpful :)


If we set priors for each of the unknown elements instead of the whole correlation matrix, is it possible that we will run into a non-positive-definite matrix during sampling? How can we avoid this?


The usual approach would be to set the priors on individual elements that get combined together in a way that ensures positive definiteness, like in the approach I linked to.
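For the 3×3 example earlier in this thread, here is a minimal sketch of that idea (my own construction, not the one from the linked paper): bound rho13 so that the determinant condition derived above always holds, and build the Cholesky factor in closed form so that cholesky_decompose is never called.

parameters {
   real<lower=-1, upper=1> rho12;
   // bounding rho13 enforces rho12^2 + rho13^2 < 1, the positive-definiteness
   // condition for this sparse structure, so every draw is valid
   real<lower=-sqrt(1 - square(rho12)),
        upper=sqrt(1 - square(rho12))> rho13;
}

transformed parameters {
   // closed-form Cholesky factor of the correlation matrix with rho23 = 0
   matrix[3,3] my_corr_mat_chol = rep_matrix(0.0, 3, 3);
   my_corr_mat_chol[1,1] = 1;
   my_corr_mat_chol[2,1] = rho12;
   my_corr_mat_chol[2,2] = sqrt(1 - square(rho12));
   my_corr_mat_chol[3,1] = rho13;
   // chosen so that row 3 times row 2 gives a zero (3,2) correlation
   my_corr_mat_chol[3,2] = -rho12 * rho13 / my_corr_mat_chol[2,2];
   my_corr_mat_chol[3,3] = sqrt(1 - square(rho13) - square(my_corr_mat_chol[3,2]));
}

One caveat: the implied prior is uniform on rho12 but, conditional on it, uniform only over the allowed range of rho13, so it is proper but not uniform over the disc; add explicit priors if that matters.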

Thank you, jjramsey, Charles_Driver and Bobby for a thought-provoking discussion.

I’m afraid I was confused earlier, and the original mock example I presented was unnecessarily complicated for what I needed. So please let me rephrase my question: I’m interested in correlations among some, but not all, of the fixed-effect parameters in a hierarchical model. I’m realizing now that this does not mean I need a sparse matrix. Rather, I think I just need a correlation matrix whose dimension is smaller than the number of fixed-effect parameters.

Please imagine a different mock example where I have 4 fixed effects: theta1, theta2, theta3 and theta4. However, I’m only interested in modelling the correlations among the first 3 parameters, given knowledge of the system. In this case, I am thinking of writing the following (pseudo-)Stan program. I would be very grateful if you could comment on whether it is okay to use a smaller matrix for the Cholesky decomposition and to model the subject-level effect of theta4 independently, as I have coded below.

parameters {
  .... 
  vector<lower=0>[4] sigma_u; // subject-level SDs for all four effects
  cholesky_factor_corr[3] my_corr_mat_chol; // correlations among the first three only
  matrix[3,J] z_u; // standardized deviations, where J is the number of subjects

  vector[J] u_4; // subject-level deviation for the fourth parameter
  .... 
} 
transformed parameters {
  .... 
  matrix[3,J] u_123; // subject-level deviations for the first three parameters
  vector[4] theta[J];
  // non-centered parameterization; only the first three SDs enter the 3x3 factor
  u_123 = diag_pre_multiply(head(sigma_u, 3), my_corr_mat_chol) * z_u;

  for (j in 1:J) {
    theta[j][1] = theta1 + u_123[1,j];
    theta[j][2] = theta2 + u_123[2,j];
    theta[j][3] = theta3 + u_123[3,j];
    theta[j][4] = theta4 + u_4[j];
  }
  .... 
} 

model {
  .... 
  my_corr_mat_chol ~ lkj_corr_cholesky(2.0);
  u_4 ~ normal(0, sigma_u[4]); // fourth deviation modelled independently
  to_vector(z_u) ~ normal(0, 1);

  target += sum(map_rect(my_objective_function, phi, theta, x_r, x_i));
} 
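For reference, the implied correlation matrix among the first three effects can be recovered afterwards, e.g. in generated quantities with Stan’s multiply_lower_tri_self_transpose:

generated quantities {
  // implied 3x3 correlation matrix among the first three effects
  matrix[3,3] my_corr_mat = multiply_lower_tri_self_transpose(my_corr_mat_chol);
}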

Thank you.

Valid syntax, possibly dubious restriction that the fourth is uncorrelated with the previous three.

Thank you, bgoodri.

Hi @jjramsey,

I am reading this thread - out of curiosity. Could you please suggest a book in which this Cholesky story is nicely explained (to someone who has read the Bishop book)?

I would like to learn enough basic background so that I can understand the main points of this thread. (I already have an OK background, but this Cholesky story is a bit new to me, and I would rather read a good book than try to reverse-engineer Wikipedia.) Many thanks in advance.

Cheers,

Jozsef


Not sure what you mean by “Cholesky story”. For my purposes, the Cholesky decomposition is pretty much a “black box” algorithm that takes in a matrix \mathbf{M} and finds a lower-triangular matrix \mathbf{L} that satisfies \mathbf{M} = \mathbf{L}\mathbf{L}^T. The reason that’s of interest to me is that a multivariate normal vector \mathbf{x} can be expressed as \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\xi}, where the elements of \boldsymbol{\xi} are independently sampled from a normal distribution with zero mean and unit variance, and the covariance of \mathbf{x} is then \mathbf{L}\mathbf{L}^T. This is mentioned in the Stan 2.18 User Guide, in the subsection “Multivariate Reparameterizations.”
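The key step, for reference: since \operatorname{Cov}(\boldsymbol{\xi}) = \mathbf{I},

\operatorname{Cov}(\boldsymbol{\mu} + \mathbf{L}\boldsymbol{\xi}) = \mathbf{L}\,\operatorname{Cov}(\boldsymbol{\xi})\,\mathbf{L}^T = \mathbf{L}\mathbf{L}^T = \mathbf{M}.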

Wow, thanks @jjramsey, I think you managed to give me a good intuition about what this “cholesky business” is about. I still need to write this down and/or think over how it relates to the log likelihood, but I don’t think that will be difficult. I think you have just explained the most important part of this “story” to me, thanks!

Jozsef