I’m using Stan for a hierarchical model and I have a quick question about initializing and sampling a Cholesky-factored correlation matrix when I’m interested in subject-level correlations of some, but not all, parameters.

For example, let’s say I have a model with three fixed effects, but given our knowledge of the system it does not make sense to model the correlation between the second and third parameter.

In this simple example, I’ve learned that the following pseudo-code does not work because the return type of lkj_corr_cholesky(real) is matrix:

The output from lkj_corr_cholesky is not a correlation matrix, but only the Cholesky factor of a correlation matrix. That means that element (2,1) of the Cholesky factor is in general not directly indicative of any correlation between your 2nd and 1st variables.
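A minimal numeric illustration of this point (plain Python, no external libraries; the matrix is made up for the example): entry (2,1) of the Cholesky factor differs from the corresponding correlation, even though multiplying the factor by its transpose recovers the correlation matrix exactly.

```python
import math

def cholesky(M):
    """Cholesky-Banachiewicz algorithm: lower-triangular L with L L^T = M."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L

# A valid 3x3 correlation matrix; the correlation between variables 3 and 2 is 0.2.
R = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.0]]

L = cholesky(R)
print(round(L[2][1], 4))   # ~0.0577 -- NOT the correlation 0.2

# ...but row 2 of L dotted with row 1 of L recovers the correlation:
recon = sum(L[2][k] * L[1][k] for k in range(3))
print(round(recon, 4))     # 0.2
```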

If you know the structure of your correlation matrix already, then you may wish to avoid setting a prior for the matrix as a whole and just set a prior for each of its unknown elements. Maybe do something like the following?

If you want to use cholesky_factor_corr as the type for my_corr_mat_chol, then you’ll have to declare and initialize my_corr_mat_chol in the transformed parameters block instead of the model block. That will also cause RStan, PyStan, etc., to save samples of my_corr_mat_chol in their respective fit objects, which may not be what you want.

That’s a possibility, but you should check to make sure that your correlation matrix actually is positive definite, especially since there’s a significant potential for error if you’re constructing that matrix manually. You might want to use a symbolic computation tool (like SymPy or Maxima) to find the inverse or eigenvalues of your correlation matrix to see if it’s reasonable.
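As a quick alternative to a symbolic tool, one cheap numerical check is simply to attempt a Cholesky factorization: it succeeds exactly when the matrix is positive definite. A sketch in plain Python (the example matrices are made up; note that every pairwise entry of the first one is a perfectly legal correlation, yet the matrix as a whole is not positive definite):

```python
import math

def is_positive_definite(M):
    """Attempt a Cholesky factorization; a non-positive pivot means not PD."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = M[i][i] - s
                if pivot <= 0.0:
                    return False
                L[i][i] = math.sqrt(pivot)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return True

# Each entry lies in [-1, 1], but the matrix is NOT positive definite:
bad = [[1.0, 0.9, 0.0],
       [0.9, 1.0, 0.9],
       [0.0, 0.9, 1.0]]
print(is_positive_definite(bad))   # False

ok = [[1.0, 0.5, 0.3],
      [0.5, 1.0, 0.2],
      [0.3, 0.2, 1.0]]
print(is_positive_definite(ok))    # True
```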

If we set priors for each of the unknown elements instead of the whole correlation matrix, is it possible that we will encounter the nonpositive definite problem during the sampling? How can we avoid this?

The usual approach would be to set the priors on individual elements that get combined together in a way that ensures positive definiteness, like in the approach I linked to.
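To make the idea concrete, here is one generic construction of this kind (a sketch, not necessarily the exact approach linked above): place the free parameters in a lower-triangular factor whose rows have unit length, then form R = L L^T. The result automatically has a unit diagonal and is positive semi-definite, so no separate PD check is needed. The function name and parameterization here are invented for illustration.

```python
import math

def corr_from_factor_entries(r21, r31, l32):
    """Build a 3x3 correlation matrix from free Cholesky-factor entries.

    r21, r31 are the correlations of variables 2 and 3 with variable 1;
    l32 is a free entry that must satisfy r31**2 + l32**2 < 1.
    Each row of L has unit norm, so diag(L L^T) = 1 by construction.
    """
    L = [[1.0, 0.0, 0.0],
         [r21, math.sqrt(1.0 - r21 ** 2), 0.0],
         [r31, l32, math.sqrt(1.0 - r31 ** 2 - l32 ** 2)]]
    # Return R = L L^T.
    return [[sum(L[i][k] * L[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

R = corr_from_factor_entries(0.5, 0.3, 0.2)
print([round(R[i][i], 6) for i in range(3)])   # unit diagonal
print(round(R[0][1], 4), round(R[0][2], 4))    # 0.5 0.3 recovered exactly
```

This is essentially what Stan's `cholesky_factor_corr` type does internally: sampling unconstrained entries and mapping them to a factor with unit-norm rows, so every draw corresponds to a valid correlation matrix.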

I’m afraid I was confused earlier, and the original mock example I presented was unnecessarily complicated for what I needed. Let me rephrase: I’m interested in correlations among some, but not all, of the fixed-effect parameters in a hierarchical model. I’m realizing now that this does not mean I need a sparse matrix. Rather, I think I just need a correlation matrix whose dimension is smaller than the number of fixed-effect parameters.

Please imagine a different mock example where I have 4 fixed effects: theta1, theta2, theta3 and theta4. However, given the knowledge of the system, I’m only interested in modelling the correlations among the first 3 parameters. In this case, I am thinking of writing the following (pseudo-)Stan programme. I would be very grateful if you could comment on whether it is okay to use a smaller matrix for the Cholesky decomposition and model the subject-level effect of theta4 independently, as I have coded below.

parameters {
  ....
  vector<lower=0>[4] sigma_u;               // subject-level SDs
  cholesky_factor_corr[3] my_corr_mat_chol;
  matrix[3, J] z_u;                         // J is the number of subjects
  vector[J] u_4;                            // subject-level deviation for the fourth parameter
  ....
}
transformed parameters {
  ....
  matrix[3, J] u_123;                       // subject-level deviations for the first three parameters
  vector[4] theta[J];
  // only the first three SDs scale the correlated block
  u_123 = diag_pre_multiply(head(sigma_u, 3), my_corr_mat_chol) * z_u;
  for (j in 1:J) {
    theta[j][1] = theta1 + u_123[1, j];
    theta[j][2] = theta2 + u_123[2, j];
    theta[j][3] = theta3 + u_123[3, j];
    theta[j][4] = theta4 + u_4[j];
  }
  ....
}
model {
  ....
  my_corr_mat_chol ~ lkj_corr_cholesky(2.0);
  u_4 ~ normal(0, sigma_u[4]);
  to_vector(z_u) ~ normal(0, 1);
  target += map_rect(my_objective_function, phi, theta, x_r, x_i);
}

I am reading this thread - out of curiosity. Could you please suggest a book in which this Cholesky story is nicely explained (to someone who has read the Bishop book)?

I would like to learn enough basic background so that I can understand the main points of this thread. (I already have an OK background, but this Cholesky story is a bit new to me, and I would rather read a good book than try to reverse-engineer Wikipedia.) Many thanks in advance.

Not sure what you mean by “Cholesky story”. For my purposes, the Cholesky decomposition is pretty much a “black box” algorithm that takes in a matrix \mathbf{M} and finds a lower-triangular matrix \mathbf{L} that satisfies \mathbf{M} = \mathbf{L}\mathbf{L}^T. The reason that’s of interest to me is that a multivariate normal vector \mathbf{x} can be expressed as \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\xi}, where the elements of \boldsymbol{\xi} are independently sampled from a normal distribution with zero mean and unit variance, and the covariance of \mathbf{x} is then \mathbf{L}\mathbf{L}^T. This is mentioned in the Stan 2.18 User Guide, in the subsection “Multivariate Reparameterizations.”
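That reparameterization is easy to verify numerically. A small self-contained sketch (plain Python; the 2x2 covariance matrix and mean are made up for the example): draw independent standard normals \boldsymbol{\xi}, transform them as \boldsymbol{\mu} + \mathbf{L}\boldsymbol{\xi}, and check that the empirical covariance approaches \mathbf{L}\mathbf{L}^T = \mathbf{M}.

```python
import math
import random

random.seed(1)

M = [[2.0, 0.6],
     [0.6, 1.0]]              # target covariance matrix
mu = [1.0, -1.0]              # target mean

# Cholesky factor of a 2x2 matrix, written out by hand:
l11 = math.sqrt(M[0][0])
l21 = M[1][0] / l11
l22 = math.sqrt(M[1][1] - l21 ** 2)

# Draw xi ~ N(0, I) and transform: x = mu + L xi
N = 50_000
xs = []
for _ in range(N):
    xi = (random.gauss(0, 1), random.gauss(0, 1))
    xs.append([mu[0] + l11 * xi[0],
               mu[1] + l21 * xi[0] + l22 * xi[1]])

mean = [sum(x[d] for x in xs) / N for d in range(2)]
cov01 = sum((x[0] - mean[0]) * (x[1] - mean[1]) for x in xs) / N
print(round(cov01, 2))        # close to M[0][1] = 0.6
```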

Wow, thanks @jjramsey, I think you managed to give me a good intuition about what this “Cholesky business” is about. I still need to write this down and/or think over how this relates to the log likelihood, but I don’t think that will be difficult. I think you have just explained to me the most important part of this “story”, thanks!