 Recovering correlation matrix from Multivariate Normal latent model

Hi,

I am trying to use a multivariate normal to describe a joint distribution of a set of variables to compute and recover the correlation matrix, as opposed to running individual regressions to determine their associations. The data used are comprised of 21 academic grades and scores taken at different points in time from a single cohort of around 120 subjects: some subsets/covariates of the data have a positive strong correlation (~0.6) because are generated via similar testing conditions, while others are left-skewed and censored, e.g. GPA = [3.7, 4] for over 90% of population.

I am trying to follow a procedure similar to the one presented in this entry from Kruschke’s blog, which itself is based on Gelman & Hill (2007) chapters on correlation and covariance matrices. Additionally and more formally, I am following this reference’s technical appendix, which uses MCMC as part of a Structural Equation Model to compute correlations and means of left-skewed censored data, as is the case of some variables in my dataset. I am trying to translate this into Stan to hopefully perform some more complex modeling.

As a starting point, I followed the recommendations from section 1.13 from the manual and I computed the correlation matrix using with LKJ(1) prior instead of the Inv-Wishart (as in Kruschke’s blog). I am okay assuming the uniformity on the correlation matrix since this is the beginning of future improvements and my first raw Stan model.

I have a couple of general questions regarding this problem. First, I am trying to recover the correlation matrix directly from the Stan program, and second, a more conceptual question regarding the actual treatment of censored and skewed data for this type of computation.

I have successfully fitted a model that works:

// The input data is a matrix 'y' of dimensions N x M.
data {
int<lower=0> N; // number of subjects in the cohort
int<lower=0> M; // number of variables
vector[M] zy[N]; // matrix with standardized variables
}

parameters {
vector[M] mu;
vector<lower=0>[M] tau;
cholesky_factor_corr[M] L_Omega;
}

model {
mu ~ normal(0, 2);
tau ~ cauchy(0, 2.5);
L_Omega ~ lkj_corr_cholesky(1);
zy ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}

This works fine. The problem comes when trying to recover Omega as L_Omega*L_omega' or, Sigma = diag_pre_multiply(tau, L_Omega). I am getting the following warning messages for divergent transitions 1k to 2k; ESS warnings and R-hat ~ 3-4.

I have tried to use two methods for recovery: generated quantities and transformed parameters blocks. First:

transformed parameters {
matrix[M, M] Sigma = diag_pre_multiply(tau, L_Omega);
}

model {
mu ~ ...
tau ~ ...
L_Omega ~ lkj_corr_cholesky(1);
zy ~ multi_normal_cholesky(mu, Sigma);
}

Second:

model {
mu ~ ...
tau ~ ...
L_Omega ~ lkj_corr_cholesky(1);
zy ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}
generated quantities {
matrix[M, M] Omega = L_Omega * L_Omega';
}
• What is different about these last two variations that makes everything break?
• I am able to extract the mean of L_Omega from the fit in R and compute Omega <- L_Omega*Conj(t(L_Omega)) and numbers seem to work. However, I am seeing 0.7 < rho[i,i] < 1 diagonal entries and that is why I would like to compute it directly in Stan.
• I tried generating Sigma because the standardized covariance is the correlation matrix but everything breaks as well.
• Once I obtain the correct output for the correlation matrix, is there a way to recover the variable names as well? What is the default column order in Stan?

I have already computed the correlations using other packages in R and the numbers obtained from the working model seem to agree, except for the diagonals.

• Finally, I would like to accurately include the skewed and censored data in the model, so I am wondering if there is literature or suggestions as to how to do this when measuring the correlation?

I am running the model on my local machine and on a cluster and the warnings are present in both. I am certain this is related to a misunderstanding of the way I compute the correlation overall.

Perhaps I’m missing the point, but don’t you just want to reverse the cholesky factorisation in order to recover the correlation matrix?

1 Like

Yes! I am trying to do that, both in the generated quantities and the transformed parameters block. I also did it directly in R like this:

Omega <- apply(params\$L_Omega, 1, function(x) x %*% Conj(t(x))),

but wanted to confirm that it is correct by generating it directly from the Stan code, and this is when things break.

Thanks for your help!