Hi,
I am trying to use a multivariate normal to describe a joint distribution of a set of variables to compute and recover the correlation matrix, as opposed to running individual regressions to determine their associations. The data used are comprised of 21 academic grades and scores taken at different points in time from a single cohort of around 120 subjects: some subsets/covariates of the data have a positive strong correlation (~0.6) because are generated via similar testing conditions, while others are left-skewed and censored, e.g. GPA = [3.7, 4] for over 90% of population.
I am trying to follow a procedure similar to the one presented in this entry from Kruschke’s blog, which itself is based on Gelman & Hill (2007) chapters on correlation and covariance matrices. Additionally and more formally, I am following this reference’s technical appendix, which uses MCMC as part of a Structural Equation Model to compute correlations and means of left-skewed censored data, as is the case of some variables in my dataset. I am trying to translate this into Stan to hopefully perform some more complex modeling.
As a starting point, I followed the recommendations from section 1.13 from the manual and I computed the correlation matrix using with LKJ(1) prior instead of the Inv-Wishart (as in Kruschke’s blog). I am okay assuming the uniformity on the correlation matrix since this is the beginning of future improvements and my first raw Stan model.
I have a couple of general questions regarding this problem. First, I am trying to recover the correlation matrix directly from the Stan program, and second, a more conceptual question regarding the actual treatment of censored and skewed data for this type of computation.
I have successfully fitted a model that works:
// The input data is a matrix 'y' of dimensions N x M.
data {
int<lower=0> N; // number of subjects in the cohort
int<lower=0> M; // number of variables
vector[M] zy[N]; // matrix with standardized variables
}
parameters {
vector[M] mu;
vector<lower=0>[M] tau;
cholesky_factor_corr[M] L_Omega;
}
model {
mu ~ normal(0, 2);
tau ~ cauchy(0, 2.5);
L_Omega ~ lkj_corr_cholesky(1);
zy ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}
This works fine. The problem comes when trying to recover Omega
as L_Omega*L_omega'
or, Sigma = diag_pre_multiply(tau, L_Omega)
. I am getting the following warning messages for divergent transitions 1k to 2k; ESS warnings and R-hat ~ 3-4.
I have tried to use two methods for recovery: generated quantities
and transformed parameters
blocks. First:
transformed parameters {
matrix[M, M] Sigma = diag_pre_multiply(tau, L_Omega);
}
model {
mu ~ ...
tau ~ ...
L_Omega ~ lkj_corr_cholesky(1);
zy ~ multi_normal_cholesky(mu, Sigma);
}
Second:
model {
mu ~ ...
tau ~ ...
L_Omega ~ lkj_corr_cholesky(1);
zy ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}
generated quantities {
matrix[M, M] Omega = L_Omega * L_Omega';
}
- What is different about these last two variations that makes everything break?
- I am able to extract the mean of
L_Omega
from the fit in R and computeOmega <- L_Omega*Conj(t(L_Omega))
and numbers seem to work. However, I am seeing0.7 < rho[i,i] < 1
diagonal entries and that is why I would like to compute it directly in Stan. - I tried generating Sigma because the standardized covariance is the correlation matrix but everything breaks as well.
- Once I obtain the correct output for the correlation matrix, is there a way to recover the variable names as well? What is the default column order in Stan?
I have already computed the correlations using other packages in R and the numbers obtained from the working model seem to agree, except for the diagonals.
- Finally, I would like to accurately include the skewed and censored data in the model, so I am wondering if there is literature or suggestions as to how to do this when measuring the correlation?
I am running the model on my local machine and on a cluster and the warnings are present in both. I am certain this is related to a misunderstanding of the way I compute the correlation overall.