Estimating correlation matrix for ordinal outcomes


The approach I have tried so far is based on the assumption that the observed ordinal data y is generated from an underlying latent multivariate normal distribution. Hence, the Stan model simulates a multivariate normal distribution (using a LKJ prior) and uses an ordered probit link to model the ordinal outcomes.
Unfortunately, I cannot recover the correct correlation matrix from simulated data using the following model:

functions {
  // for probit approach
  vector calc_theta(real eta, vector cp, int K) {
    vector[K] theta;
    theta[1] = 1 - Phi(eta - cp[1]);
    for (j in 2:(K-1))
      theta[j] = Phi(eta - cp[j-1]) - Phi(eta - cp[j]);
    theta[K] = Phi(eta - cp[K-1]);
    return theta;

data {
  int N;        // number of cases
  int K;        // number of variables
  int nlevels;  // number of levels for oridnal outcomes
  int y[N,K];   // responses on ordinal scale

transformed data {
  vector[K] zeros;                // mean of underlying multivariate normal
  vector[K] L_sigma;           // sd of underlying multivariate normal
  zeros = rep_vector(0,K);
  L_sigma = rep_vector(1,K);

parameters {
  cholesky_factor_corr[K] L_Omega;
  vector[K] y_cont[N];              // continuous "trait" assumed to underly
                                               // responses on ordinal scale    
  ordered[nlevels - 1] cp[K];    // cut points for ordered probit

model {
  L_Omega ~ lkj_corr_cholesky(2);
  y_cont ~ multi_normal_cholesky(zeros,L_Omega);
  for (k in 1:K) {
    cp[k] ~ normal(0,3);
    for (n in 1:N) {
      y[n][k] ~ categorical(calc_theta(y_cont[n][k], cp[k], nlevels));

generated quantities {
  corr_matrix[K] Omega;
  Omega = multiply_lower_tri_self_transpose(L_Omega);

Instead, I get the warning that the “estimated Bayesian Fraction of Missing Information (bfmi) was low”. The entries in the Cholesky factor of the correlation matrix are negatively correlated with the energy__ (strongest r = -.36). I also found that large correlations between data-columns resulted in highly correlated posteriors of the simulated continuous traits underlying these variables. However, these checks did not provide me with insights about how I should change the model.

I do not think that what is going on here is shrinkage towards zero for off-diagonal values of the correlation matrix due to LKJ priors, because I can estimate correlations for continuous variables just fine when I use the same prior.

Does anybody have an idea what I am doing wrong here?

Thanks in advance!

Not a reply, but a question if @bgoodri maybe can have a look?

I also looked a bit more into the problem, and what I found was that when I simply simulate data from a multivariate normal using

y_cont ~ multi_normal_cholesky(zeros,L_Omega);

the columns in y_cont have the expected zero means and sds of (when i submit L_Omega, the cholesky factor of a correlation matrix, as data).

However, if I add the probit-part (y[n][k] ~ categorical(calc_theta(y_cont[n][k], cp[k], nlevels)); to the model and try to estimate L_Omega, colums in y_cont no longer have a zero mean and a sd of 1.

Now I am wondering if the model is maybe not identified? I had expected that using the cholesky factor of a correlation matrix and zero means as input to multi_normal_cholesky should result in zero means and sds of 1 for the columns of y_cont, which then should identify the cut points.

This is not happening, and right now I do not know why.

For models like this, it isn’t y_cont | L_Omega but y_cont | L_Omega, y. In other words, you have to constrain the latent values to be in the correct intervals given the observed outcomes, and, in your case, the cutpoints. It is tedious. There is an example in the multivariate probit case at