Underestimating correlation coefficients with LKJ prior

Hi all

Apologies if this is more of a generic modelling question than a Stan question, but I couldn’t find much information elsewhere.

I want to estimate the correlation among different variables, effectively analogous to what is described here: http://www.sumsar.net/blog/2013/08/bayesian-estimation-of-correlation/. I’m certainly aware that there are easier ways to do this, but this is a much simplified example of what I eventually want to do.

To illustrate my question, consider the following model that I have fit using RStan 2.9 (I know, I’m slow to adopt), where I know the mean and SD:

load(url('https://dl.dropboxusercontent.com/s/j05l2gvgwy6chil/STANdata.Rdata'))
cormodel <- "
data{
  int<lower=0> N;
  matrix[18,N] y;
  vector<lower=0>[N] sdd;
  vector[N] muu;
}
parameters{
  cholesky_factor_corr[N] LOmega;
}
model{
  LOmega ~ lkj_corr_cholesky(2);
  for (i in 1:18)
   y[i] ~ multi_normal_cholesky(muu, diag_pre_multiply(sdd,LOmega));
}
generated quantities {
  matrix[N,N] Omega;
  Omega <- multiply_lower_tri_self_transpose(LOmega);
}
"
fit_model <- stan(model_code = cormodel, data = STANdata, pars = 'Omega',  chains = 3, iter = 1000, warmup = 500, thin = 4, refresh = 100, cores = 3, open_progress = T)

My problem is that this model consistently underestimates the correlations in the data. For example, median(extract(fit_model, 'Omega[5,6]')[[1]]) will often be about 0.85 when cor(STANdata$y)[5,6] = 0.96. The reason I say this might be a modelling rather than a Stan question is that I get very good (effectively spot on) estimates of the correlation when I reduce the data to a bivariate matrix as in the above blog post. But as I add variables, all of the correlation coefficients become smaller and tend to zero (the problem isn’t, of course, limited to the one correlation mentioned above). I have tried playing with priors etc., but those don’t seem to make a difference. I suspect that there is a rather obvious explanation that I’m missing.

Thanks in advance for any insights you can offer!

[edit: code markdown]

That is what is supposed to happen under an LKJ prior. It is easier to think about the case with a shape parameter of 1. When dealing with a 2x2 matrix, that implies the lone correlation has a uniform distribution between -1 and 1. Thus, if the data are consistent with a correlation of 0.96 or something really high, then the posterior distribution will be very skewed but mostly close to 1.

In general, when there are K variables, then the marginal distribution of a single correlation is Beta on the (-1,1) interval with both shape parameters equal to K / 2. Thus if K > 2, this is going to put zero prior density on the points -1 and 1 and have a mean / median / mode at zero. This shrinkage toward the identity matrix gets stronger as K gets larger.

So, if you think that there will be really large correlations among several variables, then the LKJ distribution is probably not the right choice. But we do not have anything better in general, although squared distance correlation functions and the like might work better in some cases.
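
To get a sense of how strong that shrinkage is, you can evaluate the shape-1 marginal density at a large correlation for a few values of K (a quick R sketch of the Beta(K/2, K/2) result above, nothing model-specific):

# Marginal prior density of a single correlation rho under a shape-1 LKJ prior
# with K variables: (rho + 1) / 2 ~ Beta(K / 2, K / 2), so divide by 2 for the Jacobian
lkj_marginal <- function(rho, K) dbeta((rho + 1) / 2, K / 2, K / 2) / 2
for (K in c(2, 5, 10, 20))
  cat("K =", K, " prior density at rho = 0.96:", signif(lkj_marginal(0.96, K), 3), "\n")

The prior density at 0.96 falls off rapidly as K grows, which is exactly the shrinkage being observed.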

Thanks so much for the quick and clear reply Ben. This is what I suspected, i.e. shrinkage toward the identity matrix is supposed to happen as K gets larger.

I guess the challenging thing is that, in my case, there are some large correlations among variables, so what is the better strategy? Are there streamlined examples of the squared distance approach in Stan?

Otherwise, I guess you are referring to using the squared_distance functions in Stan to derive the distance covariance and standard deviations (as described here: https://en.wikipedia.org/wiki/Distance_correlation)?

No, I meant like cov_exp_quad in Stan.
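
For reference, the basic usage looks something like this (only a fragment, and it assumes the variables can be indexed by some coordinate x such as time or location, which may or may not apply here):

data {
  int<lower=1> N;
  real x[N];              // hypothetical coordinates indexing the N variables
}
parameters {
  real<lower=0> alpha;    // marginal standard deviation
  real<lower=0> rho;      // length-scale
}
model {
  matrix[N, N] K;
  // K[i, j] = alpha^2 * exp(-0.5 * (x[i] - x[j])^2 / rho^2)
  K = cov_exp_quad(x, alpha, rho);
  ...
}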

Apologies for another reply but I thought others might be interested in this as well. It looks like in the above example that the scaled inverse-Wishart actually performs better… I didn’t even think to try it given the LKJ.

Inverse Wishart with small degrees of freedom does put more prior mass in corners of the parameter space. Usually, that is not a good thing substantively or computationally. But for a covariance matrix of fixed data, it might work decently.
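
For anyone wanting to try it, a minimal sketch with a plain (not scaled) inverse Wishart on the covariance, reusing N, y, and muu from the original data block; the N + 2 degrees of freedom and identity scale are arbitrary choices, and note that this estimates the variances rather than fixing them to sdd:

parameters {
  cov_matrix[N] Sigma;
}
model {
  // arbitrary, weakly informative choice of degrees of freedom and scale matrix
  Sigma ~ inv_wishart(N + 2, diag_matrix(rep_vector(1.0, N)));
  for (i in 1:18)
    y[i] ~ multi_normal(muu, Sigma);
}
generated quantities {
  matrix[N, N] Omega;   // implied correlation matrix
  for (i in 1:N)
    for (j in 1:N)
      Omega[i, j] = Sigma[i, j] / sqrt(Sigma[i, i] * Sigma[j, j]);
}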

How was the data generated? You only get guarantees of calibration when the model matches the data-generating process.
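
For example, a quick fake-data check in R along these lines (a sketch only; the true correlation matrix here is made up, and mvtnorm is assumed to be available):

# Simulate 18 observations (as in the original data) from a known correlation matrix
require(mvtnorm)
K <- 5
R_true <- matrix(0.3, K, K)
diag(R_true) <- 1
R_true[1, 2] <- R_true[2, 1] <- 0.96       # one very strong correlation
y_sim <- rmvnorm(18, mean = rep(0, K), sigma = R_true)
cor(y_sim)[1, 2]                           # sample correlation, to compare against the posterior

If the model is fit to data simulated this way, the posterior for Omega[1,2] can be compared directly with the value used to generate it.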

I always tell my co-developers there are people like you. The reasons for upgrading are improved efficiency and robustness, and we keep adding new functions and capabilities. The reasons not to upgrade are that it can be painful and that the fear of new bugs can outweigh the fear of existing bugs.

Declare

data {
  vector[N] y[18];
...
model {
  y ~ multi_normal_cholesky(...);

It vectorizes so that you only have to do the diagonal pre-multiplication and factorization once. This is a huge speedup, especially with large N.

You can also use LOmega ~ lkj_corr_cholesky(1), which is a uniform prior on LOmega. You can even go below 1.

Same issue about vectorization with Wisharts, but we don’t have an efficient Cholesky-factor parameterization for those as we tend not to use them.
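
Putting those suggestions together, a sketch of the vectorized version of the original model (keeping the known muu and sdd, declaring y as an array of 18 length-N vectors rather than a matrix, and keeping the old <- assignment for older RStan):

data {
  int<lower=0> N;
  vector[N] y[18];          // 18 observations of N variables
  vector<lower=0>[N] sdd;
  vector[N] muu;
}
parameters {
  cholesky_factor_corr[N] LOmega;
}
model {
  LOmega ~ lkj_corr_cholesky(2);
  // vectorized: the diagonal pre-multiplication happens once, not once per observation
  y ~ multi_normal_cholesky(muu, diag_pre_multiply(sdd, LOmega));
}
generated quantities {
  matrix[N,N] Omega;
  Omega <- multiply_lower_tri_self_transpose(LOmega);
}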

Ben, those were really useful comments above.

A common problem I deal with is that the first eigenvalue of the sample correlation matrix will be very large relative to the rest, much larger than random matrix theory would predict. One solution I’ve tried is using a factor structure for the correlation, but it’s not a panacea. I’ve tried the LKJ distribution on this type of matrix, but it really struggles for me when the number of variables increases. One thing I noticed was that the determinant of the sample correlation matrix quickly becomes very small as I add more variables to it, about 10^-12 by the end. I’m not sure how significant that is when looking at the formula for the LKJ distribution in the Stan documentation.
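
For what it’s worth, here is a toy illustration of that determinant point with a K x K equicorrelation matrix (not my actual data, and rho = 0.9 is just a stand-in for “highly correlated”):

# Determinant and leading eigenvalue of an equicorrelation matrix as K grows
rho <- 0.9
for (K in c(2, 5, 10, 20)) {
  R <- matrix(rho, K, K)
  diag(R) <- 1
  cat("K =", K, " det =", det(R), " max eigenvalue =", max(eigen(R)$values), "\n")
}

The determinant collapses toward zero while the leading eigenvalue keeps growing, much like what I see with the sample correlation matrix.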

Anyway, with respect to your point above about how the LKJ prior shrinks an individual correlation to 0, would there be any way to shrink it towards a specific value instead? For instance, toward some value c, which may be 0.5 or -0.25 or whatever. It looks like the LKJ prior is based on a Beta(alpha, alpha) distribution that is then rescaled to the (-1,1) interval (2 * Beta - 1, I take it). So if I want a mean correlation of c, that corresponds to a mean of C = (c+1)/2 for the Beta distribution, which would imply a Beta(alpha * C / (1-C), alpha) distribution. This makes sense to me for an individual correlation, but I don’t understand how it would generalize to a correlation matrix.
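
A quick numerical check of that algebra (alpha picked arbitrarily):

# Target mean correlation c; on the Beta scale the mean should be C = (c + 1) / 2,
# so with second shape parameter alpha the first shape should be alpha * C / (1 - C)
c_target <- 0.5
alpha <- 2
C <- (c_target + 1) / 2
a <- alpha * C / (1 - C)
2 * a / (a + alpha) - 1    # mean of 2 * Beta(a, alpha) - 1; recovers c_target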

I think you want a convex combination of a correlation matrix that is LKJ with shape 1 and a correlation matrix with the same correlation in all non-diagonal cells.

Thanks. I was actually just thinking about running the following. I think it will do the same thing.

data {
     int<lower=0> N;
     int<lower=0> T;

     vector[N] X[T];
}
transformed data {
    vector[N] zeroes;
    vector[N] ones;
    matrix[N, N] ones_t_ones;
    matrix[N, N] eye;

    zeroes = rep_vector(0.0, N);
    ones = rep_vector(1.0, N);
    ones_t_ones = ones * ones';
    eye = diag_matrix(ones);
}
parameters {
    simplex[3] p;

    cholesky_factor_corr[N] C_input;
}
transformed parameters {
    cholesky_factor_corr[N] C;

    C = p[1] * eye + p[2] * ones_t_ones + p[3] * C_input;
}
model {
    C_input ~ lkj_corr_cholesky(2);
    X ~ multi_normal(zeroes, C)
}

A few errors fixed below. I just ran it on the scaled data (zero mean, 1 standard deviation) and it worked really well. Like way better than anything else I’ve tried.

data {
     int<lower=0> N;
     int<lower=0> T;

     vector[N] X[T];
}
transformed data {
    vector[N] zeroes;
    vector[N] ones;
    matrix[N, N] ones_t_ones;
    matrix[N, N] eye;

    zeroes = rep_vector(0.0, N);
    ones = rep_vector(1.0, N);
    ones_t_ones = ones * ones';
    eye = diag_matrix(ones);
}
parameters {
    simplex[3] p;

    cholesky_factor_corr[N] C_input;
}
transformed parameters {
    matrix[N, N] C;

    C = p[1] * eye + p[2] * ones_t_ones + p[3] * (C_input * C_input');
}
model {
    C_input ~ lkj_corr_cholesky(1);
    X ~ multi_normal(zeroes, C);
}

Just do C = p[3] * C_input; and then fix up the elements of C without incurring all the autodiff from multiplying and adding.

Updated part of the code below. A bit faster, enough to keep it this way. I imagine it would have bigger speedups for bigger problems (my current version has T=300, N=23). Naively using loops was about 5x slower though.

parameters {
    simplex[3] p;

    cholesky_factor_corr[N] C_input;
}
transformed parameters {
    matrix[N, N] C;

    //C = p[1] * eye + p[2] * ones_t_ones + p[3] * (C_input * C_input');
    C = p[1] * (C_input * C_input');
    C = C + p[2];
    for (i in 1:N)
    {
        C[i, i] = C[i, i] + p[3];
    }
}

Use: multiply_lower_tri_self_transpose(C_input)

This is redundant as it’s uniform.

Use: rep_matrix(p[2], N, N) in place of p[2] * ones_t_ones

All this arithmetic in Stan is costly as we have to chase the chain rule through all the multiplies as soon as a parameter like p[2] gets involved. Just copying the matrix saves N x N multiplies during construction and just as many chain rule passes (and virtual function calls) and multiplies in the reverse pass.

Yup, what @bgoodri said. I didn’t see this. I put a big warning flag on the function in the manual for just this reason! You want to do this:

transformed parameters {
  matrix[N, N] C
     = multiply_lower_tri_self_transpose(C_input)
       + rep_matrix(p[3], N, N);
  for (n in 1:N) C[n, n] += 1;

That’s not quite optimal in terms of autodiff structure, but it’s still readable. It’ll probably be a bit faster to use this:

  matrix[N, N] C
     = multiply_lower_tri_self_transpose(C_input);
  {
    real p_3_p1 = p[3] + 1;
    for (n in 1:N)  C[n, n] += p_3_p1;
    for (n in 1:N) {
      for (m in 1:n-1) C[n, m] += p[3];
      for (m in n+1:N) C[n, m] += p[3];
    }
  }

for (n in 1:N) C[n, n] = 1;
since C is a correlation matrix by construction

The original user model added 1 after expanding the Cholesky factor, so you’d need += 1 to match that if it was intentional (that was the + eye).

The other problem with setting this to 1 would be that it would lose derivative information from the multiply-self-transpose. But maybe there isn’t any. Maybe I’ll ask @bgoodri :-)

@Bob_Carpenter I appreciate the replies. I had just wanted to get something working. Raw use of the LKJ prior with this type of data was a big failure that took a long time for me to figure out, so I was glad to just get something producing sensible results before optimizing. For instance, my more recent versions had used tcrossprod instead of (C_input * C_input'). I didn’t notice much difference when replacing that with multiply_lower_tri_self_transpose, but I’m using a slightly older version of rstan (2.12-ish; also, some of the syntax in your versions does not work for me, I’m slow to update!). Along the same lines, I noticed the program ran quite a bit faster when I used the explicit lkj_corr_cholesky prior, even if you say it’s redundant, but perhaps that is due to my older version of Stan. My latest version is below. It is largely the same as yours, with just the syntax changed to work on my version and a bug fix or two.

One thing I thought I’d bring up is that this is being run on data standardized to have zero mean and standard deviation of 1. I get the best performance in this case. Where it gets slower is when I relax those assumptions and try to estimate the mean and standard deviation as well. The results still make sense, just slower. And really, it’s the standard deviation that’s the issue. When I adjust the scale function so that it is still centering, but no longer setting standard deviation to 1, the time Stan takes for each chain is about 10 times longer on the most recent run compared to the version with just a correlation matrix. Leaving the mean, but scaling so standard deviation is one, takes only about 50% longer, which is reasonable enough for me to not mention. The changes are relatively minor (also below). I think the issue is the correlation between the standard deviation parameters, but I can’t think of an easy solution right now. I might have tried scaling the data in Stan at one point, but I forget the results now.

Version with zero mean, 1 standard deviation:

data {
     int<lower=0> N;
     int<lower=0> T;

     vector[N] X[T];
}
transformed data {
    vector[N] zeroes;

    zeroes = rep_vector(1.0, N);
}
parameters {
    simplex[3] p;

    cholesky_factor_corr[N] L_Omega;
}
transformed parameters {
    matrix[N, N] C;
    
    C = p[1] * multiply_lower_tri_self_transpose(L_Omega);
    {
        real p2_p3;
        p2_p3 = p[2] + p[3];

        for (n in 1:N) {
            C[n, n] = C[n, n] + p2_p3;
            for (m in 1:n-1) C[n, m] = C[n, m] + p[2];
            for (m in n+1:N) C[n, m] = C[n, m] + p[2];
        }
    }
}
model {
    L_Omega ~ lkj_corr_cholesky(1); //nu=1, allegedly redundant as uniform, but faster

    X ~ multi_normal(zeroes, C);
}

Version with zero mean, but estimate standard deviations:

data {
     int<lower=0> N;
     int<lower=0> T;

     vector[N] X[T];
}
transformed data {
    vector[N] zeroes;

    zeroes = rep_vector(1.0, N);
}
parameters {
    simplex[3] p;
    cholesky_factor_corr[N] L_Omega;
    vector<lower=0>[N] sigma;
}
transformed parameters {
    matrix[N, N] C;
    cov_matrix[N] Sigma;
    
    C = p[1] * multiply_lower_tri_self_transpose(L_Omega);
    {
        real p2_p3;
        p2_p3 = p[2] + p[3];

        for (n in 1:N)
        {
            C[n, n] = C[n, n] + p2_p3;
            for (m in 1:n-1) C[n, m] = C[n, m] + p[2];
            for (m in n+1:N) C[n, m] = C[n, m] + p[2];
        }
    }
    Sigma = quad_form_diag(C, sigma);
}
model {
    sigma ~ cauchy(0.05, 0.05);

    L_Omega ~ lkj_corr_cholesky(1); //nu=1, allegedly redundant as uniform, but faster

    X ~ multi_normal(zeroes, Sigma);
}

@Bob_Carpenter Arg, never mind my point about speed differences (at least in the code above). I had put zeroes as being ones!

Hello!

I’ve read this thread, and I think it addresses a problem I encounter when I try to fit multivariate regressions.

Some components of L_Omega, the Cholesky factor of my correlation matrix, go to zero, and my chains do not explore the posterior space (Rhat = NaN).

Is this behaviour due to the cholesky prior, or did I make mistakes in my models?

Here is my model:

require(rstan)
my_model <- stan_model(model_code =
"data {
  int<lower=1> K;
  int<lower=1> J;
  int<lower=0> N;
  vector[J] X[N];
  vector[K] Y[N];
}
parameters {
  matrix[K, J] beta;
  cholesky_factor_corr[K] L_Omega;
  vector<lower=0>[K] L_sigma;
}
model {
  vector[K] mu[N];
  matrix[K, K] L_Sigma;

  for (n in 1:N)
    mu[n] = beta * X[n];

  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
  to_vector(beta) ~ normal(0, 5);
  L_Omega ~ lkj_corr_cholesky(1);
  L_sigma ~ cauchy(0, 2.5);

  Y ~ multi_normal_cholesky(mu, L_Sigma);
}
generated quantities {
  matrix[K, K] L_Sigma;
  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);
}")

And here is a process generating a dataset similar to mine:

require(mvtnorm)

# Number of plots sampled
n_plot = 20

# Number of individuals sampled in each plot
n_ind = 40

# Matrix of predictors
X <- rmvnorm(n_plot * n_ind, mean = c(0, 0), sigma = matrix(c(1, 0.07, 0.07, 1), ncol = 2))

# Matrix of regression coefficients
Beta <- matrix(c(
   0.42, -0.04,
   0.65, -0.05,
   0.51, -0.03,
  -0.70, -0.12,
  -0.38,  0.10,
   0.11, -0.01,
  -0.04, -0.13
), ncol = 2, byrow = T)

# Compute the deterministic mean for each observation
mu <- matrix(NA, nrow = nrow(X), ncol = nrow(Beta))
for (n in 1:nrow(X)) mu[n, ] <- Beta %*% X[n, ]

# Create the correlation matrix
Sigma <- matrix(c( 1.00,  0.67, -0.09, -0.20, -0.15, -0.36,  0.25,
                   0.67,  1.00,  0.23, -0.49, -0.18, -0.13,  0.04,
                  -0.09,  0.23,  1.00, -0.63, -0.43,  0.43, -0.52,
                  -0.20, -0.49, -0.63,  1.00,  0.47, -0.17,  0.15,
                  -0.15, -0.18, -0.43,  0.47,  1.00,  0.01,  0.17,
                  -0.36, -0.13,  0.43, -0.17,  0.01,  1.00, -0.29,
                   0.25,  0.04, -0.52,  0.15,  0.17, -0.29,  1.00), ncol = 7)

# Compute observations from the multivariate normal distribution
Y <- matrix(NA, nrow = nrow(mu), ncol = ncol(mu))
for (n in 1:nrow(mu)) Y[n, ] <- rmvnorm(1, mu[n, ], sigma = Sigma)

pairs(cbind(Y, X))

# Create objects to pass to Stan

# Dimensions
N <- nrow(Y) ## Number of observations
K <- ncol(Y) ## Number of response variables
J <- ncol(X) ## Number of predictors

Data_stan <- list(Y = Y, X = X,
                  N = N, K = K, J = J)

# Fit the model
mod_reparam <- sampling(my_model,
                        data = Data_stan,
                        chains = 2, cores = 2, iter = 1000)

# Check diagnostics
check_all_diagnostics(mod_reparam)
print(mod_reparam)

I do not have that kind of trouble if I do this analysis directly with the non-reparameterized Sigma, that is, without the Cholesky decomposition.

Thank you!

Lucas