Learning a correlation matrix

kon · September 11, 2017, 5:28pm

Hi,

I am trying to fit a latent variable model for which I want to learn a correlation matrix.
I followed the advice on Stan manual in chapter 9.15 and targeted the cholesky factor of the correlation matrix. However, the histogram of the posterior samples are very wide and centered around the wrong value. Specifically, in my example I am trying to learn just 1 correlation coefficient. The real coefficient I use to generate the data is 0.6 but the posterior sample mean is 0.
I would greatly appreciate your help in improving this model. I am suspecting I have a mistake somewhere. Is the LKJ prior pulling the posterior towards 0 too much?

Here is my model.stan file:

data{
 int<lower=0> N;
 int<lower=0> K;
 int<lower=0, upper=1> y[N,K];
}


parameters {
  vector[K] mu;
  vector[K] z[N];
  cholesky_factor_corr[K] L_R;
  
}

model{
  L_R ~ lkj_corr_cholesky(4);
  mu ~ normal(0,1);
  z  ~ multi_normal_cholesky(mu, L_R);
  for (k in 1:K) y[,k] ~ bernoulli_logit(z[,k]);
  
}

generated quantities{
  corr_matrix[K] R;
  R = multiply_lower_tri_self_transpose(L_R);
}

And here is the R script to generate data and fit the model:

library(gtools)
library(MASS)
set.seed(68234934)
N <-300
K<-2

rho <- 0.6
R <- matrix(c(1, rho, rho, 1),2,2)
mu <- c(0.7,0)
z <- mvrnorm(n = N, mu , R)
p <- inv.logit(z, min = 0, max = 1)
y = sapply(1:N, function(n) rbinom(size=1, 2,  prob = p[n,]))
y<- t(y)
stan_rdump(c("N", "K", "y"), file="./model.data.R")
input_data <- read_rdump("./model.data.R")
fit <- stan(file='./model.stan',
            control = list('adapt_delta' = 0.8 , 'max_treedepth' = 15),
            data= input_data,
             seed=4938483)

Heading

bgoodri · September 11, 2017, 5:59pm

No, the LKJ prior is pulling the posterior towards 0 by exactly the amount that is consistent with the data. For 300 observations, that is quite a bit when you set the shape parameter of the LKJ distribution equal to 4. You see a lot of examples in Stan programs of large shape parameters, but I almost never go below 1 or above 2. If you set it to 1, the posterior mean will be pretty close to 0.6 in your simulation.

kon · September 11, 2017, 6:15pm

Ben,

Thank you for the suggestion. I think the issue here is that there is not enough information to learn R, even with N=300.
I am guessing that’s the cause of the Bayesian Fraction of Missing Information was low error, is that is that a right way to interpret that?

bgoodri · September 14, 2017, 10:45pm

The BFMI is really a reflection on the model as a whole rather than a particular part of it.

kon · September 14, 2017, 10:50pm

I see - thanks for the comments!

kon · December 5, 2017, 2:41pm

Should this work the same if I now have binary and continuous data and want to to learn the variance as well?
I’ve been working on the following model for a while (’./corr_model.stan’):

data{
 int<lower=0> N;
 int<lower=0> Kb;
 int<lower=0> Kc;
 int<lower=0, upper=1> yb[N,Kb];
 real yc[N,Kc];
}

transformed data{
  int<lower=1> K = Kb + Kc;
}
parameters {
  vector[Kb] mu_b;
  vector[Kc] mu_c;
  
  cholesky_factor_corr[K] L_R;
  matrix[N, K] Z_raw;
  vector<lower=0>[Kc] sigma;
}

model{
  matrix[N,K] Z = Z_raw * transpose(L_R);
  mu_b ~ normal(0,1);
  mu_c ~ normal(0,10);
  sigma ~ normal(0, 5);
  L_R ~ lkj_corr_cholesky(1);
  to_vector(Z_raw) ~ normal(0,1);
  for (k in 1:Kb) yb[,k] ~ bernoulli_logit(mu_b[k] + Z[,k]);
  for (k in 1:Kc) yc[,k] ~ normal(mu_c[k] + Z[,k + Kb], sigma[k]);
}

generated quantities{
  corr_matrix[K] R = multiply_lower_tri_self_transpose(L_R);
}

My data is a 400 x 3 matrix Y with the first two columns being binary and the third column continuous. In this case, in addition to the latent matrix Z, the mean and correlation matrix, I also want to learn the variance of the third column of Y.

I generate data and run the model with the following code:

set.seed(68232134)
N <-400
K<-3
Kb <-2
Kc<-1
Rval <- c(1.        ,  -0.5 , -0.2,
          -.5        ,  1 , .8 ,
          -.2, .8, 1)

R <- matrix(Rval,3,3)
z <- mvrnorm(n = N, rep(0,K)  , R)
mu_b<- c(-.8 , -2.7)
mu_c <- c(-3.3)
mu <- c(mu_b, mu_c)
sigma <- c(.8)

zb <- matrix(0, nrow = N, ncol = Kb)
for(n in 1:N){
  zb[n,] <- z[n,1:Kb] + mu_b
}
zc <- matrix(0, nrow = N, ncol = Kc)
for(n in 1:N){
  zc[n,] <- z[n,(Kb+1):K] * sigma + mu_c
}
p <- plogis(zb)
yb = sapply(1:N, function(n) rbinom(size=1, Kb,  prob = p[n,]))
yb<- t(yb)
yc <- zc

fit <- stan(file='./corr_model.stan',
            data=list(N = N, K = K, Kb = Kb, Kc = Kc, yb = yb, yc = yc),
            control = list('adapt_delta' = 0.98 , 'max_treedepth' = 10),
            iter=2000,
            chains = 4,
            seed=4938483)

I am observing the following behavior:

This model fits well and find the true parameters when true sigma is approx > 2
It throws fitting errors and underestimates sigma when true sigma is approx < 2 and gets worse as it approaches 0.

What could be a reason for this behavior?
Can this be fixed by adjusting the priors on sigma or the other parameters?(In particular, I don’t have a good intuition about the LKJ prior)
Or is the problem somewhere else?

Any help would be much appreciated.

Thanks,

Bob_Carpenter · December 6, 2017, 4:39am

I’m not sure which (co)variance you want to estimate. Is that just the sigma in your Stan program?

What kind of errors are you seeing? Have you figured out what’s causing them?

You can’t test if you get the same values for calibration, you have to look at interval coverage.

This is a no-op with eta = 1 as it’s just a uniform distribution over Cholesky factors for correlation matrices. If you increase eta above 1, it favors unit correlation matrices (i.e., no or low correlations).

Your priors look OK—you are generating test data that’s much more narrowly distributed than your priors would entail, but it’s unlikely the priors are the problem here.

kon · December 6, 2017, 1:58pm

Bob,

Thank you for replying.

Yes! Here sigma is a number but it can be vector of dim K (see corr_model.stan).

Running the above code gives the following errors:

Warning messages:
1: There were 9 divergent transitions after warmup. Increasing adapt_delta above 0.98 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 
2: There were 2002 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
3: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low 
4: Examine the pairs() plot to diagnose sampling problems

I haven’t. Any help would be much appreciated.

The posterior intervals cover the true parameters (good!) for R and mu but are way off for sigma.

> print(fit, pars = 'R')
Inference for Stan model: corr_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd  2.5%   25%   50%  75% 97.5% n_eff Rhat
R[1,1]  1.00    0.00 0.00  1.00  1.00  1.00 1.00  1.00  4000  NaN
R[1,2]  0.05    0.01 0.31 -0.59 -0.17  0.05 0.28  0.63   527 1.00
R[1,3] -0.09    0.01 0.13 -0.34 -0.19 -0.09 0.00  0.16   141 1.04
R[2,1]  0.05    0.01 0.31 -0.59 -0.17  0.05 0.28  0.63   527 1.00
R[2,2]  1.00    0.00 0.00  1.00  1.00  1.00 1.00  1.00  3904 1.00
R[2,3]  0.81    0.03 0.13  0.52  0.73  0.83 0.91  0.98    14 1.14
R[3,1] -0.09    0.01 0.13 -0.34 -0.19 -0.09 0.00  0.16   141 1.04
R[3,2]  0.81    0.03 0.13  0.52  0.73  0.83 0.91  0.98    14 1.14
R[3,3]  1.00    0.00 0.00  1.00  1.00  1.00 1.00  1.00  4000 1.00

> print(fit, pars = 'sigma')
Inference for Stan model: corr_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
sigma[1]  0.1    0.02 0.04 0.02 0.07 0.11 0.13  0.19     5 1.55


> print(fit, pars = 'mu_b')
Inference for Stan model: corr_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_b[1] -0.69       0 0.12 -0.92 -0.78 -0.70 -0.61 -0.45  4000    1
mu_b[2] -2.27       0 0.17 -2.60 -2.38 -2.27 -2.16 -1.95  4000    1

> print(fit, pars = 'mu_c')
Inference for Stan model: corr_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
mu_c[1] -3.28       0 0.05 -3.38 -3.32 -3.28 -3.25 -3.18   800    1

One clue for figuring out what’s going is that when true sigma is bigger, say >2 , fit diagnostics are clean and the posteriors cover the true values for all parameters. The above problems (fitting errors and the posterior coverage for sigma misses the true value) seem to appear when sigma gets smaller and intensify as true sigma approaches zero. Above I demonstrated these problems with true sigma = 0.8.

Hope this helps - thanks again for the reply!

Bob_Carpenter · December 13, 2017, 9:44pm

If the model’s very complicated, you may need to increase the maximum tree depth. What’s happening is that you’re converging to a very small step size and it’s taking a lot of steps to move across the posterior.

If you only have a handful of divergences, increasing adapt_delta can help (but that will cause even more growth in tree depth).

Often, these problems can be mitivated or even solved with reparameterizations that reduce posterior curvature.

It may also help to run more iterations. If you run more warmup iterations, Stan has more of a chance to adapt. Also, you want to know if that n_eff of 14 will go up with more iterations. If so, those parameters are mixing, just slowly, and more iterations will bring down Rhat.

What happens when scale parameters shrink is that they increase the curvature for any parameters depending on them. The classic funnel in hierarchical models is the standard example—when hierarchical scale is low, the parameters have to all be near one another. Reparameterizing with the non-centered parameterization helps immensely and should help in your case if there’s a hierarchical model.

Topic		Replies	Views
Sparse sampling for a Cholesky correlation matrix (or modelling correlations among some, but not all parameters) Modeling	15	1248	February 1, 2019
Question on cholesky_factor_corr and the resulting correlation matrix Modeling	4	1096	January 8, 2020
Issues with cholesky_factor_corr Matrices: Estimated as 1 During Model Fitting, Exceeding [-1,1] in Predictive Sampling Modeling rstan	2	46	February 11, 2025
Covariance prior via lkj_corr_cholesky Modeling fitting-issues , covariance	4	536	December 21, 2023
Tip for Fitting Multivariate Normal Modeling	0	1606	August 9, 2019

Learning a correlation matrix

Heading

Related topics