Multivariate normal with highly correlated data

Hi, I’m having issues fitting a simple multivariate normal model. The model works well for many datasets, but not when the input data are highly correlated, e.g., cor(x, y) = 0.9. Highly correlated data happens to be exactly the scenario I’m developing the model for.

Below is a simplified example:

data {
  int<lower=0> n;            // number of observations
  int<lower=0> p;            // number of variables
  matrix[n, p] x;            // data
  // hyperparameters
  real<lower=0> priorEta;    // LKJ shape
  real<lower=0> priorSdx;    // Cauchy scale for the standard deviations
}
transformed data {
  vector[p] zeros = rep_vector(0, p);
}
parameters {
  cholesky_factor_corr[p] Lcorr;  // Cholesky factor of the correlation matrix
  vector<lower=0>[p]      sdx;    // standard deviations
}
model {
  matrix[p, p] L = diag_pre_multiply(sdx, Lcorr);  // Cholesky factor of the covariance
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  for (i in 1:n) {
    x[i] ~ multi_normal_cholesky(zeros, L);
  }
}

And here is the R code I use to run the model:

rm(list = ls())
library(rstan)
n <- 100
p <- 3

# This matrix is not positive definite (its determinant is negative),
# so project it to the nearest valid correlation matrix first.
r <- matrix(c(1.0, 0.5, 0.9,
              0.5, 1.0, 0.0,
              0.9, 0.0, 1.0), 3, 3)
r <- as.matrix(Matrix::nearPD(r, corr = TRUE)$mat)
sdx <- rep(1, p)

s <- diag(sdx) %*% r %*% diag(sdx)
x <- mvtnorm::rmvnorm(n, sigma = s)

dataList <- list(
  x = x,
  n = n,
  p = p,
  priorEta = 1,
  priorSdx = 1
)

res  <- stan(file = "stan/question/model.stan",  data = dataList, chains = 2)
ss  <- summary(res)

# Rhat is meh
ss$summary[!is.na(ss$summary[ ,"Rhat"]), "Rhat"]
# output
Lcorr[2,1] Lcorr[2,2] Lcorr[3,1] Lcorr[3,2] Lcorr[3,3]     sdx[1] 
  1.121353   1.123117   1.794683   1.794350   1.578169   1.703874 
    sdx[2]     sdx[3]       lp__ 
  1.239409   1.881612   1.085317 
Warning messages:
1: There were 1861 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
2: Examine the pairs() plot to diagnose sampling problems

Some observations:

If the correlations among predictors are approximately 0, a chain of 2000 samples takes about 10 seconds to run. If the correlations among predictors are high, this can increase to about 500 seconds per chain! I’m aware I can rewrite the normal model in terms of its sufficient statistics, which gives a good performance boost, but it doesn’t improve the quality of the sampling (i.e., the R-hats are still high).
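(For reference, that rewrite uses the fact that with \mu = 0 the n likelihood terms collapse into a single expression in the p \times p scatter matrix S = X^TX:

\sum_{i=1}^{n} \log N(x_i \mid 0, \Sigma) = -\frac{n}{2}\log\lvert 2\pi\Sigma\rvert - \frac{1}{2}\text{tr}(\Sigma^{-1}S),

since \sum_i x_i^T\Sigma^{-1}x_i = \text{tr}(\Sigma^{-1}X^TX) when the x_i are the rows of X.)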

Is there any reparametrization trick, or another method to improve the efficiency of the sampler, that I’m missing?
All comments, suggestions, or advice are welcome!

Try a higher max_treedepth.

Thanks! I suppose that works, although it’s not the solution I was hoping for. I’m wondering if there exists something more along the lines of this: http://mc-stan.org/users/documentation/case-studies/qr_regression.html . Of course, it could be that such a transformation doesn’t exist for this scenario.

You can use that technique too; you just repeatedly pay the QR cost.
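(For the curious, the scaled thin QR decomposition used in that case study looks roughly like this in R; x and n are from the example above, and the names are illustrative:)

qrd    <- qr(x)
Q_star <- qr.Q(qrd) * sqrt(n - 1)  # orthogonal columns on roughly unit scale
R_star <- qr.R(qrd) / sqrt(n - 1)  # so that x == Q_star %*% R_star
max(abs(Q_star %*% R_star - x))    # reconstruction error, ~ 0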

Ok, a question. You have defined a set of x and y variables in your R code, but only the x variables are fed into Stan. What exactly are you trying to model here? Simply the correlations amongst x?

Ah, my bad, that’s an artifact of a possible expansion where the x variables (predictors) are related to y (a target variable). While playing around with the model, I noticed that fitting a model in which x and y are unrelated already gave me difficulties when x is highly correlated (hence the question). I’ll edit the post and remove the references to y.

Ok, my next question: why are you regressing against “zeros”?

I’m not interested in the means, just the correlations and variances. I suppose I could model the means, but would that improve the sampling efficiency? Usually I just center the input and set the mean of the distribution to zero.
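(In R, that centering step is just, e.g.:)

x <- scale(x, center = TRUE, scale = FALSE)  # subtract the column means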

If you center the means you could also apply some rotation such that the variables are no longer so correlated.

Ah, ok. But what I’m wondering is: if the general form of the model is

\text{posterior} \propto \text{model}(\text{data}) \times \text{prior},

then by using \text{model}(\text{data}) = 0, are you not defining the whole RHS to be 0? (I could be wrong here, but this is what I understand.)

That’s what I am looking for. I suppose I could introduce X^TX = \Sigma = Q\Lambda Q^T (eigendecomposition) and then use W = XQ\Lambda^{-\frac{1}{2}} such that W^TW = I. But then if I model W\sim N(0, \text{diag}(\sigma)LL^T\text{diag}(\sigma)), how do I transform \text{diag}(\sigma) and L into a and B such that these are the parameters of X \sim N(0, \text{diag}(a)BB^T\text{diag}(a))? I realize this happens in the transformed parameters block, but I just don’t know the specific transformations I should apply.
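(A minimal R sketch of that whitening step, using the x from the example above:)

e <- eigen(crossprod(x))                           # X^T X = Q diag(lambda) Q^T
W <- x %*% e$vectors %*% diag(1 / sqrt(e$values))  # W = X Q Lambda^(-1/2)
round(crossprod(W), 10)                            # W^T W is the identity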

I think it just means that

P(\sigma, L \mid X) \propto N(X\mid 0, \text{diag}(\sigma)LL^T\text{diag}(\sigma))P(\sigma)P(L),

where \sigma is the vector of standard deviations and L is the Cholesky factor of the correlation matrix. The zero means don’t really affect the likelihood; they just imply (x_i - \mu)^T(\text{diag}(\sigma)LL^T\text{diag}(\sigma))^{-1}(x_i-\mu) = x_i^T(\text{diag}(\sigma)LL^T\text{diag}(\sigma))^{-1}x_i,
so the quadratic form inside the normal density simplifies a little.

But your model above does not have any x_i on the RHS. You only have x on the left. So your x_i - \mu term simplifies to -\mu = 0, leaving Stan with 0 fixed on the right-hand side, and therefore it is exploring an infinite parameter space (I believe). If instead of zeros you allow a parameter b with a prior centered at 0, then it can explore a non-infinite parameter space (I think). So for example:

data {
  int<lower=0> n;
  int<lower=0> p;
  matrix[n, p] x;
  // hyperparameters
  real<lower=0> priorEta;
  real<lower=0> priorSdx;
}
parameters {
  cholesky_factor_corr[p] Lcorr;
  vector<lower=0>[p]      sdx;
  vector[p]               b;     // mean vector, replacing the fixed zeros
}
model {
  matrix[p, p] L = diag_pre_multiply(sdx, Lcorr);
  b     ~ normal(0, 1);
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  for (i in 1:n) {
    x[i] ~ multi_normal_cholesky(b, L);
  }
}
generated quantities {
  // recover the correlation matrix from its Cholesky factor
  matrix[p, p] r_mat = multiply_lower_tri_self_transpose(Lcorr);
}

I ran this, and then the following to recover the correlation matrix:

res  <- stan(file = "Stan models/forum_question.stan",  data = dataList, chains = 4, iter=2000)
ss  <- summary(res)
params <- extract(res)
sapply(1:p, function(y) sapply(1:p, function(x) median(params$r_mat[, x, y])))
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.5672136 0.8814693
[2,] 0.5672136 1.0000000 0.1058976
[3,] 0.8814693 0.1058976 1.0000000

This is crudely correct, but it took a long time. So I do a traceplot of the betas:
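(generated with something like:)

rstan::traceplot(res, pars = "b")  # trace of the mean parameters across chains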

Oooo, that’s a bit ugly. So we found roughly the right answers, but it took a long time and the betas are wandering around in space a bit much. I think we can fix this with a non-centered parameterisation… give me a few mins to implement that.

Ok, an edit on this post.
The non-centered parameterisation did not help. I also ran the centered version from one post up with tighter priors, and that didn’t help either. I’m sorry, I don’t know how to write LaTeX equations, but I think your problem is that your model code only has parameters and priors on the RHS and no data. All your data is on the left-hand side of the model equation, so I think identifiability problems are inescapable unless you change that. This is just what I think from looking at your (and my) code above. So I don’t think your code corresponds to the equations you wrote above. But I may well be wrong; I’m only learning this stuff too.

I have had difficulty with multivariate normals in Stan in the past.

Here’s something I think I’ve never tried: the Stan documentation often recommends centering and standardizing variables so that they have a mean of 0 and a standard deviation of 1. What about going another step further and standardizing them so they have zero correlation as well (using the Cholesky factor of the sample covariance matrix)? Then fit a multivariate normal on this transformed data set and adjust the estimate of the covariance matrix.
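(A minimal R sketch of that decorrelation, assuming x is the data matrix; chol() returns an upper-triangular U with cov(x) = t(U) %*% U, and a fit on W would map back to the original scale via \Sigma_X = U^T\Sigma_W U:)

U <- chol(cov(x))      # cov(x) == t(U) %*% U
W <- x %*% solve(U)    # decorrelated data
round(cov(W), 10)      # sample covariance of W is the identity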

Thanks for the suggestion! I considered this, but I failed at working out the Jacobian of the transformation. The crucial part of the transformation is L\text{diag}(1 / \sigma)R^{-1}, where L is the Cholesky factor of the sample covariance matrix, \text{diag}(1 / \sigma) is a diagonal matrix containing the inverses of the population standard deviations, and R^{-1} is the inverse of the Cholesky factor of the population correlation matrix.

If I want to sample from a standard normal I need to introduce the change of variables |J(\text{diag}(1 / \sigma)R^{-1}\rightarrow L\text{diag}(1 / \sigma)R^{-1})|. I can express L\text{diag}(1 / \sigma)R^{-1} symbolically, but I’m not sure how to obtain the Jacobian of the transformation. I have a feeling the Jacobian will be triangular, so its determinant is just the product of the diagonal values, but that’s where I got stuck.
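(If the triangularity hunch is right, the determinant would come cheap: L, \text{diag}(1/\sigma), and R^{-1} are all lower triangular, so their product M = L\text{diag}(1/\sigma)R^{-1} is too, and the diagonal of a product of triangular matrices is the product of the diagonals:

|\det J| = \prod_{i=1}^{p} M_{ii} = \prod_{i=1}^{p} \frac{L_{ii}}{\sigma_i R_{ii}}.)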

So by chance I had to do the same thing myself just now for my own work and took a fresh attempt at it. Try this:

data{
    int<lower=0> NvarsX;   // number variables
    int<lower=0> N;        // Total number of rows
    vector[NvarsX] x [N];  // data for independent vars
}

parameters{
    cholesky_factor_corr[NvarsX] L_Omega; // L form of correlation matrix
    vector<lower=0>[NvarsX] L_sigma;      // Vector of scales for covariance matrix
    vector[NvarsX] mu;              // vector of mean value for each variable
}

transformed parameters{
    matrix[NvarsX, NvarsX] L_Sigma;
    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);    // L form of Sigma covariance matrix
}

model{
    //Priors (note: mu is left with an implicit flat prior)
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ normal(0, 5);

    //Likelihood, vectorized over the N observations
    x ~ multi_normal_cholesky(mu, L_Sigma);
}

generated quantities{
    matrix[NvarsX, NvarsX] Omega;   // Square form of correlation matrix
    matrix[NvarsX, NvarsX] Sigma;   // Square form of covariance matrix
    
    // Generate correlation matrix
    Omega = multiply_lower_tri_self_transpose(L_Omega);     
    Sigma = quad_form_diag(L_Omega, L_sigma);               
}

Successfully recovers the design covariance on my own test data, so hopefully it works for you too!
This R code should recover the covariance matrix:

fit1 = stan(file= "model1.stan", data=datalist, iter=2000, chains = 4);

params <- extract(fit1)
sapply(1:NvarsX, function(x) sapply(1:NvarsX, function(y) format( mean(params$Omega[, x, y]) , digits=3) ) )

Edit: sorry, the R code above will give you the correlation matrix. Change $Omega to $Sigma and you should get the covariance.
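(That is, something like:)

sapply(1:NvarsX, function(x) sapply(1:NvarsX, function(y) format( mean(params$Sigma[, x, y]) , digits=3) ) )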