# Multivariate normal with highly correlated data

#1

Hi, I’m having issues fitting a simple multivariate normal model. The model works well for many datasets, but not when input data is highly correlated, e.g., cor(x, y) = 0.9. Highly correlated data happens to be the scenario I’m developing a model for.

Below is a simplified example:

data {
  int<lower = 0> n;
  int<lower = 0> p;
  matrix[n, p] x;
  // hyperparameters
  real<lower = 0> priorEta;
  real<lower = 0> priorSdx;
}
transformed data {
  vector[p] zeros = rep_vector(0, p);
}
parameters {
  cholesky_factor_corr[p] Lcorr;
  vector<lower = 0>[p] sdx;
}
model {
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  for (i in 1:n) {
    x[i, ] ~ multi_normal_cholesky(zeros, diag_pre_multiply(sdx, Lcorr));
  }
}


And the R code I use to run the model:

rm(list = ls())
library(rstan)
n <- 100
p <- 3

r <- matrix(c(1.0, 0.5, 0.9,
              0.5, 1.0, 0.0,
              0.9, 0.0, 1.0), 3, 3)
r <- as.matrix(Matrix::nearPD(r, corr = TRUE)$mat)
sdx <- rep(1, p)
s <- diag(sdx) %*% r %*% diag(sdx)
x <- mvtnorm::rmvnorm(n, sigma = s)
dataList <- list(
  x = x,
  n = n,
  p = p,
  priorEta = 1,
  priorSdx = 1
)
res <- stan(file = "stan/question/model.stan", data = dataList, chains = 2)
ss <- summary(res)
# Rhat is meh
ss$summary[!is.na(ss$summary[ ,"Rhat"]), "Rhat"]

# output
Lcorr[2,1] Lcorr[2,2] Lcorr[3,1] Lcorr[3,2] Lcorr[3,3]     sdx[1]
  1.121353   1.123117   1.794683   1.794350   1.578169   1.703874
    sdx[2]     sdx[3]       lp__
  1.239409   1.881612   1.085317

Warning messages:
1: There were 1861 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems

Some observations:

- If the correlations among predictors are approximately 0, a chain of 2000 samples takes about 10 seconds to run.
- If the correlations among predictors are high, this time can increase to about 500 seconds per chain!
- I'm aware I can rewrite the normal model in terms of the sufficient statistics, which gives a good boost in performance, but it doesn't improve the quality of the sampling (i.e., the R-hat stays high).

Is there any reparametrization trick, or another method to improve the efficiency of the sampler, that I'm missing? All comments, suggestions, or advice are welcome!

#2

Try a higher max treedepth.

#3

Thanks! I suppose that works, although it's not the solution I was hoping for. I'm wondering if there exists something more along the lines of this: http://mc-stan.org/users/documentation/case-studies/qr_regression.html . Of course, it could be that such a transformation doesn't exist for this scenario.

#4

You can use that technique too, you just repeatedly pay the QR cost.

#5

OK, a question. You have defined a set of x and y variables in your R code, but only the x variables are fed into Stan. What exactly are you trying to model here? Simply the correlations amongst x?

#6

Ah, my bad, that's an artifact from a possible expansion where the x variables (predictors) are related to y (a target variable). While playing around with the model, I noticed that fitting a model where x and y are unrelated already gave me difficulties when x is highly correlated (hence the question). I'll edit and remove the references to y.

#7

OK, my next question: why are you regressing versus "zeros"?

#8

I'm not interested in the means, just the correlations and variances. I suppose I could model the means, but would that improve the sampling efficiency? Usually I just center the input and set the mean of the distribution to zero.

#9

If you center the means you could also do some rotation such that the variables are no longer so correlated.

#10

Ah, OK. But what I'm wondering is if the general form of the model is:

Posterior ~ model(data) * prior

By using model(data) = 0, are you not defining the whole RHS = 0? (I could be wrong here, but this is what I understand.)

#11

That's what I am looking for. I suppose I could introduce X^TX = \Sigma = Q\Lambda Q^T (eigendecomposition) and then use W = XQ\Lambda^{-\frac{1}{2}} such that W^TW = I. But then if I model W \sim N(0, \text{diag}(\sigma)LL^T\text{diag}(\sigma)), how do I transform \text{diag}(\sigma) and L into a and B such that these are the parameters of X \sim N(0, \text{diag}(a)BB^T\text{diag}(a))? I realize this happens in the transformed parameters block, but I just don't know the specific transformations I should apply.

#12

I think it just means that P(\sigma, L \mid X) \propto N(X \mid 0, \text{diag}(\sigma)LL^T\text{diag}(\sigma))P(\sigma)P(L), where \sigma are the standard deviations and L is the Cholesky of the correlation matrix. The zero means don't affect the likelihood much, as they imply

(x_i - \mu)^T(\text{diag}(\sigma)LL^T\text{diag}(\sigma))^{-1}(x_i - \mu) = x_i^T(\text{diag}(\sigma)LL^T\text{diag}(\sigma))^{-1}x_i.

So the inner product of the normal distribution simplifies a little bit.

#13

But your model above does not have any x_i on the RHS. You only have x on the left. So then your x_i - \mu term simplifies to -\mu = 0, leaving Stan with 0 fixed on the right-hand side, and therefore it is exploring infinite parameter space (I believe). If, instead of zeros, you allow a parameter b with a prior defining mean = 0, then it can explore a non-infinite parameter space (I think). So, for example:

data {
  int<lower = 0> n;
  int<lower = 0> p;
  matrix[n, p] x;
  // hyperparameters
  real<lower = 0> priorEta;
  real<lower = 0> priorSdx;
}
//transformed data {
//  vector[p] zeros = rep_vector(0, p);
//}
parameters {
  cholesky_factor_corr[p] Lcorr;
  vector<lower = 0>[p] sdx;
  vector[p] b;
}
model {
  b     ~ normal(0, 1);
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  for (i in 1:n) {
    x[i, ] ~ multi_normal_cholesky(b, diag_pre_multiply(sdx, Lcorr));
  }
}
generated quantities {
  matrix[p, p] r_mat;
  // Generate correlation matrix from Lcorr
  r_mat = multiply_lower_tri_self_transpose(Lcorr);
}

I ran this, and then the following to recover the correlation matrix:

res <- stan(file = "Stan models/forum_question.stan", data = dataList, chains = 4, iter = 2000)
ss <- summary(res)
params <- extract(res)
sapply(1:p, function(y) sapply(1:p, function(x) median(params$r_mat[, x, y])))
[,1]      [,2]      [,3]
[1,] 1.0000000 0.5672136 0.8814693
[2,] 0.5672136 1.0000000 0.1058976
[3,] 0.8814693 0.1058976 1.0000000



This is crudely correct, but took a long time. So I do a traceplot of the betas:

Oooh, that's a bit ugly. So we found roughly the right answers, but it took a long time and the betas are wandering around in space a bit much. I think we can fix this with a non-centered parameterisation… give me a few minutes to implement that.
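An aside (not from the thread): the sufficient-statistics rewrite mentioned in the first post works because, with the mean fixed at zero, the multivariate normal log likelihood depends on the data only through S = X^T X, which can be computed once rather than once per observation. A minimal numerical sketch in Python/NumPy, with made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
n, p = 100, 3
x = rng.normal(size=(n, p))
sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])  # hypothetical covariance matrix

# Naive evaluation: one logpdf term per row.
ll_naive = multivariate_normal(np.zeros(p), sigma).logpdf(x).sum()

# Sufficient-statistic evaluation: the zero-mean likelihood depends on
# the data only through S = X'X, so S can be precomputed once and the
# per-iteration cost no longer scales with n.
S = x.T @ x
sign, logdet = np.linalg.slogdet(sigma)
ll_suff = -0.5 * (n * (p * np.log(2 * np.pi) + logdet)
                  + np.trace(np.linalg.solve(sigma, S)))

print(np.isclose(ll_naive, ll_suff))  # True
```

As the original poster notes, this speeds up evaluation but does not by itself fix the high R-hat: the posterior geometry is unchanged.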

#14

Ok edit on this post.
The non-centered parameterisation did not help. I just ran the centered version from one post up with tighter priors, and that didn't help either. I'm sorry, I don't know how to write LaTeX equations, but I think your problem is that your model code only has parameters and priors on the RHS, and no data. All your data is on the left-hand side of the model equation, so I think identifiability problems are inescapable unless you change that. This is just what I think from looking at your (and my) code above. So I don't think your code corresponds to the equations that you wrote above. But I may well be wrong; I'm only learning this stuff too.
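A quick numerical check of this point (my addition, not from the thread): even with the mean fixed at zero, the data still enters the likelihood through the quadratic form x_i^T \Sigma^{-1} x_i, as post #12 states, so different parameter values produce different likelihoods and the model is not degenerate in that sense. A sketch assuming SciPy is available:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 3))  # data drawn with true sd = 1

def loglik(x, sd, corr):
    # Zero-mean multivariate normal log likelihood: the data enter
    # through sum_i x_i' Sigma^{-1} x_i, so the likelihood still
    # discriminates between parameter values.
    sigma = np.diag(sd) @ corr @ np.diag(sd)
    return multivariate_normal(mean=np.zeros(3), cov=sigma).logpdf(x).sum()

eye = np.eye(3)
ll_true = loglik(x, np.array([1.0, 1.0, 1.0]), eye)
ll_off = loglik(x, np.array([2.0, 2.0, 2.0]), eye)
print(ll_true > ll_off)  # the true scales fit the data better
```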

#15

I have had difficulty with multivariate normals in the past with Stan.

Here’s something I don’t think I’ve ever tried: the Stan documentation often recommends centering and standardizing variables so that they have mean 0 and standard deviation 1. What about going a step further and standardizing so they have zero correlation as well (using the Cholesky of the sample covariance matrix)? Then fit a multivariate normal on this transformed data set and adjust the estimate of the covariance matrix.
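A rough numerical sketch of this suggestion (my addition, not from the thread, with made-up numbers): whiten the data with the Cholesky factor of the sample covariance, then map any covariance estimate back to the original scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate highly correlated data, in the spirit of the R snippet above
# (values adjusted so the matrix is positive definite).
r = np.array([[1.0, 0.5, 0.9],
              [0.5, 1.0, 0.4],
              [0.9, 0.4, 1.0]])
x = rng.multivariate_normal(np.zeros(3), r, size=1000)

# Whiten: post-multiply by the inverse transpose of the Cholesky
# factor of the sample covariance matrix.
L = np.linalg.cholesky(np.cov(x, rowvar=False))
w = x @ np.linalg.inv(L).T

# The whitened data now has identity sample covariance ...
print(np.allclose(np.cov(w, rowvar=False), np.eye(3)))  # True

# ... and any covariance estimate S_w obtained on w maps back to the
# original scale as L @ S_w @ L'.
S_w = np.cov(w, rowvar=False)
print(np.allclose(L @ S_w @ L.T, np.cov(x, rowvar=False)))  # True
```

The open question in the thread (the Jacobian adjustment needed to keep the posterior correct) is discussed in the next post.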

#16

Thanks for the suggestion! I considered this, but I failed at working out the Jacobian of the transformation. The crucial part of the transformation is L\text{diag}(1 / \sigma)R^{-1}, where L is the Cholesky of the sample covariance matrix, \text{diag}(1 / \sigma) is a diagonal matrix containing the inverse of the population standard deviations, and R^{-1} is the inverse of the Cholesky of the population correlation matrix.

If I want to sample from a standard normal I need to introduce the change of variables |J(\text{diag}(1 / \sigma)R^{-1} \rightarrow L\text{diag}(1 / \sigma)R^{-1})|. I can express L\text{diag}(1 / \sigma)R^{-1} symbolically, but I’m not sure how to obtain the Jacobian of the transformation. I have a feeling the Jacobian will be triangular, so its determinant is just the product of the diagonal values, but that’s where I got stuck.
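For what it’s worth, the triangularity hunch can be made concrete (a sketch, not part of the thread). Row-wise the map is linear and does not depend on the data, so applying it to each of the n rows gives

```latex
w_i = M x_i, \qquad M = L\,\mathrm{diag}(1/\sigma)\,R^{-1},
\qquad
p_X(X) = p_W(W)\,\lvert \det M \rvert^{n}.
```

Since L, \mathrm{diag}(1/\sigma), and R^{-1} are all lower triangular, so is their product M, and its determinant is the product of the diagonal entries:

```latex
\det M = \prod_{k=1}^{p} \frac{L_{kk}}{\sigma_k\,R_{kk}}.
```

The resulting log-Jacobian term, n \sum_k (\log L_{kk} - \log \sigma_k - \log R_{kk}), depends on the parameters \sigma and R (L is a constant computed from the data), so it would have to be added to the target density.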

#17

So by chance I had to do the same thing myself just now for my own work and took a fresh attempt at it. Try this:

data {
  int<lower = 0> NvarsX;  // number of variables
  int<lower = 0> N;       // total number of rows
  vector[NvarsX] x[N];    // data for independent vars
}

parameters {
  cholesky_factor_corr[NvarsX] L_Omega;  // L form of correlation matrix
  vector<lower = 0>[NvarsX] L_sigma;     // vector of scales for covariance matrix
  vector[NvarsX] mu;                     // vector of mean values, one per variable
}

transformed parameters {
  matrix[NvarsX, NvarsX] L_Sigma;
  L_Sigma = diag_pre_multiply(L_sigma, L_Omega);  // L form of Sigma covariance matrix
}

model {
  // Priors
  L_Omega ~ lkj_corr_cholesky(1);
  L_sigma ~ normal(0, 5);

  // Likelihood
  x ~ multi_normal_cholesky(mu, L_Sigma);
}

generated quantities {
  matrix[NvarsX, NvarsX] Omega;  // Square form of correlation matrix
  matrix[NvarsX, NvarsX] Sigma;  // Square form of covariance matrix

  // Generate correlation matrix
  Omega = multiply_lower_tri_self_transpose(L_Omega);

  // Generate covariance matrix
  Sigma = multiply_lower_tri_self_transpose(L_Sigma);
}

And the R code to fit it and summarise the correlation matrix:

fit1 <- stan(file = "model1.stan", data = datalist, iter = 2000, chains = 4)
params <- extract(fit1)
sapply(1:NvarsX, function(x) sapply(1:NvarsX, function(y) format(mean(params$Omega[, x, y]), digits = 3)))

Edit: sorry, the R code will give you the correlation matrix. Change $Omega to $Sigma and you should get the covariance.
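As a footnote to that edit (not from the thread): the two generated matrices are related by Sigma = diag(sd) * Omega * diag(sd), which is what the Stan code computes via the Cholesky factors. A quick NumPy check with hypothetical values:

```python
import numpy as np

omega = np.array([[1.0, 0.5, 0.9],
                  [0.5, 1.0, 0.4],
                  [0.9, 0.4, 1.0]])    # hypothetical correlation matrix
sd = np.array([1.5, 2.0, 0.5])         # hypothetical scales (L_sigma)

# Covariance = diag(sd) * Omega * diag(sd)
Sigma = np.diag(sd) @ omega @ np.diag(sd)

# The scales are the square roots of the diagonal of Sigma ...
print(np.allclose(np.sqrt(np.diag(Sigma)), sd))  # True

# ... and dividing out the scales recovers the correlation matrix.
print(np.allclose(Sigma / np.outer(sd, sd), omega))  # True
```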