Multivariate normal with highly correlated data

fitting-issues
performance

#1

Hi, I’m having issues fitting a simple multivariate normal model. The model works well for many datasets, but not when the input data are highly correlated, e.g., with pairwise correlations around 0.9. Highly correlated data happens to be exactly the scenario I’m developing a model for.

Below is a simplified example:

data{
  int<lower = 0> n;   // number of observations
  int<lower = 0> p;   // number of variables
  matrix[n, p]    x;
  // hyperparameters
  real<lower = 0> priorEta;
  real<lower = 0> priorSdx;
}
transformed data{
  vector[p] zeros = rep_vector(0, p);
}
parameters{
  cholesky_factor_corr[p] Lcorr;  // Cholesky factor of the correlation matrix
  vector<lower=0>[p]      sdx;    // standard deviations
}
model{
  // build the covariance Cholesky factor once, not once per observation
  matrix[p, p] L = diag_pre_multiply(sdx, Lcorr);
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  for (i in 1:n) {
    x[i, ] ~ multi_normal_cholesky(zeros, L);
  }
}

And the R code I use to run the model:

rm(list = ls())
library(rstan)

n <- 100
p <- 3

# target correlation matrix; as written it is not positive definite,
# so project it onto the nearest PD correlation matrix
r <- matrix(c(1.0, 0.5, 0.9,
              0.5, 1.0, 0.0,
              0.9, 0.0, 1.0), 3, 3)
r <- as.matrix(Matrix::nearPD(r, corr = TRUE)$mat)
sdx <- rep(1, p)

s <- diag(sdx) %*% r %*% diag(sdx)   # covariance matrix: D R D
x <- mvtnorm::rmvnorm(n, sigma = s)  # simulate correlated data

dataList <- list(
  x = x,
  n = n,
  p = p,
  priorEta = 1,
  priorSdx = 1
)

res  <- stan(file = "stan/question/model.stan",  data = dataList, chains = 2)
ss  <- summary(res)

# Rhat is meh
ss$summary[!is.na(ss$summary[ ,"Rhat"]), "Rhat"]
# output
Lcorr[2,1] Lcorr[2,2] Lcorr[3,1] Lcorr[3,2] Lcorr[3,3]     sdx[1] 
  1.121353   1.123117   1.794683   1.794350   1.578169   1.703874 
    sdx[2]     sdx[3]       lp__ 
  1.239409   1.881612   1.085317 
Warning messages:
1: There were 1861 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
2: Examine the pairs() plot to diagnose sampling problems

Some observations:

If the correlations among predictors are approximately 0, a chain of 2000 samples takes about 10 seconds to run. If the correlations among predictors are high, this can increase to about 500 seconds per chain! I’m aware I can rewrite the normal model in terms of its sufficient statistics (sketched below), which gives a good performance boost, but it doesn’t improve the quality of the sampling (the R-hats stay high).
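
For concreteness, this is roughly what the sufficient-statistics version looks like (a sketch, assuming n > p so that X^TX admits a Cholesky factor):

transformed data{
  // X'X is sufficient for a zero-mean MVN; factor it once,
  // so the likelihood no longer loops over the n rows
  matrix[p, p] S = crossprod(x);
  matrix[p, p] C = cholesky_decompose(S);
}
model{
  matrix[p, p] L = diag_pre_multiply(sdx, Lcorr);
  matrix[p, p] A = mdivide_left_tri_low(L, C);  // L \ C
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  // sum_i log N(x_i | 0, LL') = const - n * sum(log(diag(L))) - 0.5 * ||L \ C||_F^2
  target += -n * sum(log(diagonal(L))) - 0.5 * sum(A .* A);
}

(The data and parameters blocks are as in the model above; zeros is no longer needed.) This replaces the n per-row density evaluations with a single p-by-p triangular solve, which is where the performance boost comes from; the high R-hats remain, as noted.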

Is there a reparametrization trick, or some other method to improve the sampler’s efficiency, that I’m missing?
Any comments, suggestions, or advice are welcome!


#2

Try a higher max treedepth


#3

Thanks! I suppose that works, although it’s not the solution I was hoping for. I’m wondering if there exists something more along the lines of this: http://mc-stan.org/users/documentation/case-studies/qr_regression.html . Of course, it could be that such a transformation doesn’t exist for this scenario.
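
For reference, the heart of that case study is a one-time QR rotation of the regression design matrix. A sketch of that setting (not of my covariance problem; function names as in the Stan manual):

data{
  int<lower = 0> N;
  int<lower = 0> K;
  matrix[N, K] X;   // design matrix
  vector[N] y;
}
transformed data{
  // thin QR decomposition, scaled as in the case study
  matrix[N, K] Q_ast = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R_ast = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_ast_inverse = inverse(R_ast);
}
parameters{
  real alpha;
  vector[K] theta;        // coefficients on the rotated (orthogonal) scale
  real<lower = 0> sigma;
}
model{
  y ~ normal(alpha + Q_ast * theta, sigma);  // Q_ast has orthogonal columns
}
generated quantities{
  vector[K] beta = R_ast_inverse * theta;    // back to the original scale
}

The decomposition is done once on the data, so it’s cheap; what’s not obvious to me is the analogue when the “design matrix” is itself the thing whose covariance I’m estimating.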


#4

You can use that technique too; you just repeatedly pay the QR cost.


#5

Ok, a question. You have defined a set of x and y variables in your R code, but only the x variables are fed into Stan - what exactly are you trying to model here? Simply the correlations amongst x?


#6

Ah my bad, that’s an artifact from a possible expansion where the x variables (predictors) are related to y (a target variable). While playing around with the model, I noticed that fitting a model where x and y are unrelated already gave me difficulties when x is highly correlated (hence the question). I’ll edit and remove the references to y.


#7

Ok, my next question. Why are you regressing against “zeros”?


#8

I’m not interested in the means, just the correlations and variances. I suppose I could model the means, but would that improve the sampling efficiency? Usually I just center the input and set the mean of the distribution to zero.
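
Concretely, the centering can even live inside the Stan program; a tiny sketch:

transformed data{
  matrix[n, p] xc;  // column-centered copy of the data
  for (j in 1:p) {
    xc[, j] = x[, j] - mean(x[, j]);
  }
}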


#9

If you center the means you could also do some rotation s.t. the variables were no longer so correlated.


#10

Ah ok. But what I’m wondering is if the general form of the model is:
Posterior ~ model(data) * prior
By using model(data) = 0, are you not defining the whole RHS = 0? (I could be wrong here - but this is what I understand.)


#11

That’s what I am looking for. I suppose I could introduce X^TX = \Sigma = Q\Lambda Q^T (eigendecomposition) and then use W = XQ\Lambda^{-\frac{1}{2}} such that W^TW = I. But then if I model W\sim N(0, \text{diag}(\sigma)LL^T\text{diag}(\sigma)), how do I transform \text{diag}(\sigma) and L into a and B such that these are the parameters of X \sim N(0, \text{diag}(a)BB^T\text{diag}(a))? I realize this happens in the transformed parameters block, but I just don’t know the specific transformations I should apply.


#12

I think it just means that

P(\sigma, L \mid X) \propto N(X\mid 0, \text{diag}(\sigma)LL^T\text{diag}(\sigma))P(\sigma)P(L),

where \sigma are the standard deviations and L is the Cholesky factor of the correlation matrix. Setting the means to zero just simplifies the quadratic form: (x_i - \mu)^T(\text{diag}(\sigma)LL^T\text{diag}(\sigma))^{-1}(x_i - \mu) = x_i^T(\text{diag}(\sigma)LL^T\text{diag}(\sigma))^{-1}x_i.
So the inner product of the normal distribution simplifies a little bit.


#13

But your model above does not have any x_i on the RHS. You only have x on the left. So then your x_i - \mu term simplifies to -\mu = 0, leaving Stan with 0 fixed on the right-hand side, and therefore it is exploring infinite parameter space (I believe). If instead of zeros you allow a parameter b with a prior centred on a mean of 0, then it can explore a non-infinite parameter space (I think). So for example:

data{
  int<lower = 0> n;
  int<lower = 0> p;
  matrix[n, p]    x;
  // hyperparameters
  real<lower = 0> priorEta;
  real<lower = 0> priorSdx;
}
//transformed data{
//  vector[p] zeros = rep_vector(0, p);
//}
parameters{
  cholesky_factor_corr[p] Lcorr;
  vector<lower=0>[p]      sdx;
  vector[p] b;  // mean vector, replacing the fixed zeros
}
model{
  b     ~ normal(0, 1);
  sdx   ~ cauchy(0, priorSdx);
  Lcorr ~ lkj_corr_cholesky(priorEta);
  for (i in 1:n) {
    x[i, ] ~ multi_normal_cholesky(b, diag_pre_multiply(sdx, Lcorr));
  }
}

generated quantities{
    matrix[p, p] r_mat;

    // Generate correlation matrix from Lcorr
    r_mat = multiply_lower_tri_self_transpose(Lcorr);
}

I ran this, and then the following to recover the correlation matrix:

res <- stan(file = "Stan models/forum_question.stan", data = dataList, chains = 4, iter = 2000)
ss <- summary(res)
params <- extract(res)
sapply(1:p, function(y) sapply(1:p, function(x) median(params$r_mat[, x, y])))
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.5672136 0.8814693
[2,] 0.5672136 1.0000000 0.1058976
[3,] 0.8814693 0.1058976 1.0000000

This is crudely correct, but took a long time. So I do a traceplot of the betas:

(traceplot not shown)

Oooo, that’s a bit ugly. So we found roughly the right answers, but it took a long time and the betas are wandering around in space a bit much. I think we can fix this with a non-centered parameterisation… give me a few mins to implement that.


#14

Ok, an edit on this post.
The non-centered parameterisation did not help. I also ran the centered version from one post up with tighter priors, and that didn’t help either. I’m sorry, I don’t know how to write LaTeX equations - but I think your problem is that your model code only has parameters and priors on the RHS, and no data. All your data is on the left-hand side of the model equation, so I think identifiability problems are inescapable unless you change that. This is just what I think looking at your (and my) code above. So I don’t think your code corresponds to the equations that you wrote above. But I may well be wrong - I’m only learning this stuff too.


#15

I have had difficulty with multivariate normals in the past with Stan.

Here’s something I think I’ve never tried: the Stan documentation often recommends centering and standardizing variables so that they have a mean of 0 and a standard deviation of 1. What about going a step further and standardizing so that they have zero correlation as well, using the Cholesky factor of the sample covariance matrix? Then fit a multivariate normal to this transformed data set and adjust the estimate of the covariance matrix afterwards.
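
Something like this, roughly (an untested sketch; it assumes the data are centered and n > p, so the sample covariance has a Cholesky factor):

data{
  int<lower = 0> n;
  int<lower = 0> p;
  matrix[n, p] x;  // centered data
}
transformed data{
  // whiten the rows of x with the sample-covariance Cholesky factor
  matrix[p, p] L_hat = cholesky_decompose(crossprod(x) / (n - 1));
  matrix[n, p] z = (mdivide_left_tri_low(L_hat, x'))';  // rows of z are ~uncorrelated
  vector[p] zeros = rep_vector(0, p);
}
parameters{
  cholesky_factor_corr[p] Lcorr;
  vector<lower=0>[p] sdx;
}
model{
  sdx   ~ cauchy(0, 1);
  Lcorr ~ lkj_corr_cholesky(1);
  for (i in 1:n) {
    z[i, ] ~ multi_normal_cholesky(zeros, diag_pre_multiply(sdx, Lcorr));
  }
}
generated quantities{
  // undo the whitening: Sigma_x = L_hat * Sigma_z * L_hat'
  matrix[p, p] Sigma_z = multiply_lower_tri_self_transpose(diag_pre_multiply(sdx, Lcorr));
  matrix[p, p] Sigma_x = L_hat * Sigma_z * L_hat';
}

One caveat: the whitening matrix is fixed data, so no Jacobian is needed for the fit itself, but the LKJ prior now sits on the whitened scale, so the implied prior on Sigma_x is not the same as before.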


#16

Thanks for the suggestion! I considered this, but I failed at working out the Jacobian of the transformation. The crucial part of the transformation is L\text{diag}(1 / \sigma)R^{-1}, where L is the Cholesky of the sample covariance matrix, \text{diag}(1 / \sigma) is a diagonal matrix containing the inverse of the population standard deviations, and R^{-1} is the inverse of the Cholesky of the population correlation matrix.

If I want to sample from a standard normal I need to introduce the change of variables |J(\text{diag}(1 / \sigma)R^{-1}\rightarrow L\text{diag}(1 / \sigma)R^{-1})|. I can express L\text{diag}(1 / \sigma)R^{-1} symbolically, but I’m not sure how to obtain the Jacobian of the transformation. I have a feeling the Jacobian will be triangular, so its determinant is just the product of the diagonal values, but that’s where I got stuck.
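
Writing that hunch out: each factor is lower triangular, so M = L\text{diag}(1/\sigma)R^{-1} is lower triangular as well, and its determinant is the product of the factors’ diagonal entries,

\log|\det M| = \sum_{i=1}^{p}\left(\log L_{ii} - \log\sigma_i - \log R_{ii}\right),

which the change of variables would add once per row of data, i.e., n times. Whether that is the complete correction is exactly the part I’m unsure about.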


#17

So by chance I had to do the same thing myself just now for my own work and took a fresh attempt at it. Try this:

data{
    int<lower=0> NvarsX;   // number variables
    int<lower=0> N;        // Total number of rows
    vector[NvarsX] x [N];  // data for independent vars
}

parameters{
    cholesky_factor_corr[NvarsX] L_Omega; // L form of correlation matrix
    vector<lower=0>[NvarsX] L_sigma;      // Vector of scales for covariance matrix
    vector[NvarsX] mu;              // vector of mean value for each variable
}

transformed parameters{
    matrix[NvarsX, NvarsX] L_Sigma;
    L_Sigma = diag_pre_multiply(L_sigma, L_Omega);    // L form of Sigma covariance matrix
}

model{
    //Priors
    L_Omega ~ lkj_corr_cholesky(1);
    L_sigma ~ normal(0, 5);

    //likelihood
    x ~ multi_normal_cholesky(mu, L_Sigma);
}

generated quantities{
    matrix[NvarsX, NvarsX] Omega;   // Square form of correlation matrix
    matrix[NvarsX, NvarsX] Sigma;   // Square form of covariance matrix
    
    // Generate correlation matrix
    Omega = multiply_lower_tri_self_transpose(L_Omega);     
    Sigma = quad_form_diag(L_Omega, L_sigma);               
}

This successfully recovers the design covariance on my own test data, so hopefully it works for you too!
This R code should recover the covariance matrix:

fit1 = stan(file = "model1.stan", data = datalist, iter = 2000, chains = 4)

params <- extract(fit1)
sapply(1:NvarsX, function(x) sapply(1:NvarsX, function(y) format(mean(params$Omega[, x, y]), digits = 3)))

Edit: sorry, the R code above will give you the correlation matrix. Change $Omega to $Sigma and you should get the covariance matrix.