Hierarchical Correlation?

@spinkney Is fast and looks fancy but the priors over bivariate correlations differ quite a lot depending on the column index. R code showing this is below – passing in a vector of 1’s results in different correlations in each column.

@mike-lawrence It might be nice for some situations, but doesn’t work when you actually need a correlation matrix without explicitly computing a correlation from something (as when using it for integration or non-centered parameterizations), and I’m also vaguely worried that the non-generative nature of it implies weird things that are hard to track down ;)

require(rstan)

smt<-'
functions{
  matrix angle_vec2mat (vector angle, int K) {
    int N = num_elements(angle);
    matrix[K, K] mat = add_diag(identity_matrix(K), rep_vector(-1, K));
    int count = K;
    int pos = 1;
    for (k in 1:K - 1){
      count -= 1;
      mat[k + 1:K, k] = segment(angle, pos, count);
      pos += count;
    }
    return mat;
  }
  
matrix angle2cor (vector angle, int N){
  matrix[N, N] angle_mat = angle_vec2mat(angle, N);
  matrix[N, N] inv_mat = identity_matrix(N);
  inv_mat[, 1] = cos(angle_mat[, 1]);
  
  for (i in 2:N) {
    inv_mat[i, i] = prod(sin(angle_mat[i, 1:(i - 1)]));
    if (i > 2) {
      for (j in 2:i - 1 ) {
        inv_mat[i, j] = cos(angle_mat[i, j]) * prod(sin(angle_mat[i, 1:(j - 1)]));
      }
    }
  }
  return inv_mat;
  }
}
data{
  int d;
  int n;
  vector[d] dat[n];
  real corpriorscale;
}
parameters{
  vector[(d * d - d) / 2] rawcor;
  vector[d] logsdvec;
  vector[d] mu;
}
transformed parameters{
  cholesky_factor_corr[d] mcorchol = angle2cor(rawcor,d);
}
model{
  rawcor ~ normal(0,corpriorscale);
  logsdvec ~ normal(0,10);
  mu ~ normal(0,10);
  dat ~ multi_normal_cholesky(mu, diag_pre_multiply(exp(logsdvec), mcorchol));
}
generated quantities{
  matrix[d,d] mcor = tcrossprod(mcorchol);
}
'


sm <- stan_model(model_code = smt)
expose_stan_functions(sm)

ndim=5
rawcor=rep(1,(ndim*ndim-1)/2)


m=tcrossprod(angle2cor(angle=rawcor,N=ndim))
m

It’s guaranteeing PSD so it will almost surely differ by column. Having data will get it close to true value and you shouldn’t get any issues after initial warmup of not being PSD.

When you call it in R it’s just using that “init”, I’m curious if this works for you with actual data and running iterations over the model?

@spinkney Well, the LKJ approach and the approach I first posted (ie years ago now) both guarantee PSD and don’t differ by column. My ‘semi fix’ also only weakly differs by column, and allows easy hierarchy with decent sampling / optimizing behaviour. The example code I posted is not just about the inits – if you imagine, for example, a normal prior on the rawcor parameters, the fact that some columns end up lower than others for the same input implies that the prior for the bivariate correlation is different.

I’ll have to give some further thought on how this can be incorporated mid-hierarchy while permitting non-centered parameterization of the variates the set of correlations apply to. I’d been thinking about that already and thought I had a plan, but you’ve given me pause and now I’m less confident.

What is the integration use case?

Yeah, I’m definitely worried about that too. I suspect the ultimate properly generative “solution” is to shift everything to an SEM framework with the “parameter expansion” trick to encode the correlations, but this independent-bivariates method is so fast I’m having trouble letting it go quite yet :P

Right, right. And now I get the issue more. Yea, I mean it guarantees it by this specific construction. This - as you said in your first post today - may take some magic to input a normal prior along each column to represent the same prior weight along each dimension.

1 Like

@mike-lawrence re integration, I use subject specific covariance matrices in a kalman filtering setup, for one example. I also like to be able to fix certain values to zero or to certain others, which makes the parameter expansion approach also pretty tricky I think! Oh, and I doubt you can optimize a parameter expanded form, and some of us like to do that ;)

On further consideration, it’s definitely trivial to achieve the shift/scaling associated with a non-centered parameterization, where scaling I understand is the core benefit of NP over CP generally (to avoid the funnel). I’m still mulling on whether the de-correlation of the typical multivariate NP is important and if so whether there’s a way to achieve that still (probably not but I haven’t given up).

1 Like

Ok, so I flipped the problem a bit and I think this works. Let’s get a PD matrix that is as close to possible to the correlations you want.

functions{
  matrix angle_vec2mat (vector angle, int K) {
    int N = num_elements(angle);
    matrix[K, K] mat = add_diag(identity_matrix(K), rep_vector(-1, K));
    int count = K;
    int pos = 1;
    for (k in 1:K - 1){
      count -= 1;
      mat[k + 1:K, k] = segment(angle, pos, count);
      pos += count;
    }
    return mat;
  }
  
matrix angle2cor (vector angle, int N){
  matrix[N, N] angle_mat = angle_vec2mat(angle, N);
  matrix[N, N] inv_mat = add_diag(angle_mat, rep_vector(1, N));
  inv_mat[, 1] = cos(angle_mat[, 1]);
  
  
  for (i in 2:N) {
    inv_mat[i, i] = prod(sin(angle_mat[i, 1:(i - 1)]));
    if (i > 2) {
      for (j in 2:i - 1 ) {
        inv_mat[i, j] = cos(angle_mat[i, j]) * prod(sin(angle_mat[i, 1:(j - 1)]));
      }
    }
  }

  return inv_mat;
  }

 vector lower_elements(matrix M, int tri_size){
    int n = rows(M);
    int counter = 1;
    vector[tri_size] out;
 
    for (i in 2:n){
      for (j in 1:(i - 1)) {
        out[counter] = M[i, j];
        counter += 1;
      }
    }
    return out;
  }
}
data{
  int d;
  int n;
  vector[d] dat[n];
  real corpriorscale;
} 
transformed data {
  int M = (d * d - d) / 2;
}
parameters{
  vector<lower=0, upper=pi()>[M] rawcor;
  vector<lower=-1, upper=1>[M] cors;
  vector[d] logsdvec;
  vector[d] mu;
}
transformed parameters{
  cholesky_factor_corr[d] L_m = angle2cor(rawcor,d);
  corr_matrix[d] mcor = multiply_lower_tri_self_transpose(L_m);
}
model{
  cors ~ normal(lower_elements(mcor, M), corpriorscale);
  logsdvec ~ normal(0,10);
  mu ~ normal(0, 10);
 
  dat ~ multi_normal_cholesky(mu, diag_pre_multiply(exp(logsdvec), L_m));
}

We can really stress test this by flipping the cors parameters into data. Just like you were doing when you were exposing functions. Though we can’t do that anymore because we need the soft constraint of the normal to regularize a PD near the ones we’re looking for.

functions{
  matrix angle_vec2mat (vector angle, int K) {
    int N = num_elements(angle);
    matrix[K, K] mat = add_diag(identity_matrix(K), rep_vector(-1, K));
    int count = K;
    int pos = 1;
    for (k in 1:K - 1){
      count -= 1;
      mat[k + 1:K, k] = segment(angle, pos, count);
      pos += count;
    }
    return mat;
  }
  
matrix angle2cor (vector angle, int N){
  matrix[N, N] angle_mat = angle_vec2mat(angle, N);
  matrix[N, N] inv_mat = add_diag(angle_mat, rep_vector(1, N));
  inv_mat[, 1] = cos(angle_mat[, 1]);
  
  
  for (i in 2:N) {
    inv_mat[i, i] = prod(sin(angle_mat[i, 1:(i - 1)]));
    if (i > 2) {
      for (j in 2:i - 1 ) {
        inv_mat[i, j] = cos(angle_mat[i, j]) * prod(sin(angle_mat[i, 1:(j - 1)]));
      }
    }
  }

  return inv_mat;
  }

 vector lower_elements(matrix M, int tri_size){
    int n = rows(M);
    int counter = 1;
    vector[tri_size] out;
 
    for (i in 2:n){
      for (j in 1:(i - 1)) {
        out[counter] = M[i, j];
        counter += 1;
      }
    }
    return out;
  }
}
data{
  int d;
  int n;
  vector[d] dat[n];
  real corpriorscale;
  real x;
} 
transformed data {
  int M = (d * d - d) / 2;
  vector[M] cors = rep_vector(x, M);
}
parameters{
  vector<lower=0, upper=pi()>[M] rawcor;
 // vector<lower=-1, upper=1>[M] cors;
  vector[d] logsdvec;
  vector[d] mu;
}
transformed parameters{
  cholesky_factor_corr[d] L_m = angle2cor(rawcor,d);
  corr_matrix[d] mcor = multiply_lower_tri_self_transpose(L_m);
}
model{
  cors ~ normal(lower_elements(mcor, M), corpriorscale);
  logsdvec ~ normal(0,10);
  mu ~ normal(0, 10);
 
  dat ~ multi_normal_cholesky(mu, diag_pre_multiply(exp(logsdvec), L_m));
}

Let’s test the original where corpriorscale = 10 and the correlation is 0 so x = 0. Then where corpriorscale = 0.2 and x = 0.9 to show how it pushes the correlation matrix up toward 0.9.

sdat <- list(n=n,d=d, dat=dat,corpriorscale= 10,
             x = 0)
> mod_out$summary("mcor")
# A tibble: 25 x 10
   variable     mean  median     sd    mad      q5    q95  rhat ess_bulk ess_tail
   <chr>       <dbl>   <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl>    <dbl>    <dbl>
 1 mcor[1,1]  1       1      0      0       1       1     NA         NA       NA 
 2 mcor[2,1]  0.722   0.734  0.0907 0.0874  0.551   0.844  1.00    2133.    2383.
 3 mcor[3,1] -0.172  -0.179  0.177  0.178  -0.453   0.130  1.00    4418.    2912.
 4 mcor[4,1]  0.239   0.249  0.173  0.172  -0.0664  0.507  1.00    2147.    2321.
 5 mcor[5,1] -0.409  -0.422  0.153  0.151  -0.641  -0.131  1.00    3097.    2750.
 6 mcor[1,2]  0.722   0.734  0.0907 0.0874  0.551   0.844  1.00    2133.    2383.
 7 mcor[2,2]  1       1      0      0       1       1     NA         NA       NA 
 8 mcor[3,2]  0.0120  0.0155 0.182  0.190  -0.293   0.303  1.00    4187.    2889.
 9 mcor[4,2]  0.828   0.837  0.0600 0.0536  0.718   0.907  1.00    3130.    2966.
10 mcor[5,2] -0.501  -0.515  0.137  0.134  -0.701  -0.251  1.00    3243.    3085.
# … with 15 more rows


sdat <- list(n=n,d=d, dat=dat,corpriorscale= 0.2,
             x = 0.9)
> mod_out$summary("mcor")
# A tibble: 25 x 10
   variable   mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
   <chr>     <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 mcor[1,1] 1      1     0      0      1     1     NA         NA       NA 
 2 mcor[2,1] 0.902  0.907 0.0345 0.0316 0.836 0.950  1.00    1397.    1766.
 3 mcor[3,1] 0.611  0.616 0.0889 0.0892 0.458 0.750  1.00    2151.    2387.
 4 mcor[4,1] 0.718  0.725 0.0777 0.0756 0.575 0.833  1.00    1771.    2087.
 5 mcor[5,1] 0.556  0.559 0.0937 0.0955 0.400 0.701  1.00    1964.    2642.
 6 mcor[1,2] 0.902  0.907 0.0345 0.0316 0.836 0.950  1.00    1397.    1766.
 7 mcor[2,2] 1      1     0      0      1     1     NA         NA       NA 
 8 mcor[3,2] 0.720  0.723 0.0673 0.0672 0.605 0.824  1.00    2942.    2419.
 9 mcor[4,2] 0.942  0.946 0.0247 0.0213 0.894 0.973  1.00    2253.    2528.
10 mcor[5,2] 0.608  0.612 0.0779 0.0787 0.473 0.729  1.00    2801.    2709.

hah ok so some kind of ‘nearest pos def’ hack. Seems a little wild but I’ll think about it!

Ok, inspired by @spinkney

The idea is that you can just use a uniform LKJ_cholesky. Therefore, we really only use it to get a 1) cholesky for a 2) PSD matrix, uniformly distributed across all permissible PSD matrices.

As cor matrices grow, the PSD constraint forces the volume of permissible solutions to amass around an identity matrix. That is, the PSD constraint forces most permissible solutions to be around identity, so there is much greater volume there. To counteract this, you can add more density to solutions where the cor matrix is far from identity. I.e., add density to elements in the cor matrix away from identity.

Sean’s method does this. My approach below is effectively the same thing.

I use a uniform LKJ to get a PSD matrix.
Then I just add density to elements. This can be done in 2 identical ways: Saying target_cor ~ normal(estimated_cor, scalingFactor) or estimated_cor ~ normal(target_cor, scalingFactor). It’s the same thing, due to the normal construction. I implement both, with a switch between the two.

In practice, this means one could have a-priori information about certain correlations, and push that into the model as a normal prior on the elements. The model below is just a simplistic version where we feed in (one) prior location and (one) prior scale. This could be expanded out to each unique correlation element.

functions {
  vector get_lower_tri(matrix mat) {
    int d = rows(mat);
    vector[d * (d - 1) / 2] lower;
    int count = 1;
    for(r in 2:d) {
      for(c in 1:(r-1)) {
	lower[count] = mat[r,c];
	count += 1;
      }
    }
    return(lower);
  }
}

data {
  int d;
  int n;
  vector[d] dat[n];
  real corpriorscale;
  real x;
  int cors_from_normal;
}

transformed data {
  int M = (d*d - d) / 2;
  vector[M] cors = rep_vector(x, M);
}

parameters {
  cholesky_factor_corr[d] cormat_L;
  vector<lower=0>[d] sdvec;
  vector[d] mu;
}

transformed parameters {
  matrix[d,d] cormat = multiply_lower_tri_self_transpose(cormat_L);
  matrix[d,d] covmat_L = diag_pre_multiply(sdvec, cormat_L);
}


model {
  cormat_L ~ lkj_corr_cholesky(1); 
  if(cors_from_normal) {
    cors ~ normal(get_lower_tri(cormat), corpriorscale);
  } else {
    get_lower_tri(cormat) ~ normal(cors, corpriorscale);
  }
  sdvec ~ normal(0,1);
  mu ~ normal(0,1);

  dat ~ multi_normal_cholesky(mu, covmat_L);
}



I get very similar (i.e., basically identical) estimates using @spinkney 's approach in this example.

This approach is fairly easy, and intuitive. 1) Generate [uniform, PSD] matrix 2) Add priors to elements.
My approach and Sean’s are ultimately the same, which is great.

1 Like

That’s great @Stephen_Martin ! This is def simpler and easier to use.

Is there a succinct way to express the generative process for this solution?

Like how one would simulate this in R? Or something else?

It’s ultimately defining a joint prior.

Let R be a cor matrix.

p(R) = \text{LKJ}(R|\eta = 1)\prod_u \mathcal{N}(R_u|\mu_u, \sigma_u)

Where u just indexes the unique elements of R.

I.e., it’s just a joint prior you’d need to sample from. All the Normal contributions are just specifying that you think there is more density for those elements than a uniform LKJ would otherwise imply.

For generating from this… I’d just sample from that prior using MCMC; not sure there’s an easy approach to generating new values. It’s a ‘general’ problem of sampling from joint distributions, not specific to this.

2 Likes

If it helps anyone, I have had a lot of difficulty in Stan with multivariate normal distributions that take a factor structure. The LKJ prior does not do a good job in this case because it tends toward an identity matrix depending on the degrees of freedom. It doesn’t really admit a factor structure.

The one solution I had that seemed to work the best was to basically start thinking in terms of vine copulas. If the univariate distributions are normal and the multivariate dependence is a normal copula in every case, then you can really think of it like regressions. So whether to use a R/C/D vine copula is just a different way to set up the regressions. The first thing I had tried was basically a situation where I take the first variable as a base and for the second variable I regress it on the first, the third is then regressed on the first two. I have also spent a little time setting up a proper C-vine copula, but I got distracted and hadn’t finished it.

Anyway, once you get things in terms of a vine copula, then you are left with individual correlations with a structure that depends on how you set it up (i.e. the individual correlations in a C-vine vs. a D-vine are very different). However, you can still take a hierarchical approach to these correlations. You might have fewer problems with trying to keep the correlation matrix positive semi-definite because of how the vines are constructed.

1 Like

That’s interesting, do you have some examples in stan and how that relates to c-vines or d-vines? I don’t know much about these except from the LKJ paper which I also didn’t quite understand. Would love to know more about this approach

As I said, I started on a C-vine version but got distracted by other things so it’s on the backburner right now, but I hope to finish it when I have more time.

It’s all pretty simple though when you deal with normal margins and normal dependence because you can think of it like a series of regressions. It took me a while to really grok what they are talking about on the wikipedia page [1], but for a C-vine in this case (normal univariate, normal dependence) it’s like getting the correlation of one variable with the others, then regressing each of the other variables on that one and get the correlations of the residuals of those pairs, and then you do a regression with two variables and get the correlations of the resulting pairs of residuals, and so on. It has levels to it.

The only annoying part is organizing the data so that everything is going in the right place. If Stan had ragged arrays built-in, then it would be much easier to express the pattern. You have to use this database approach to get the right data. The nice thing though is that it is easy to enforce the constraints on the correlations and the slopes don’t need constraints.

[1] Vine copula - Wikipedia

1 Like

From first reading these vines piggyback off the bivariate copulas to get multivariate ones without the complication of actually trying to derive those. This is similar to how bivariate copulas generalized bivariate distributions by using univariate marginals.

I’ve written up the multivariate normal copula and other bivariate ones. I was really more curious with a simple example of how one would code up, say, connecting two different bivariate copula families or a bivariate copula and a univariate distribution. Simple examples that build up intuition to code up the more general cases.

@moderators I propose moving the last few replies to a new topic. I don’t know how to bulk move. A separate topic starting with @jmh530’s post on vines would be better. And linking back here.

1 Like

I agree that simple examples help build intuition. That’s why I would do it first using the bivariate normal copula likelihoods and check that the vine version produces the same results. Then you can write a non-normal bivariate copula likelihood and see what happens when you change one of them.

Check out pages 9 and 10 of this

I think that makes it pretty clear on how the decomposition is happening. Page 9 would require you to condition variables 1 and 3 on variable 2, choose a density for that relationship, and then fit a copula. C/R/D vine is just about how to choose the decomposition.