Hierarchical prior (for partial pooling) on correlation matrices?


Here is the scenario. I have three very similarly collected datasets with very similar models. In fact, the model for 1 and 2 are identical. Dataset 3 contains all of the variables of 1 and 2, plus a few more.

I want to conduct an integrated data analysis across all datasets, in which all shared parameters across datasets are distributed according to a hierarchical prior. This is, in essence, a random effect model, on all shared parameters.

I have already estimated the models for each dataset individually, and merely wish to make a ‘mega model’ of sorts, with priors on shared parameters.
One of these parameters is the correlation matrix for some random effects.

Studies 1-2 have a cholesky corr matrix of dimension 6, and study 3 has a cholesky corr matrix of dimension 9.
Now, if all three studies had identically-sized corr matrices, then I would simply put a prior on each element of the cholesky corr matrix, and construct the respective cholesky corr matrices.

But because study 3 has a 9x9 (all of the previous cors, but a few new ones), I don’t think that would be appropriate, because the elements of a cholesky factorization changes with dimensionality (correct? or incorrect?).

Should I simply place priors on all of the shared raw correlations, then construct respective correlation matrices, rather than operating on the cholesky elements?


The Cholesky factor of a correlation matrix is about conditional correlations. So, if you partition the 9 variables into {6 common, 3 new}, then the upper left 6x6 block of the Cholesky factor of the 9x9 correlation matrix is the Cholesky factor of the 6x6 correlation matrix among the common variables. So, in other words, you need an ordering, a cholesky_factor_corr[9] L, and some L[1:6,1:6].


Interesting! Thank you.

What distribution is recommended for individual cholesky elements? Is there a distribution like Beta, but ranging from -1 to 1?


Could I use, e.g.,
cor_raw ~ beta(a,b);
then later use:
cor = (cor_raw-.5)*2, to effectively shift the distribution?


You really need a joint prior over positive definite correlation (sub)matrices. When K > 2, the positive definiteness constraint is a complicated nonlinear function and any set of marginal priors on the individual correlations is going to put positive density on symmetric matrices with unit diagonal that are not positive definite. But the LKJ joint prior implies that the individual correlations are distributed beta over the (-1,1) interval with both shape parameters equal. So, I would use that.


I’ll see if the positive definite problem crops up. If so, I’ll just use LKJ, but in the mean time, I coded this up.
It basically reparameterizes beta as mu/nu (mean, ‘sample size’), then defines a correlation distribution using a rescaled beta_mu_nu. Another parameterization uses mu and sigma instead.

Seems to work well in the individual case; just meta-analyzed simulated correlation coefficients, and it took about 2-3 seconds to compute the hierarchical distribution over them.
Was basically just:

r ~ correlation_sigma(true_r, se_r);
true_r ~ correlation_sigma(mu_r, sd_r);
mu_r ~ correlation_sigma(0,.5);
sd_r<lower=0,upper=1> ~ normal(0,.3);

It recovered the true_r and distributional params well.

functions {
	// Computes variance of correlation distribution, given a mu and nu.
	// Basically just computes variance of implied beta distribution, and multiplies by 4
	// Var(2(beta - .5)) = 4Var(beta)
	real variance_correlation(real mu, real nu){
		real a;
		real b;
		real mu_z;
		mu_z = .5*mu + .5;
		a = mu_z*nu;
		b = nu*(1 - mu_z);
		return(4*(a*b)/((a + b)^2*(a + b + 1)));
	// Beta distribution in mu, nu parameterization; nu = (a + b)
	real beta_mu_nu_lpdf(vector x, vector mu, vector nu){
		int len = num_elements(x);
		vector[len] a;
		vector[len] b;
		a = mu .* nu;
		b = nu .* (1 - mu);
		return(beta_lpdf(x | a, b));
	// Takes vector of correlations, vector of beta mu, and beta nu values respectively
	// Basically takes mean and rs, puts them into domain of beta, computes the density, and applies jacobian for transformation
	real correlation_lpdf(vector r,vector mu,vector nu){
		int len = num_elements(r);
		vector[len] r_z; // R -> Beta domain
		vector[len] mu_z; // Mu -> Beta domain
		//real jacobian; //Jacobian is a constant. d(r_z)/d(r) = .5, so should be log(.5)*len.
		r_z = .5*r + .5;
		mu_z = .5*mu + .5;
		return(len*log(.5) + beta_mu_nu_lpdf(r_z | mu_z, nu));
	// Take mean and rs, transform to domain of beta. Applies jacobian.
	// Converts sigma to variance of correlation dist; converts to beta variance (1/4 correlation variance)
	// Solves for nu in beta_mu_nu, using implied beta mu and variance
	real correlation_sigma_lpdf(vector r, vector mu, vector sigma){
		int len = num_elements(r);
		vector[len] r_z;
		vector[len] mu_z;
		vector[len] nu_z;
		vector[len] sigma2_z;
		r_z = .5*(r + 1);
		mu_z = .5*(mu + 1);
		sigma2_z = (sigma .* sigma) / 4;
		nu_z = (mu_z) .* (1 - mu_z) ./ sigma2_z - 1;
		return(len*log(.5) + beta_mu_nu_lpdf(r_z | mu_z, nu_z));


What do you mean “if it comes up”? The probability of an unconstrained 9x9 symmetric matrix with unit diagonal being positive definite is 6.41885e-11.


Ah, I see. So is there no way of placing hierarchical priors on correlations within a matrix then? If all correlations across studies are < |.2|, except for a few correlations which are within [.4, .9], it seems like it would be useful to have some sort of prior that doesn’t just operate on ‘identity-ness’ of the overall matrix.


There are things that can be done, but almost nothing besides the constructions that Stan uses guarantees positive definiteness. One thing is to utilize the fact that a convex combination of correlation matrices is a correlation matrix. So, if you have \boldsymbol{\Lambda}_0 as the common correlation matrix and \boldsymbol{\Lambda}_j as the deviation in the correlation matrix for the j-th group, then \left(1 - \alpha\right) \boldsymbol{\Lambda}_0 + \alpha \boldsymbol{\Lambda}_j is the correlation matrix for the j-th group with \alpha \in \left[0,1\right]. You could then put LKJ priors on \boldsymbol{\Lambda}_j with shape \frac{1}{\alpha} or e^\alpha to encourage the group-specific deviations to be small.


Oh wait; Is there a way to force positive definiteness?
Say I had a cholesky factor of a correlation matrix.
At [1,1], I simply put 1, 0s on the upper diagonal.

Hierarchically pool lower triangle. Constrain the diagonal to be sqrt(1 - sum(rowValues)^2).
Would that then produce a valid correlation matrix?


Not if sum(rowValues^2) > 1


Right; I see (Just tried it; estimated fine, but had divergences - The parameters were recovered well though).

I can’t see exactly what you wrote above, because it says “Undefined control sequence \boldsymbol”

If you could remove the \boldsymbol, I’d be grateful, so I can better understand what you wrote.

I take it this is a sort of non-centered parameterization for cor matrices? Estimate a common matrix, estimate ‘difference’ matrices, and combine them to produce a group-specific matrix?


Ah I see; is \alpha a parameter, or is it fixed? I guess it’s basically just the scaling parameter, correct? As \alpha \rightarrow 1, the LKJ distribution approaches uniformity, and less pooling occurs. As \alpha \rightarrow 0, the LKJ prior is sharply peaked around identity, and full pooling basically occurs (Because all the weight is on \Lambda_0 ? Is that right?

And \Lambda_j is essentially a difference matrix? So it’s a bit like the non-centered random effect approach?


You may have to enable Javascript or something to see the \LaTeX, but yes it was just Lambda_j. You would declare real<lower=0,upper=1> alpha; in the parameters block and estimate it, perhaps with a Beta prior.


I’ll give it a shot. Thank you so much for your help. Stan discourse is the ultimate applied stats journal.


Thank you @bgoodri
That worked strikingly well on a test case of mine.
It was also quite fast. With a corr_matrix[9] \Lambda_0 and 4 corr_matrix[9] \Lambda_j, it took a minute.
The code used:
BFI is just a list of 4 100x9 subsets from the psych::bfi dataset in R.

Are there any efficiency improvements available here? I assume cholesky correlation matrices do not have the convex combination property; I could still sample as cholesky, but I assume the act of transforming it to a correlation matrix, then perhaps back to a cholesky factor for multi_normal_cholesky would undo the efficiency benefits of using cholesky in the first place.

data {
	int N;
	row_vector[9] bfi[4,N];

parameters {
	real<lower=0,upper=1> alpha; //Pooling coefficient
	corr_matrix[9] R_mu;
	corr_matrix[9] R_diff[4];

transformed parameters {
	corr_matrix[9] R_group[4];
	for(g in 1:4){
		R_group[g] = (1 - alpha)*R_mu + alpha*R_diff[g];

model {
	R_mu ~ lkj_corr(3);
	alpha ~ beta(2,2);
	for(g in 1:4){
		R_diff[g] ~ lkj_corr(1/alpha);
		bfi[g] ~ multi_normal(rep_vector(0.0,9),R_group[g]);


The convex combination thing has to be done on correlation matrices, but you can get it slightly faster and more numerically stable by declaring things as cholesky_factor_corr and forming the correlation matrices yourself with multiply_lower_tri_self_transpose. That doesn’t help with the multivariate normal part, but it does allow you to do ~ lkj_corr_cholesky(shape) rather than ~lkj_corr(shape) to avoid having to do a Cholesky decomposition in order to evaluate the determinant in the LKJ PDF (when shape is not 1).


One more question:

Again, 2 samples have a 6x6 cor matrix; one sample has 9x9 matrix, with the 6x6 embedded. I want the embedded 6x6 to be pooled along with the other 6x6 matrices toward an overall \Lambda_0. Does the convex combination property hold for subsets of correlation matrices as well?

Should there be a 9x9 \Lambda_0 (global matrix), and R_{j,1:6,1:6} = (1 - \alpha)\Lambda_{0,1:6,1:6} + \alpha \Lambda_j, where \Lambda_j is a 6x6 matrix? Then for sample 3, the last 3 columns/rows should just be whatever is in those cells for \Lambda_0?


Yes it holds for all submatrices of a correlation matrix.


I can’t get the approach to work. It throws positive definite errors for R_group_nine, which is the 9x9 correlation matrix for sample 3.

Essentially, I estimate a 9x9 global correlation matrix (cholesky), and 3 6x6 correlation matrices (cholesky; one for each sample).
I combine them using the function you mentioned, to produce R_group correlation matrices - The 6x6 pooled correlation matrix for each sample.
For samples 1 and 2, these are the correlations used in multi_normal().
For sample 3, I embed those values into an overall 9x9 matrix; the remaining 3 rows/columns are filled in with respective values from the 9x9 global matrix.
The positive definite matrix is failing on that 9x9 correlation matrix for sample 3.

Below is the test code I am using.

functions {
	matrix convex_combine_cholesky(matrix global_chol_cor, matrix local_chol_cor, real alpha){
		int dim = rows(global_chol_cor);
		matrix[dim,dim] global_cor = multiply_lower_tri_self_transpose(global_chol_cor);
		matrix[dim,dim] local_cor = multiply_lower_tri_self_transpose(local_chol_cor);
		matrix[dim,dim] combined_chol_cor = cholesky_decompose((1 - alpha)*global_cor + (alpha)*local_cor);

data {
	int N;
	int G;
	row_vector[6] bfi[G-1,N];
	row_vector[9] bfi_nine[N];

parameters {
	real<lower=0,upper=1> alpha; //Pooling coefficient; hi alpha, independence; lo alpha, total dependence
	cholesky_factor_corr[9] R_global; //Overall corr matrix
	cholesky_factor_corr[6] R_diff[G]; // "Difference" matrices of sorts

transformed parameters {
	corr_matrix[6] R_group[G]; // Construct group-specific correlation matrices
	corr_matrix[9] R_group_nine; // The 9x9 matrix in which sample 3's 6x6 is embedded
	corr_matrix[9] R_global_cor = multiply_lower_tri_self_transpose(R_global);
	for(g in 1:G){
		R_group[g] = multiply_lower_tri_self_transpose(convex_combine_cholesky(R_global[1:6,1:6],R_diff[g],alpha));
	// Fill in sample 3's 9x9 matrix with the 6x6 values + the remaining rows/columns from R_global_cor
	for(c in 1:6){
		for(r in c:6){
			R_group_nine[r,c] = R_group[3,r,c];
			R_group_nine[c,r] = R_group_nine[r,c];
	for(r in 7:9){
		for(c in 1:r){
			R_group_nine[r,c] = R_global_cor[r,c];
			R_group_nine[c,r] = R_group_nine[r,c];

model {
	R_global ~ lkj_corr(3); 
	alpha ~ beta(2,2); //Can be tweaked
	for(g in 1:2){
		R_diff[g] ~ lkj_corr_cholesky(1/alpha); //1/alpha keeps differences large when alpha high; small when alpha low
		bfi[g] ~ multi_normal(rep_vector(0.0,6),R_group[g]);
	R_diff[3] ~ lkj_corr_cholesky(1/alpha);
	bfi_nine ~ multi_normal(rep_vector(0.0,9),R_group_nine);