# Applying Cholesky factorization to a QR

Hi everyone, I am trying to apply a QR reparametrization to a Cholesky-factorized hierarchical model. The source I am using for both is the Stan user's guide, specifically sections 1.2 and 1.13.

My Cholesky-factorized model is this:

\beta = \mu + \Sigma_l z

where \beta is a matrix of cluster-specific slopes, \mu is the matrix of expected values, \Sigma_l is the Cholesky factor of the covariance matrix, and z_{i,j} \sim \mathcal{N}(0,1). Therefore,

\beta \sim \mathcal{MN} (\mu, \Sigma_l \Sigma_l')

Now, by applying the QR reparametrization (the notation is the same as the guide), I need to put a prior on the transformed parameter \theta = R^* \beta .

\theta = R^*\mu + R^*\Sigma_l z
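Spelling out the covariance this implies (just the standard algebra, assuming Cov(z) = I):

\operatorname{Cov}(\theta) = \operatorname{Cov}(R^* \Sigma_l z) = R^* \Sigma_l \operatorname{Cov}(z) \Sigma_l' R^{*\prime} = R^* \Sigma_l \Sigma_l' R^{*\prime}

so whatever goes into cholesky_decompose has to equal R^* \Sigma_l \Sigma_l' R^{*\prime}. (Note also that R^* \Sigma_l is itself a square root of this matrix, just not a triangular one.)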

My question is: what is the correct and most efficient way of programming this in Stan? What I did was the following (but I suspect there is a better way).

parameters {
  vector<lower=0,upper=pi()/2>[K] tau_unif;
  cholesky_factor_corr[K] L_Omega;
}

transformed parameters {
  matrix[K,K] L_Sigma;
  matrix[K,J] theta;                      // J = number of groups
  vector<lower=0>[K] tau = tan(tau_unif);

  L_Sigma = cholesky_decompose(R_ast * diag_pre_multiply(tau, L_Omega)
                               * diag_pre_multiply(tau, L_Omega)' * R_ast');

  theta = R_ast * mu + L_Sigma * z;
}


One small question that also bothers me: do I need to pull the constant (a vector of 1's) out of the data matrix when I do the QR reparametrization? And if so, why?

Thanks again for all the help

I am also posting the code of the simple Cholesky model, although it is basically the same as the user's guide:

transformed parameters {
  beta = mu + diag_pre_multiply(tau, L_Omega) * z;
}
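As a sanity check on this transform, here is a small pure-Python sketch (the 2×2 numbers are made up for illustration) confirming that `diag_pre_multiply(tau, L_Omega)` times its own transpose reproduces the intended covariance diag(τ)·Ω·diag(τ):

```python
# Pure-Python check that the non-centered transform
#   beta = mu + diag_pre_multiply(tau, L_Omega) * z,  z_ij ~ N(0, 1)
# implies a covariance of diag(tau) * Omega * diag(tau),
# where Omega = L_Omega * L_Omega'. The 2x2 numbers are arbitrary.

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

tau = [1.5, 0.7]                         # scale (SD) vector
L_Omega = [[1.0, 0.0],                   # Cholesky factor of a correlation matrix
           [0.3, (1 - 0.3 ** 2) ** 0.5]]

# diag_pre_multiply(tau, L_Omega) is diag(tau) @ L_Omega
L_Sigma = [[tau[i] * L_Omega[i][j] for j in range(2)] for i in range(2)]

# Covariance implied by the transform: L_Sigma @ L_Sigma'
Sigma = matmul(L_Sigma, transpose(L_Sigma))

# The same matrix built directly: diag(tau) @ Omega @ diag(tau)
Omega = matmul(L_Omega, transpose(L_Omega))
Sigma_direct = [[tau[i] * Omega[i][j] * tau[j] for j in range(2)] for i in range(2)]

for i in range(2):
    for j in range(2):
        assert abs(Sigma[i][j] - Sigma_direct[i][j]) < 1e-12
```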


I am growing more confident in my approach; from a theoretical standpoint it seems correct. I am wondering, though, whether it is the most efficient.


I would also be interested in hearing from someone with deep knowledge of the math library, and of the underlying representation of Eigen matrices, commenting on this.


I was actually musing just this past weekend on whether a QR representation of mid-hierarchy structures like this would make sense. Certainly I've only ever seen the QR transform applied to predictor matrices that are data variables.

But it seems here you are looking to QR-transform the Cholesky matrix, which is different from what I was thinking. I thought that the Cholesky representation of the covariance matrix was itself guaranteed to have orthogonal structure (uncorrelated/low-correlated columns), but maybe that's not guaranteed to be the case?

No wait, I am afraid I was not clear: I am using QR on the data matrix and applying Cholesky on the coefficients of my likelihood. I am posting the entire model to be as clear as possible.

This is a sketch of the model (I am sorry I omitted the likelihood before, I should have been clearer). I am applying QR on X and Cholesky on the variance-covariance matrix of \beta. But because of the QR, I need to put a prior on \theta = R^*\beta, hence my question about applying Cholesky to \theta.

y \sim \mathcal{N}(X\beta, v) \\ \beta \sim \mathcal{MN}(\mu, \Sigma_l \Sigma_l')
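For completeness, the substitution that makes \theta the natural coefficient (just rewriting the likelihood with X = Q^* R^*):

X\beta = Q^* R^* \beta = Q^* \theta \quad \Longrightarrow \quad y \sim \mathcal{N}(Q^* \theta, v), \qquad \theta = R^* \beta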

Here’s the code:

data {
int<lower=0> N;              // num individuals
int<lower=1> K;              // num ind predictors
int<lower=1> J;              // num groups
int<lower=1> L;              // num group predictors
int<lower=1,upper=J> g[N];  // group for individual
matrix[N, K] x;             // individual predictors
matrix[J,L] u;                 // group predictors
vector[N] y;                 // outcomes

vector[N] w;                      // treatment assigned
}

transformed data{

matrix[L,J] u_transp;
matrix[N, K] Q_ast;
matrix[K, K] R_ast;

// thin and scale the QR decomposition

Q_ast = qr_thin_Q(x) * sqrt(N - 1);
R_ast = qr_thin_R(x) / sqrt(N - 1);

// group predictors transposed

u_transp = u';
}

parameters {

matrix[K,J] z;               // this will be the basis for the beta's
// transformation
vector[J] z_2;

matrix[K,L] gamma;              // school predictors coefficients
vector<lower=0,upper=pi()/2>[K] tau_unif;  // heteroskedasticity component
// for the coefficients
real<lower=0,upper=pi()/2> sigma_alpha_unif;      // SE for the constant

vector[L] eta;

real effect;                      // super-population average treatment effect
real<lower=0,upper=pi()/2> sigma_t_unif;        // residual SD for the treated
real<lower=0,upper=pi()/2> sigma_c_unif;        // residual SD for the control

cholesky_factor_corr[K] L_Omega;  // corr matrix for the slopes

}

transformed parameters{

vector[J] alpha;                 // vector of school-specific intercepts
vector[J] c;                     // intercepts' expected value

// All the SDs have half-Cauchy priors, but we sample them from a Uniform and
// then take the tangent, as it is computationally more efficient: if X is
// Uniform(0, pi/2), then tan(X) is a standard half-Cauchy.

vector<lower=0>[K] tau = tan(tau_unif);
real<lower=0> sigma_alpha = tan(sigma_alpha_unif);
real<lower=0> sigma_t = tan(sigma_t_unif);
real<lower=0> sigma_c = tan(sigma_c_unif);

// School-specific slopes: since beta is a transformation of a
// multivariate normal, we do not need to give a prior to beta, but rather
// to z, to the variance component (tau), and to the correlation component
// of the Cholesky decomposition (L_Omega). On top of that, we use the
// QR-reparametrized theta = R_ast * beta

matrix[K,J] theta;            // vector of school-specific coefficients
matrix[K,K] L_Sigma;

// Cholesky factor of theta's variance-covariance matrix

L_Sigma = cholesky_decompose(R_ast * diag_pre_multiply(tau, L_Omega)
                             * diag_pre_multiply(tau, L_Omega)' * R_ast');

theta = R_ast * (gamma * u_transp) + L_Sigma * z;

// intercepts' expected value

c = u * eta;

// School specific intercepts

alpha = c + sigma_alpha * z_2;
}

model {

vector[N] m;
vector[N] d;
vector[N] uno;

tau_unif ~ uniform(0,pi()/2);
L_Omega ~ lkj_corr_cholesky(2);

to_vector(z) ~ std_normal();
to_vector(gamma) ~ std_normal();
z_2 ~ std_normal();

effect ~ normal(0,2);
sigma_c_unif ~ uniform(0,pi()/2);
sigma_t_unif ~ uniform(0,pi()/2);
sigma_alpha_unif ~ uniform(0,pi()/2);

// Here we model the likelihood of our outcome variable y

// Its expected value

for (n in 1:N) {
m[n] = alpha[g[n]] + Q_ast[n, ] * theta[, g[n]] + effect*w[n];
}

// Its variance

uno = rep_vector(1, N);
d = sigma_t*w + sigma_c*(uno-w);

y ~ normal(m, d);

}
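As an aside on the `tan(tau_unif)` trick in the transformed parameters block above: a quick pure-Python simulation (illustrative only, not Stan) of why the tangent of a Uniform(0, π/2) draw is a standard half-Cauchy draw — its CDF at x is (2/π)·arctan(x), so for instance the median should land near tan(π/4) = 1:

```python
import math
import random

# If U ~ Uniform(0, pi/2), then X = tan(U) has CDF
#   P(X <= x) = P(U <= atan(x)) = atan(x) / (pi / 2) = (2 / pi) * atan(x),
# which is exactly the standard half-Cauchy CDF.
random.seed(0)
draws = sorted(math.tan(random.uniform(0.0, math.pi / 2)) for _ in range(200_000))

# The half-Cauchy median is tan(pi/4) = 1
median = draws[len(draws) // 2]
assert abs(median - 1.0) < 0.05

# Empirical CDF matches (2/pi) * atan(x) at a few points
for x in (0.5, 1.0, 3.0):
    empirical = sum(d <= x for d in draws) / len(draws)
    assert abs(empirical - (2 / math.pi) * math.atan(x)) < 0.01
```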


As you can see I have individual level variables that I gather in a matrix X to which I apply the QR.

Then I give a Normal likelihood, "regression style". But now the coefficient of the regression is not \beta anymore but the QR-transformed \theta, hence the dilemma of how to apply Cholesky to \theta = R^* \beta.

I hope I was clear; otherwise let me know. By the way, the model gives reliable results, so I think the procedure is correct. (Interesting note: I get a slightly lower ESS with QR + Cholesky.)
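On the thin-QR scaling in the transformed data block: a small pure-Python sketch (classical Gram–Schmidt on a made-up 4×2 matrix; the helper `thin_qr` is mine, not Stan's) confirming that the sqrt(N−1) factors cancel, so `Q_ast * R_ast` still reconstructs `x`:

```python
import math

def thin_qr(X):
    """Classical Gram-Schmidt thin QR (fine for a small, well-conditioned matrix)."""
    n, k = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(k)]
    Q_cols, R = [], [[0.0] * k for _ in range(k)]
    for j in range(k):
        v = cols[j][:]
        for i, q in enumerate(Q_cols):
            R[i][j] = sum(q[t] * cols[j][t] for t in range(n))
            v = [v[t] - R[i][j] * q[t] for t in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q_cols.append([x / R[j][j] for x in v])
    return [[Q_cols[j][i] for j in range(k)] for i in range(n)], R

# Made-up data matrix with an intercept column (N = 4, K = 2)
X = [[1.0, 2.0], [1.0, 0.5], [1.0, -1.0], [1.0, 3.0]]
N = len(X)
Q, R = thin_qr(X)

# Scale exactly as in the transformed data block
s = math.sqrt(N - 1)
Q_ast = [[q * s for q in row] for row in Q]
R_ast = [[r / s for r in row] for row in R]

# The sqrt(N - 1) factors cancel: Q_ast @ R_ast still equals X
for i in range(N):
    for j in range(2):
        rec = sum(Q_ast[i][t] * R_ast[t][j] for t in range(2))
        assert abs(rec - X[i][j]) < 1e-9
```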

Also, does anybody know if one must partial out the constant before applying the QR?


I'm not an expert on the QR decomposition. The only thing I see worth trying is this:

You can change

  L_Sigma = cholesky_decompose(R_ast*diag_pre_multiply(tau, L_Omega)*diag_pre_multiply(tau, L_Omega)'*R_ast');


to

  L_Sigma = cholesky_decompose(quad_form(multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega)), R_ast'));
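To see that the two formulations feed the same matrix into `cholesky_decompose`, here is a pure-Python check with toy 2×2 numbers (the helpers are hand-rolled stand-ins for the Stan functions): `multiply_lower_tri_self_transpose(A)` is A·A' and `quad_form(S, R_ast')` is R_ast·S·R_ast':

```python
import math

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def cholesky(M):
    """Textbook Cholesky for a small SPD matrix."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(M[i][i] - s) if i == j else (M[i][j] - s) / L[j][j]
    return L

# Toy stand-ins for diag_pre_multiply(tau, L_Omega) and R_ast
A = [[1.2, 0.0], [0.4, 0.9]]   # lower-triangular scaled Cholesky factor
R = [[2.0, 0.5], [0.0, 1.5]]   # upper-triangular R_ast

# Original formulation: R_ast * (A * A') * R_ast'
M1 = matmul(matmul(R, matmul(A, transpose(A))), transpose(R))

# Suggested rewrite: quad_form(S, R_ast') = (R_ast')' * S * R_ast' = R_ast * S * R_ast',
# with S = multiply_lower_tri_self_transpose(A) = A * A'
S = matmul(A, transpose(A))
M2 = matmul(matmul(R, S), transpose(R))

for i in range(2):
    for j in range(2):
        assert abs(M1[i][j] - M2[i][j]) < 1e-12

# And cholesky_decompose recovers a valid factor: L * L' == M1
L = cholesky(M1)
LLt = matmul(L, transpose(L))
for i in range(2):
    for j in range(2):
        assert abs(LLt[i][j] - M1[i][j]) < 1e-9
```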
