Applying Cholesky factorization to a QR

Hi everyone, I am trying to apply a QR reparametrization to a Cholesky-factorized hierarchical model. The source I am using for both is the Stan user’s guide, specifically sections 1.2 and 1.13.

My Cholesky-factorized model is this:

\beta = \mu + \Sigma_l z

where \beta is a matrix of cluster-specific slopes, \mu is the matrix of expected values, \Sigma_l is the Cholesky factor of the covariance matrix, and z_{i,j} \sim \mathcal{N}(0,1). Therefore,

\beta \sim \mathcal{MN} (\mu, \Sigma_l \Sigma_l')

Now, by applying the QR reparametrization (the notation is the same as in the guide), I need to put a prior on the transformed parameter \theta = R^*\beta.

\theta = R^*\mu + R^*\Sigma_l z

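Writing out the implied distribution (a short derivation in the notation above):

\theta \sim \mathcal{MN}\left(R^*\mu,\; R^*\Sigma_l \Sigma_l' R^{*\prime}\right)

so a non-centered parametrization of \theta needs the Cholesky factor of R^*\Sigma_l \Sigma_l' R^{*\prime}; the product R^*\Sigma_l is generally not lower triangular, so it cannot be used as that factor directly.
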
My question is: what is the correct and most efficient way of programming this in Stan? What I did was the following (but I suspect there is a better way).

parameters {
  vector<lower=0, upper=pi()/2>[K] tau_unif;
  cholesky_factor_corr[K] L_Omega;
}

transformed parameters {
  vector<lower=0>[K] tau = tan(tau_unif);
  matrix[K, K] L_Sigma;
  matrix[K, J] theta;

  // Cholesky factor of theta's covariance, R_ast * Sigma_l * Sigma_l' * R_ast'
  L_Sigma = cholesky_decompose(R_ast * diag_pre_multiply(tau, L_Omega)
                               * diag_pre_multiply(tau, L_Omega)' * R_ast');

  theta = R_ast * mu + L_Sigma * z;
}
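
One untested alternative I can think of (a sketch only, using the same z, mu, and R_ast as above): since \theta = R^*\beta, one could build \beta's non-centered form first and multiply by R^*, which skips the extra cholesky_decompose at the cost of one matrix product:

transformed parameters {
  // untested sketch: theta = R_ast * beta, with beta in non-centered form;
  // tau_unif, L_Omega, z, mu, R_ast as in the snippet above
  vector<lower=0>[K] tau = tan(tau_unif);
  matrix[K, J] theta = R_ast * (mu + diag_pre_multiply(tau, L_Omega) * z);
}

The implied distribution of theta is the same either way; whether this is actually faster would need to be checked.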

One small question that also bothers me: do I need to pull the constant (a column of 1’s) out of the data matrix when I do the QR reparametrization? And if so, why?

Thanks again for all the help

I am also including the code of the simple Cholesky model, although it is basically the same as in the user’s guide:

transformed parameters {
  matrix[K, J] beta = mu + diag_pre_multiply(tau, L_Omega) * z;
}
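
For reference, the connection to the notation above: with \tau the vector of scales and \Omega = L_\Omega L_\Omega' the correlation matrix, diag_pre_multiply(tau, L_Omega) is exactly \Sigma_l, since

\operatorname{diag}(\tau) L_\Omega \left(\operatorname{diag}(\tau) L_\Omega\right)' = \operatorname{diag}(\tau)\, \Omega\, \operatorname{diag}(\tau) = \Sigma_l \Sigma_l'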

I am growing more confident in my approach; from a theoretical standpoint it seems correct. I am still wondering whether it is the most efficient, though.


I would also be interested in a comment on this from someone with deep knowledge of the math library and of how the underlying Eigen matrices are represented.


@bbbales2 @spinkney


I was actually musing just this past weekend on whether a QR representation of mid-hierarchy structures like this would make sense. Certainly I’ve only ever seen the QR transform applied to predictor matrices that are data variables.

But it seems here you are looking to QR-transform the Cholesky matrix, which is different from what I was thinking. I thought that the Cholesky representation of the covariance matrix was itself guaranteed to have orthogonal structure (uncorrelated/low-correlated columns), but maybe that’s not guaranteed to be the case?
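
(For a concrete check of that doubt, the 2 \times 2 case already shows the columns need not be orthogonal: a correlation matrix with correlation \rho has Cholesky factor

L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}

whose columns have inner product \rho \sqrt{1-\rho^2} \neq 0 whenever 0 < |\rho| < 1.)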

No wait, I am afraid I was not clear: I am using QR on the data matrix and applying the Cholesky factorization to the coefficients of my likelihood. I am posting the entire model to be as clear as possible.

This is a sketch of the model (I am sorry I omitted the likelihood before; I should have been clearer): I am applying QR to X and the Cholesky factorization to the variance-covariance matrix of \beta. But because of the QR, I need to put a prior on \theta = R^*\beta, hence my question about applying the Cholesky factorization to \theta.

y \sim \mathcal{N}(X \beta, v) \\
\beta \sim \mathcal{MN}(\mu, \Sigma_l \Sigma_l')

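To spell out where \theta enters (same notation as the guide, with X = Q^* R^* the thin, scaled QR decomposition):

X\beta = Q^* R^* \beta = Q^* \theta

so the likelihood is written in terms of Q^* and \theta, and the hierarchical prior \beta \sim \mathcal{MN}(\mu, \Sigma_l \Sigma_l') has to be translated into a prior on \theta = R^*\beta.
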
Here’s the code:

data {
  int<lower=0> N;              // num individuals
  int<lower=1> K;              // num ind predictors
  int<lower=1> J;              // num groups
  int<lower=1> L;              // num group predictors
  int<lower=1,upper=J> g[N];  // group for individual
  matrix[N, K] x;             // individual predictors
  matrix[J,L] u;                 // group predictors
  vector[N] y;                 // outcomes
  
  vector[N] w;                      // treatment assigned
}

transformed data{
  
  matrix[L,J] u_transp; 
  matrix[N, K] Q_ast;
  matrix[K, K] R_ast;
  
  // thin and scale the QR decomposition
  
  Q_ast = qr_thin_Q(x) * sqrt(N - 1);
  R_ast = qr_thin_R(x) / sqrt(N - 1);
  
  // group predictors transposed
  
  u_transp = u';
}

parameters {
  
  matrix[K,J] z;               // this will be the basis for the beta's
                              // transformation
  vector[J] z_2;
  
  matrix[K,L] gamma;              // school predictors coefficients
  vector<lower=0,upper=pi()/2>[K] tau_unif;  // heteroskedasticity component
                                             // for the coefficients
  real<lower=0,upper=pi()/2> sigma_alpha_unif;      // SE for the constant
  
  vector[L] eta;
  
  real effect;                      // super-population average treatment effect
  real<lower=0,upper=pi()/2> sigma_t_unif;        // residual SD for the treated
  real<lower=0,upper=pi()/2> sigma_c_unif;        // residual SD for the control
  
  cholesky_factor_corr[K] L_Omega;  // corr matrix for the slopes
 
}

transformed parameters{
   
   vector[J] alpha;                 // vector of school-specific intercepts
   vector[J] c;                     // intercepts' expected value

  // All the scale parameters have half-Cauchy priors, but we sample them on a
  // Uniform(0, pi/2) scale and then take the tangent, as it is computationally
  // more efficient: if X ~ Uniform(0, pi/2), then tan(X) is a standard half-Cauchy
  
   vector<lower=0>[K] tau = tan(tau_unif);
   real<lower=0> sigma_alpha = tan(sigma_alpha_unif);
   real<lower=0> sigma_t = tan(sigma_t_unif);
   real<lower=0> sigma_c = tan(sigma_c_unif);
   
  // School-specific slopes: since beta is a transformation of a
  // multivariate normal, we do not need to give a prior to beta directly, but
  // rather to z, to the variance component (tau), and to the correlation
  // component of the Cholesky decomposition (L_Omega). On top of that, we use
  // the QR-reparametrized theta = R_ast * beta

   matrix[K,J] theta;            // matrix of school-specific coefficients
   matrix[K,K] L_Sigma;          // Cholesky factor of theta's covariance

  // theta's variance-covariance matrix is R_ast * Sigma * Sigma' * R_ast'

  L_Sigma = cholesky_decompose(R_ast * diag_pre_multiply(tau, L_Omega)
                               * diag_pre_multiply(tau, L_Omega)' * R_ast');

  theta = R_ast * (gamma * u_transp) + L_Sigma * z;
  
  // intercepts' expected value
  
  for (j in 1:J) {
    c[j] = eta' * u_transp[ ,j];
  }
  
  // School specific intercepts
  
  alpha = c + sigma_alpha * z_2;
}

model {
  
 
  vector[N] m;
  vector[N] d;
  vector[N] uno;
  
  
  tau_unif ~ uniform(0,pi()/2);
  L_Omega ~ lkj_corr_cholesky(2);
  
  to_vector(z) ~ std_normal();
  to_vector(gamma) ~ std_normal();
  z_2 ~ std_normal();
  
  
  
  effect ~ normal(0,2);
  sigma_c_unif ~ uniform(0,pi()/2);          
  sigma_t_unif ~ uniform(0,pi()/2);
  sigma_alpha_unif ~ uniform(0,pi()/2);
  
  // Here we model the likelihood of our outcome variable y
  
    // Its expected value
    
  for (n in 1:N) {
    m[n] = alpha[g[n]] + Q_ast[n, ] * theta[, g[n]] + effect*w[n];
  }
  
    // Its variance
    
  uno = rep_vector(1, N);
  d = sigma_t*w + sigma_c*(uno-w);
  
  y ~ normal(m, d);

}

As you can see, I have individual-level variables that I gather in a matrix X, to which I apply the QR decomposition.

Then I specify a normal likelihood, regression style. But now the coefficient of the regression is no longer \beta but the QR-transformed \theta, hence the dilemma of how to apply the Cholesky factorization to \theta = R^*\beta.

I hope I was clear; otherwise let me know. By the way, the model gives reliable results, so I think the procedure is correct. (Interesting note: I get slightly lower ESS with QR + Cholesky.)
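
One thing the posted model omits (a sketch following the user’s guide’s QR section; R_ast_inverse and the generated quantities block are not in my code above): if the slopes on the original scale of X are needed, they can be recovered after sampling as

transformed data {
  matrix[K, K] R_ast_inverse = inverse(R_ast);  // alongside Q_ast and R_ast
}
generated quantities {
  matrix[K, J] beta = R_ast_inverse * theta;    // slopes on the original scale of x
}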

Also, does anybody know whether one must partial out the constant before applying the QR decomposition?


I’m not an expert on the QR decomposition. The only thing I can suggest is testing this:

You can change

  L_Sigma = cholesky_decompose(R_ast*diag_pre_multiply(tau, L_Omega)*diag_pre_multiply(tau, L_Omega)'*R_ast');

to

  L_Sigma = cholesky_decompose(quad_form(multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L_Omega)), R_ast'));
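
To spell out why the two are equivalent: multiply_lower_tri_self_transpose(L) computes L L' and quad_form(A, B) computes B' A B, so with L = diag_pre_multiply(tau, L_Omega) and B = R_ast' the argument of cholesky_decompose is

R^* \left(L L'\right) R^{*\prime}

exactly the matrix in the original line; the specialized functions should just let Stan use more efficient built-in implementations for these symmetric products.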