Sum of Cholesky factors vs. sum of covariance matrices

Stan beginner here, trying to understand why two models attempting to solve the same problem give different results. Thanks to everyone for their patience and generosity of time!

The problem: Suppose I have two independent, bivariate-normal distributed random variables whose means and covariance matrices I want to estimate. I have observations of just the first (base) variable, and I have observations of the first (base) + second (condition) variable (sum), but I do not have any observations of the second variable alone.
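
Since the two variables are independent, their covariance matrices simply add, which is the identity both models below try to exploit:

\Sigma_{sum} = \Sigma_{base} + \Sigma_{cond}, \qquad \text{so} \qquad \Sigma_{cond} = \Sigma_{sum} - \Sigma_{base}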

The model that works estimates the covariance of the first (base) variable and the covariance of the sum of the variables separately, then computes the covariance of the unobserved second (cond) variable as a generated quantity. The full code is at the bottom of this post; here is the relevant portion (I’m unsure whether I need a Jacobian adjustment in the generated quantities):

model {
  ...
  for (i in 1:N) {
    if (cond[i]) // y is observation of base + cond variables
      y[i] ~ multi_normal_cholesky(beta_sum, L_Sigma_sum);
    else // y is observation of base variable only
      y[i] ~ multi_normal_cholesky(beta_base, L_Sigma_base);
  }
}
generated quantities {
  // Convert cholesky factors back to correlation and variance-covariance.
  matrix[J, J] Omega_base = multiply_lower_tri_self_transpose(L_corr_base);
  matrix[J, J] Omega_sum = multiply_lower_tri_self_transpose(L_corr_sum);
  matrix[J, J] Sigma_base = quad_form_diag(Omega_base, sigma_base);
  matrix[J, J] Sigma_sum = quad_form_diag(Omega_sum, sigma_sum);

  // Infer correlation matrix of unobserved, condition-only variable.
  matrix[J, J] Sigma_cond = Sigma_sum - Sigma_base;
  matrix[J, J] Omega_cond = cov2cor(Sigma_cond);
}
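
Note that cov2cor is not a built-in Stan function; the full model attached below presumably defines a small helper along these lines. This is only a minimal sketch and the actual definition may differ:

functions {
  // Rescale a covariance matrix by its inverse standard deviations on
  // both sides to get the corresponding correlation matrix.
  matrix cov2cor(matrix Sigma) {
    vector[rows(Sigma)] inv_sd = inv_sqrt(diagonal(Sigma));
    return quad_form_diag(Sigma, inv_sd);
  }
}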

I thought, this is Stan and we are Bayesians, so why not try to model the covariance of the second variable directly? The model below, however, underestimates the covariance/correlation of the second variable, yielding 0.06 instead of 0.20 (the simulated value). I’m happy the first model (above) works, but it bothers me that the one below does not. It makes me think I don’t understand something important about Stan, or maybe I’m just bad at thinking in terms of the Cholesky parameterization of covariance.

model {
  ...
  for (i in 1:N) {
    if (cond[i])
      // Is this wrong?
      y[i] ~ multi_normal_cholesky(beta + beta_cond, L_Sigma + L_Sigma_cond);
    else
      y[i] ~ multi_normal_cholesky(beta, L_Sigma);
  }
}

First model (2.4 KB)
Second model (1.3 KB)
R script to generate data (972 Bytes)

Have you tried using multi_normal instead of the Cholesky version? The sum of Cholesky factors is not the Cholesky factor of the sum, e.g. \sqrt{2} + \sqrt{2} \ne \sqrt{4}.
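
Spelled out in matrix form (the notation here is just illustrative), passing L_Sigma + L_Sigma_cond to multi_normal_cholesky implies the covariance

(L_1 + L_2)(L_1 + L_2)^T = L_1L_1^T + L_1L_2^T + L_2L_1^T + L_2L_2^T

which contains the extra cross terms L_1L_2^T + L_2L_1^T rather than the intended \Sigma_1 + \Sigma_2 = L_1L_1^T + L_2L_2^T.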

No. A Jacobian adjustment is only needed when you apply a sampling statement to a nonlinear transformation of a parameter, e.g. log(x) ~ normal(0, 1); generated quantities don’t affect the posterior. I think Stan will warn you if it sees something that might require a Jacobian adjustment.

Thanks @RJTK, that was exactly my problem. I didn’t even stop to think that the Cholesky factor of the covariance matrix is basically the square root of the variance-covariance matrix! Parameterizing on the covariance matrices directly fixes the problem, although you then lose the efficiency of the Cholesky parameterization:

parameters {
  corr_matrix[J] Omega;
  corr_matrix[J] Omega_cond;
  vector<lower=0>[J] sigma;
  vector<lower=0>[J] sigma_cond;
  ...
}
model {
  sigma ~ cauchy(0, 5);
  sigma_cond ~ cauchy(0, 5);
  Omega ~ lkj_corr(1);
  Omega_cond ~ lkj_corr(1);
  ...
  for (i in 1:N) {
      y[i] ~ multi_normal(beta + cond[i] * beta_cond, Sigma + cond[i] * Sigma_cond);
  }
}

It is possible to continue using the Cholesky parameterization with the code below, but you lose most of the efficiency gain. I am not sure if there is a more efficient way to sum the Cholesky factors…

functions {
  matrix sum_chol_factors(matrix chol1, matrix chol2) {
    // Re-factoring the covariance matrix doesn't seem efficient...
    return cholesky_decompose(chol1 * chol1' + chol2 * chol2');
  }
}
...
model {
  ...
  for (i in 1:N) {
    if (cond[i])
      y[i] ~ multi_normal_cholesky(
        beta + beta_cond,
        sum_chol_factors(L_Sigma, L_Sigma_cond));
    else
      y[i] ~ multi_normal_cholesky(beta, L_Sigma);
  } 
}

A small adjustment, calling tcrossprod, is marginally faster. A much faster implementation is to pull y apart into y1 and y2 based on the condition and then call the vectorized version of multi_normal_cholesky. Also, I don’t think you want to recompute the decomposition for each i in y:

functions {
  matrix sum_chol_factors(matrix chol1, matrix chol2) {
    // Re-factoring the covariance matrix doesn't seem efficient...
    return cholesky_decompose(tcrossprod(chol1) + tcrossprod(chol2));
  }
}
...
model {
  ...
  matrix[J, J] L_total_Sigma = sum_chol_factors(L_Sigma, L_Sigma_cond);
  y1 ~ multi_normal_cholesky(beta + beta_cond, L_total_Sigma);
  y2 ~ multi_normal_cholesky(beta, L_Sigma);
}
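
For the splitting itself, one option is the transformed data block, e.g. something like this (just a sketch using the newer array syntax; it assumes cond is a 0/1 integer array and y an array of J-vectors, and the index bookkeeping is mine):

transformed data {
  int N1 = sum(cond);   // number of base + cond observations
  int N2 = N - N1;      // number of base-only observations
  array[N1] vector[J] y1;
  array[N2] vector[J] y2;
  int i1 = 1;
  int i2 = 1;
  // Copy each observation into the array for its condition.
  for (i in 1:N) {
    if (cond[i]) {
      y1[i1] = y[i];
      i1 += 1;
    } else {
      y2[i2] = y[i];
      i2 += 1;
    }
  }
}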

Thanks for the advice about the cross product. Can I safely move the computation of the summed Cholesky factor out of the loop? Reasoning about this as if it were C code, that would certainly be possible, but I’m still getting used to Stan’s declarative style…

EDIT: To clarify, obviously if I have only two conditions I can split y into y1 and y2 and vectorize, which lifts the call to cholesky_decompose out of the loop and makes things much more efficient. However, I still have to call cholesky_decompose at least once per iteration. I’m wondering whether it isn’t actually more efficient to just skip the Cholesky parameterization and model the variance-covariance matrix with multi_normal directly.

Great. I’ve not tried it, but you might consider the following:

\Sigma = \begin{bmatrix} \Sigma_x & \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_y \end{bmatrix} = \begin{bmatrix} L_x & 0\\ L_{xy} & L_y \end{bmatrix} \begin{bmatrix} L_x^T & L_{xy}^T\\ 0 & L_y^T \end{bmatrix} = \begin{bmatrix} L_xL_x^T & L_xL_{xy}^T\\ L_{xy}L_x^T & L_{xy}L_{xy}^T + L_yL_y^T \end{bmatrix}

Which means that

\Sigma_x + \Sigma_y = L_xL_x^T+ L_{xy}L_{xy}^T + L_yL_y^T

So a \mathcal{N}(0, \Sigma_x + \Sigma_y) distribution would result from the sum N_x + N_y + N_{xy}, where N_x ~ multi_normal_cholesky(0, L_x), etc.

Now that I’ve written this out, I’m not actually sure it can help you: the likelihood of the sum N_x + N_y + N_{xy} isn’t available from this decomposition.

I think L_total_Sigma belongs in the transformed parameters block.
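
i.e. something along these lines (a sketch reusing the names from the post above):

transformed parameters {
  // Computed once per log-density evaluation and saved with the draws.
  matrix[J, J] L_total_Sigma = sum_chol_factors(L_Sigma, L_Sigma_cond);
}
model {
  ...
  y1 ~ multi_normal_cholesky(beta + beta_cond, L_total_Sigma);
  y2 ~ multi_normal_cholesky(beta, L_Sigma);
}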

I’m not sure I understand; it looks like you have just one covariance matrix of size J x J. Or does your attached code have an array of N covariance matrices of size J x J? I can see that if you’re multiplying by some scalar this might be an issue.

If you just have one matrix then it works. Otherwise, yes, you’ll have to loop it.

@RJTK, that is some sweet math, but as you point out there is no likelihood readily available.

@spinkney, I apologize, I’m being kind of vague with my questions. In general I’m curious about expanding the simple model from this thread to more complicated cases, such as modeling differences in covariance related to a continuous variable (instead of a binary group variable, as in this case).

The Stan user guide introduces the Cholesky factor parameterization as a speed/efficiency consideration, since multi_normal and lkj_corr need to factor the variance-covariance matrix internally. There is also a mention of numerical stability, without much justification given. There is some discussion of this on this forum.

However, in my case it looks like I need to factor the variance-covariance matrix at least once per iteration anyway, plus call tcrossprod. If I skip the re-parameterization and just use the variance-covariance matrix directly, I will end up factoring the matrix at least twice per iteration (once for the prior and once for the likelihood), but I will skip the calls to tcrossprod, avoid the generated quantities section, and make my model somewhat more readable.

I guess I’m kind of arguing with myself here, but I’m wondering whether there are any expert opinions about the merits of using the Cholesky factor parameterization for this sort of model, where it is impossible to avoid factoring altogether.