LKJ inverse transform for correlation matrices (incorrect documentation?)

Hello,

I think there is an inconsistency/error in the Stan reference manual for the LKJ inverse transform for correlation matrices.

The manual gives 2 formulations for the transform. I refer to them in the order in which they appear in the manual.

Let’s first focus on w_{2,2}. The first formulation gives us:

w_{2,2} = (1 - z_{1, 2}^2)^{1/2}

The second formulation gives us:

w_{1,2} = z_{1, 2}
w_{2,2} = z_{2, 2} * w_{1,2} * (1 - z_{1, 2}^2)^{1/2}
w_{2,2} = z_{2, 2} * z_{1,2} * (1 - z_{1, 2}^2)^{1/2}

In the upper triangular matrix shown above these expressions in the manual, z_{2, 2} = 0. Therefore w_{2,2} = 0, which is incorrect because the diagonal terms need to be > 0. In addition, the two formulations are not consistent, as z_{1,2} is not necessarily 1.
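
For concreteness (an arbitrary value of my choosing), suppose z_{1, 2} = 0.5. The first formulation gives w_{2,2} = (1 - 0.25)^{1/2} ≈ 0.866, while the second (with z_{2, 2} = 0) gives w_{2,2} = 0, so the two cannot both be right.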

Now let’s focus on w_{2,3}. The first formulation gives us:

w_{2,3} = z_{2, 3} * (1 - z_{1, 3}^2)^{1/2}

The second formulation gives us:

w_{1,3} = z_{1, 3}
w_{2,3} = z_{2, 3} * w_{1,3} * (1 - z_{1, 3}^2)^{1/2}
w_{2,3} = z_{2, 3} * z_{1, 3} * (1 - z_{1, 3}^2)^{1/2}

These two expressions are inconsistent as z_{1, 3} is not necessarily 1.
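
Again with arbitrary values, suppose z_{1, 3} = 0.5 and z_{2, 3} = 0.4. The first formulation gives w_{2,3} = 0.4 * (0.75)^{1/2} ≈ 0.346, while the second gives w_{2,3} = 0.4 * 0.5 * (0.75)^{1/2} ≈ 0.173.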

Please let me know if I have misunderstood the documentation and if not, which is the correct formulation.

Anirban

Here is the link to the relevant section: https://mc-stan.org/docs/2_25/reference-manual/correlation-matrix-transform-section.html#correlation-matrix-transform.section

To clarify, is this the difference between the equations starting after the line “In Stan, the LKJ transform” and the equation starting after the line “This does not require as much computation”?

Yeah I agree the second one looks off because z_{i, i} = 0 and that will make w_{i, i} = 0 everywhere but w_{1, 1} (which isn’t obviously true in the version above). (Also you can surround things by dollar signs to get the nice rendering).

@bgoodri can you look at this?

Yes, that is correct, those are the equations I am referring to. (In future posts, I will use the dollar signs to get the math rendering; thanks!).

I think the position of the square roots without parentheses is misleading, even though it might be correct under whatever operator precedence computer scientists use. The code is pretty short

but the documentation manages to not correspond to it.

Disclaimer: I am unfamiliar with the code base and this general subject area. I also use the Armadillo libraries rather than Eigen, and I code in R rather than C++. Therefore, I may be wrong in my understanding of the code and/or the math.

This issue is discussed in chapter 3 of the Vine Copula handbook (“Vines Arise”; Roger M. Cooke, Harry Joe and Kjersti Aas; available from Roger Cooke’s website): http://rogermcooke.net/rogermcooke_files/Vines%20Arise%20Handbook%20VCM.pdf. The bottom of page 15 provides an equation for the lower Cholesky in terms of the partial correlations. Below I offer some observations and suggestions based on the equation in the chapter.

  1. The Stan code computes the lower Cholesky while the Stan documentation describes the upper Cholesky. In particular, the code computes j < i while the documentation describes i < j. Either is fine but that mismatch may be where some of the confusion is stemming from.

  2. In the code, the diagonal term is computed separately from the terms to the left of the diagonal (please see line 40, after the loop). I think that line of code is almost correct:

    x.coeffRef(i, i) = sqrt(1.0 - sum_sqs);
    

Please note that this line of code is inconsistent with both equations in the Stan documentation. It is almost consistent with the equation in the chapter referenced above; I think the code should not have the sqrt.

  3. I am unsure why the code computes and uses the sum of squares inside the loop. I think that is incorrect. The sum-of-squares term is needed for the diagonal term but not for the off-diagonal terms that are computed inside the loop. I believe the term that is needed inside the loop is \prod_{k=1}^{j-1}\sqrt{1 - z_{i,k}^2}. Here is my implementation:

    x.coeffRef(i, 0) = z.coeff(k);  // first column is the partial correlation itself
    T_scalar prod_sqs = sqrt(1 - square(z.coeff(k++)));
    T_scalar sum_sqs = square(x.coeff(i, 0));
    for (int j = 1; j < i; ++j) {
        x.coeffRef(i, j) = z.coeff(k) * prod_sqs;
        prod_sqs = prod_sqs * sqrt(1 - square(z.coeff(k++)));  // running product of sqrt(1 - z^2)
        sum_sqs += square(x.coeff(i, j));  // accumulate the squares of the cells filled so far
    }
    x.coeffRef(i, i) = 1 - sum_sqs;
    

prod_sqs tracks the product of \sqrt{1 - z_{i,k}^2} over the partial correlations (not the Cholesky terms) in each row, up to one column to the left of the current term. sum_sqs tracks the sum of the squares of the Cholesky terms in each row, up to one column to the left of the current term. The diagonal is 1 - sum_sqs because the sum of the squares of all terms in each row must add to 1. Any term to the right of the diagonal is 0, as this is a lower triangular matrix. By construction sum_sqs < 1, which implies (1 - sum_sqs) > 0. Hence the diagonal terms are > 0. Therefore, the constructed matrix satisfies the conditions required to be the lower Cholesky of a correlation matrix.

I tried to be careful. I do apologise for the noise if I am incorrect in any regard.

No need to apologize. Critique on this stuff is super welcome. There are inevitably mistakes and we wanna fix ’em.

That said, this seems complicated so I’ll have to come back to it in the morning lol.

I think the code:

x.coeffRef(i, i) = sqrt(1.0 - sum_sqs)

is correct and there is an error in the Vines book too lol, though I don’t think the decomposition in the Vines book is what is happening here. I don’t think we’re actually doing a matrix decomposition; we’re doing a mapping from an unconstrained space z to correlation matrices. The equations/code here kinda look like a Cholesky factorization or something, but they’re not; we’re just building a correlation matrix from an unconstrained space.


So the sqrt(1 - sum_sqs) thing should be right because for a correlation matrix C, we know C_{ii} = 1.0. This means if we have a factorization C = L L^T, the ith row of L dotted with itself is 1.0.

So for any row we have \sum_j L_{ij}^2 = 1.0, and if we know all the L_{ij} other than L_{ii} we can use this eq. to solve for it and get L_{ii} = \sqrt{1.0 - \sum_{j \neq i} L_{ij}^2}.
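
For what it's worth, here is a minimal non-Stan sketch with Eigen (the off-diagonal values are arbitrary) checking that filling each diagonal entry with \sqrt{1 - \sum_{j < i} L_{ij}^2} makes every row of L unit length, so the diagonal of L L^T comes out as 1:

    #include <Eigen/Dense>
    #include <cmath>
    #include <iostream>

    int main() {
      Eigen::Matrix3d L = Eigen::Matrix3d::Zero();
      // arbitrary strictly-lower-triangular entries with squared row sums < 1
      L(1, 0) = 0.3;
      L(2, 0) = -0.5;
      L(2, 1) = 0.2;
      L(0, 0) = 1.0;
      for (int i = 1; i < 3; ++i) {
        double sum_sqs = L.row(i).head(i).squaredNorm();  // squares of the entries left of the diagonal
        L(i, i) = std::sqrt(1.0 - sum_sqs);
      }
      Eigen::Matrix3d Sigma = L * L.transpose();
      std::cout << Sigma.diagonal().transpose() << std::endl;  // prints 1 1 1
    }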


But yeah, good find. The docs need to be updated to reflect the code.

I see your point on the diagonal terms and agree: for the resulting correlation matrix to have 1 on the diagonal, \sum_jL_{ij}^2=1.0, \forall i must hold in the lower Cholesky.

My broader concern is the following. In order to compute the LKJ density of the correlation matrix from the partial correlations, the Cholesky of the correlation matrix has to be built up from the partial correlations. I believe there is only one such bijection, though how one computes or constructs it may vary.

The current code builds a Cholesky factor from variables that are range-constrained (using tanh) to lie between -1 and 1. For those variables to be meaningful as partial correlations, the bijection in the code has to be the bijection described in LKJ (2009), as only that bijection maps the partial correlations to the correlation matrix. Other bijections may exist from other range-constrained variables to a correlation matrix, but those variables would not be partial correlations. In such cases the determinant of the resulting correlation matrix may not be the same function of the range-constrained variables as the expression in terms of the partial correlations given in Theorem 1 of LKJ (2009), which is used to compute the LKJ density.

With that in mind, I think it would be very useful if the documentation were to provide some more guidance on how the code relates to the construction in LKJ (2009). The original paper (and the related literature) is quite general and hence it is difficult (for me) to follow how the code implements the construction in the paper for the off-diagonal terms (I follow for the diagonal terms).

My interest stems in part from curiosity but also because I am working on a non-Stan implementation and I would like to implement as close to Stan as possible given Stan is a (de facto) benchmark implementation in Bayesian statistics. That’s what brought me to the Stan documentation.

I’m not sure these -1 to 1 variables are correlations. @bgoodri, are they? I spent some time this morning trying to build a correlation matrix, decompose it, and get the result to match the constructed lower triangular matrix, but I wasn’t able to. That’s what made me think we’re not really doing a decomposition to build the lower triangular matrix here; we’re building it directly from some other weird unconstrained space. I could easily be wrong.

They are partial correlations (on a canonical vine). The relationship between these partial correlations (\Omega_{ij}) and the off-diagonal elements of a correlation matrix (\Sigma_{ij}) is shown in

But we map the partial correlations to \mathbf{L} rather than \boldsymbol{\Sigma} = \mathbf{L} \mathbf{L}^\top.

Oh a partial correlation is different than a correlation. Huh. Do you mind taking a swing at fixing the docs here?

The following is my implementation based on my understanding of Stan’s workings.

Step 1: Using tanh, map a vector of reals (the sampled variables in NUTS) to a vector of partial correlations.

Step 2: Using a bijection/transformation, map the vector of partial correlations to the lower triangular Cholesky of a correlation matrix.

Step 3: Using matrix multiplication, compute the correlation matrix from the lower triangular Cholesky.

Step 4: The LKJ density of the correlation matrix is a function of the determinant of the correlation matrix. This determinant can be expressed in terms of the partial correlations without computing the determinant of the correlation matrix itself. Using this fact, we can compute the LKJ density from the partial correlations. This is incredibly useful, as the transformation in Step 2 is not easily differentiable but the expression in terms of the partial correlations is.

Step 5: Use the Jacobian of Step 1 to compute the density of the sampled variables in NUTS from the density of the resulting correlation matrix (computed using partial correlations).
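
To make the uncontroversial bookkeeping in Steps 1, 4 and 5 concrete (my notation, not a quote from the manual): if y is an unconstrained real and z = \tanh(y) is the corresponding partial correlation from Step 1, then dz/dy = 1 - \tanh(y)^2 = 1 - z^2, so the log-Jacobian contribution in Step 5 is \sum_{i<j}\log(1 - z_{ij}^2). Likewise, Theorem 1 of LKJ (2009) gives the determinant needed in Step 4 directly as \left|\boldsymbol{\Sigma}\right| = \prod_{i<j}(1 - z_{ij}^2).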

This works beautifully (and is incredibly elegant!) as long as the bijection in Step 2 is the bijection from the partial correlations to the lower Cholesky. Any other bijection may yield an admissible lower Cholesky of a correlation matrix but will lead to an inaccurate computation of the LKJ density, as the estimated transformed parameters will not be the correct partial correlations. We know this has to be true because there is a 1-to-1 correspondence between the partial correlations and the correlation matrix. Therefore, there can be only one transformation from the partial correlations to the correlation matrix, and any other transformation will yield estimates of a different set of range-constrained variables.

Steps 1, 3, 4, and 5 are clear; Step 2 is where I am unclear. I have written to Roger Cooke, who wrote the handbook, for his suggestions, and I look forward to reading the updated Stan documentation whenever it is available (no pressure; I know the Stan community and developers have a lot on their plate).

Thanks much!

That is all correct and basically the motivation for why we have promoted using the Cholesky factor of a correlation matrix rather than the correlation matrix itself (also things are more numerically stable).

If \boldsymbol{\Sigma} =\mathbf{L} \mathbf{L}^\top is a correlation matrix and \mathbf{L} is its lower-triangular Cholesky factor, then as @bbbales2 mentioned, its rows must have unit length in order for the diagonal elements of \boldsymbol{\Sigma} to be 1. Since the i-th row of \mathbf{L} has unit-length, its elements can be expressed in unit-hyperspherical coordinates (i.e. in terms of \theta_{ij} \in \left(0,\pi\right)) as
\begin{eqnarray*} L\left(\boldsymbol{\theta}\right)_{ij} & = & \begin{cases} 1 & 1=i=j\\ \prod_{k=1}^{j-1}\sin\left(\theta_{kj}\right) & 1<i=j\\ \cos\left(\theta_{i1}\right) & 1=i<j\\ \cos\left(\theta_{ij}\right)\prod_{k=1}^{j-1}\sin\left(\theta_{kj}\right) & 1<i<j\\ 0 & i> j \end{cases} \end{eqnarray*}
Let \omega_{ij} = \cos\left(\theta_{ij}\right) be a partial correlation on a canonical vine, then 0 < \sin\left(\theta_{ij}\right) = \sqrt{1 - \cos\left(\theta_{ij}\right)^2} = \sqrt{1 - \omega_{ij}^2}. Then, the elements of the Cholesky factor in terms of partial correlations on a canonical vine become
\begin{eqnarray*} L\left(\boldsymbol{\omega}\right)_{ij} & = & \begin{cases} 1 & 1=i=j\\ \prod_{k=1}^{j-1}\sqrt{1 - \omega_{kj}^2} & 1<i=j\\ \omega_{i1} & 1=i<j\\ \omega_{ij}\prod_{k=1}^{j-1}\sqrt{1 - \omega_{kj}^2} & 1<i<j\\ 0 & i > j \end{cases} \end{eqnarray*}

That is what the code at


does and the documentation butchers. It sort of remains to be shown that \omega_{ij} “is” a partial correlation on a canonical vine, but you can get a hint that is true from the relation in the LKJ paper that \left|\boldsymbol{\Sigma}\right| = \prod_{i = 1}^{K - 1} \prod_{j = i + 1}^K \left(1 - \omega_{ij}^2\right), which is also equal to 1 \times \prod_{j = 2}^K L\left(\boldsymbol{\omega}\right)_{jj}^2 = 1 \times \prod_{j = 2}^K \prod_{k=1}^{j-1}\left(1 - \omega_{kj}^2\right) since all {K \choose 2} elements of \boldsymbol{\omega} appear exactly once in both expressions.

Thanks so much for the detailed response. A few observations:

  1. L(w)_{ij} as you describe it is an upper triangular matrix and not a lower triangular matrix. i > j are the terms below the diagonal. i < j are the terms above the diagonal. In your equation, i > j cases have value = 0.

In my observations below, I use U to make clear that I am working from your equations but that the matrix is upper triangular. I also switch rows for columns when using your equations as written, and I transpose the order of i and j when discussing the lower triangular matrix.

  2. I think you might need to look at the 1 < i = j case again. For example, let’s focus on the third column of the upper Cholesky. We need: U(w)_{13}^2 + U(w)_{23}^2 + U(w)_{33}^2 = 1 in order for the (3,3) term of the correlation matrix to be 1. Working from your equations, this would imply: w_{13}^2 + w_{23}^2(1-w_{13}^2) + (1-w_{13}^2)(1-w_{23}^2) = 1, which will not hold generally when we estimate the partial correlations. I think this term should be: U(w)_{33} = \sqrt{1-U(w)_{13}^2-U(w)_{23}^2}, which guarantees that the sum of the squared terms on the third column is equal to 1. The transpose of this is in the code.

  3. As far as I can tell the code does not implement your equations. Let’s focus on the fourth row. We need to transpose your notation to the case of the lower triangular matrix. The code gives us:

L(w)_{41}=w_{41}, which is the transpose of what you have.
L(w)_{42}=w_{42}\sqrt{1-L(w)_{41}^2}=w_{42}\sqrt{1-w_{41}^2}, which is the transpose of what you have.
L(w)_{43}=w_{43}\sqrt{1-(L(w)_{41}^2+L(w)_{42}^2)}, which is not the transpose of what you have.

Note that sum_sqs is tracking the sum of the squares of the cells to the left of each cell as the matrix is filled in by row. After filling in a cell, the code adds the square of that cell to sum_sqs. For each cell from the second column up to the column just before the diagonal, the code assigns partial correlation * sqrt(1 - sum_sqs) to the cell. The diagonal cell is sqrt(1 - sum_sqs).
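
Translating that description literally into a small standalone sketch (my own code, not the actual Stan source; the function name build_cholesky and the test values are made up), and checking it against the determinant relation @bgoodri quoted from the LKJ paper:

    #include <Eigen/Dense>
    #include <cmath>
    #include <iostream>

    // Row-wise recursion as described above: first column is the partial correlation,
    // later columns are z * sqrt(1 - sum_sqs), and the diagonal is sqrt(1 - sum_sqs).
    Eigen::MatrixXd build_cholesky(const Eigen::VectorXd& z, int K) {
      Eigen::MatrixXd L = Eigen::MatrixXd::Zero(K, K);
      L(0, 0) = 1.0;
      int k = 0;
      for (int i = 1; i < K; ++i) {
        L(i, 0) = z(k++);
        double sum_sqs = L(i, 0) * L(i, 0);
        for (int j = 1; j < i; ++j) {
          L(i, j) = z(k++) * std::sqrt(1.0 - sum_sqs);
          sum_sqs += L(i, j) * L(i, j);
        }
        L(i, i) = std::sqrt(1.0 - sum_sqs);  // forces the row to have unit length
      }
      return L;
    }

    int main() {
      Eigen::VectorXd z(3);
      z << 0.3, -0.5, 0.2;  // arbitrary partial correlations in (-1, 1)
      Eigen::MatrixXd L = build_cholesky(z, 3);
      Eigen::MatrixXd Sigma = L * L.transpose();
      double prod = 1.0;
      for (int k = 0; k < z.size(); ++k) prod *= 1.0 - z(k) * z(k);
      // diagonal of Sigma is 1 1 1, and det(Sigma) equals prod(1 - z^2)
      std::cout << Sigma.diagonal().transpose() << "\n"
                << Sigma.determinant() << " vs " << prod << std::endl;
    }

At least numerically, the diagonal of Sigma comes out as 1 and its determinant matches \prod (1 - z^2), which is consistent with the code mapping partial correlations on a canonical vine to the Cholesky factor.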

It's the cells with 2 < j < i that are affected, as the first column is the partial correlation itself and therefore the second column is also fine. For these cells, I think the code has to have a second accumulator variable that tracks the product of \sqrt{1-w_{ik}^2}, k<j, in addition to the accumulator variable that tracks the sum of the squared cells, which is needed for the diagonal term.

Roger Cooke and Dorota Kurowicka wrote back to me confirming that there is a typo in the chapter and that we need the square root for the diagonal elements.
