Concentration of Measure with Correlation Matrices



A number of people have been trying to come up with intuitive ways to convey the importance of the concept known as concentration of measure. Usually, this is done with examples involving independent Bernoulli random variables or independent standard normal random variables. Here I want to use random correlation matrices under the LKJ distribution, which provides some additional intuition iff you find linear algebra intuitive.

The PDF of the LKJ distribution is
f\left(\left.\boldsymbol{\Sigma}\right|\eta\right) = \frac{1}{c\left(K,\eta\right)}\left|\boldsymbol{\Sigma}\right|^{\eta - 1},
where \boldsymbol{\Sigma} is a K \times K correlation matrix, meaning it is symmetric, has all ones along its diagonal, and is positive semi-definite. The parameter space of correlation matrices of size K is denoted \boldsymbol{\Theta}. The normalizing constant is c\left(K, \eta\right) = \int_{\boldsymbol{\Theta}} \left|\boldsymbol{\Sigma}\right|^{\eta - 1} d\sigma_{12},d\sigma_{13},\dots = 2^{\sum_{k = 1}^{K - 1}\left(2\eta - 2 + K - k\right)\left(K - k\right)}\prod_{k = 1}^{K - 1}B\left(\eta + \frac{K - k - 1}{2}, \eta + \frac{K - k - 1}{2}\right)^{K - k}
where B\left(a,b\right) is the Beta function.

I think this example is useful because the determinant function is already a measure of volume that we raise to the power of \eta - 1 and accumulate over the space of correlation matrices to obtain the normalizing constant c\left(K, \eta\right). The normalizing constant varies as a function of K as follows:

[Figure: the normalizing constant c\left(K, \eta\right) plotted against K]

It is interesting that the normalizing constant actually peaks at some K and then steadily declines, meaning that the correlation matrices of size K get packed into a smaller cone within a \binom{K}{2}-dimensional parameter space.
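To see the rise and fall numerically, here is a minimal sketch that evaluates the logarithm of c\left(K, \eta\right) directly from the closed-form expression given earlier. The function names (`log_beta`, `log_lkj_constant`, `peak_size`) and the upper limit `max_K` are my own choices for illustration, and the choice \eta = 1 in the printed table is just one convenient case (there c\left(K, 1\right) is the plain volume of the set of correlation matrices).

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_lkj_constant(K, eta):
    """Log of c(K, eta), summing the closed form over k = 1, ..., K - 1."""
    total = 0.0
    for k in range(1, K):
        j = K - k                      # the quantity K - k in the formula
        b = eta + (j - 1) / 2.0        # both arguments of the Beta function
        total += (2.0 * eta - 2.0 + j) * j * log(2.0)  # exponent of 2
        total += j * log_beta(b, b)                    # Beta term to the power K - k
    return total


def peak_size(eta, max_K=50):
    """Size K in 2..max_K at which c(K, eta) is largest (max_K is arbitrary)."""
    return max(range(2, max_K + 1), key=lambda K: log_lkj_constant(K, eta))


# Tabulate c(K, 1) for small K and report where it peaks.
for K in range(2, 11):
    print(K, exp(log_lkj_constant(K, 1.0)))
print("peak at K =", peak_size(1.0))
```

Working on the log scale avoids overflow or underflow for larger K. Dividing c\left(K, 1\right) by 2^{K(K-1)/2}, the volume of the enclosing cube of off-diagonal values, would give the share of that cube occupied by valid correlation matrices.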

This packing is due to the positive semi-definiteness constraint. When K = 3, \left|\boldsymbol{\Sigma}\right| = 1 + 2\sigma_{1,2}\sigma_{1,3}\sigma_{2,3} - \left(\sigma_{1,2}^2 + \sigma_{1,3}^2 + \sigma_{2,3}^2\right). The positive semi-definiteness constraint entails that the determinant is non-negative. If both \sigma_{1,2} = 0 and \sigma_{1,3} = 0, then \sigma_{2,3} can be any number between -1 and 1. Otherwise, the range of \sigma_{2,3} is restricted to a proper subset of the \left(-1,1\right) interval.

For a general K, we can write \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{-K,-K} & \boldsymbol{\sigma} \\ \boldsymbol{\sigma}^\top & 1\end{bmatrix} in partitioned form, where \boldsymbol{\Sigma}_{-K,-K} is the upper-left submatrix of \boldsymbol{\Sigma} of size K - 1 and \boldsymbol{\sigma} is a vector of correlations between the last variable and the previous K - 1 variables. Then, \left|\boldsymbol{\Sigma}\right| = \left|\boldsymbol{\Sigma}_{-K,-K}\right| \times \left(1 - \boldsymbol{\sigma}^\top \boldsymbol{\Sigma}_{-K,-K}^{-1} \boldsymbol{\sigma}\right). The positive semi-definiteness restriction entails that 0 \leq \boldsymbol{\sigma}^\top \boldsymbol{\Sigma}_{-K,-K}^{-1} \boldsymbol{\sigma} \leq 1, which can always be satisfied (for any non-singular \boldsymbol{\Sigma}_{-K,-K}) by making \boldsymbol{\sigma} sufficiently close to the origin, but it becomes more difficult to satisfy as \boldsymbol{\Sigma}_{-K,-K} gets farther away from the identity matrix. Thus, as K increases, the K \times K correlation matrices become increasingly concentrated around the identity matrix, which is the correlation matrix with the largest determinant, namely 1.
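The K = 3 restriction can be made explicit. Setting the determinant to zero and solving the resulting quadratic in \sigma_{2,3} gives the admissible interval \sigma_{1,2}\sigma_{1,3} \pm \sqrt{\left(1 - \sigma_{1,2}^2\right)\left(1 - \sigma_{1,3}^2\right)}. The sketch below encodes that interval; the function names `det3` and `s23_interval` are hypothetical and used only for this illustration.

```python
from math import sqrt


def det3(s12, s13, s23):
    """Determinant of a 3 x 3 correlation matrix with the given off-diagonals."""
    return 1.0 + 2.0 * s12 * s13 * s23 - (s12 ** 2 + s13 ** 2 + s23 ** 2)


def s23_interval(s12, s13):
    """Endpoints of the s23 values for which det3 is non-negative.

    Obtained by solving det3 = 0 as a quadratic in s23, assuming s12 and
    s13 both lie in [-1, 1].
    """
    center = s12 * s13
    half_width = sqrt((1.0 - s12 ** 2) * (1.0 - s13 ** 2))
    return center - half_width, center + half_width


# With both other correlations at zero, the full range is available.
print(s23_interval(0.0, 0.0))
# With non-zero correlations, the interval shrinks and shifts.
print(s23_interval(0.8, 0.5))
```

The half-width is the product of \sqrt{1 - \sigma_{1,2}^2} and \sqrt{1 - \sigma_{1,3}^2}, so the interval is the whole of [-1, 1] only when both of the other correlations are zero and collapses to a single point when either of them is \pm 1.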

The expectation of \boldsymbol{\Sigma} is the identity matrix for any K and \eta > 0, which can be understood in a hand-wavy fashion by remembering that the determinant is unaffected when multiplying both the i-th row and i-th column of \boldsymbol{\Sigma} by -1. Due to this symmetry, positive off-diagonal contributions are exactly offset by negative off-diagonal contributions in \mathbb{E}\left[\boldsymbol{\Sigma}\right] = \int_{\boldsymbol{\Theta}} \boldsymbol{\Sigma} \frac{1}{c\left(K,\eta\right)}\left|\boldsymbol{\Sigma}\right|^{\eta - 1} d\sigma_{12},d\sigma_{13},\dots = \mathbf{I}. However, the PDF has a unique mode only if \eta > 1, in which case the mode is also the identity matrix. If \eta = 1, then all correlation matrices of size K have density \frac{1}{c\left(K,\eta\right)}. If \eta < 1, then there is no unique mode, but every singular \boldsymbol{\Sigma} has infinite density because its determinant is zero and raised to a negative power in the PDF. Moreover, if \eta < 1, then the PDF has a unique trough at the identity matrix, where the density is \frac{1}{c\left(K,\eta\right)}.
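The sign-flip symmetry is easy to check for K = 3, where flipping the first row and column negates \sigma_{1,2} and \sigma_{1,3} but leaves \sigma_{2,3} alone. The following sketch uses hypothetical helper names (`det3`, `flip_first_variable`, `lkj_kernel`) and works with the unnormalized density \left|\boldsymbol{\Sigma}\right|^{\eta - 1}, so the constant c\left(K, \eta\right) is deliberately left out.

```python
def det3(s12, s13, s23):
    """Determinant of a 3 x 3 correlation matrix with the given off-diagonals."""
    return 1.0 + 2.0 * s12 * s13 * s23 - (s12 ** 2 + s13 ** 2 + s23 ** 2)


def flip_first_variable(s12, s13, s23):
    """Multiply row 1 and column 1 by -1: s12 and s13 change sign, s23 does not."""
    return -s12, -s13, s23


def lkj_kernel(s12, s13, s23, eta):
    """Unnormalized LKJ density |Sigma|^(eta - 1); requires a positive determinant."""
    return det3(s12, s13, s23) ** (eta - 1.0)


point = (0.3, -0.2, 0.5)               # an arbitrary positive definite example
mirror = flip_first_variable(*point)

# Same determinant, hence same density, at the point and its mirror image.
print(det3(*point), det3(*mirror))
# With eta = 1/2 the kernel at the identity is the smallest possible value.
print(lkj_kernel(0.0, 0.0, 0.0, 0.5), lkj_kernel(*point, 0.5))
```

Because each matrix and its mirror image carry equal density while their first-row correlations have opposite signs, those correlations average to zero, and the same argument applied to each variable in turn covers every off-diagonal element.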

Going forward, let \eta = \frac{1}{2}, in which case it can be shown that the marginal PDF of each correlation is f\left(\sigma_{i,j}|\eta = \frac{1}{2}\right) = \frac{\Gamma\left(\frac{K}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{K - 1}{2}\right)} \left(1 - \sigma_{i,j}^2\right)^{\frac{K - 3}{2}}, which for K > 3 has a marginal mode at \sigma_{i,j} = 0. Here we see a paradox like many that can arise with multidimensional probability: the marginal density of each \sigma_{i,j} is maximized at zero, but the joint density of \boldsymbol{\Sigma} is minimized when all of the correlations are zero. Conversely, the marginal density of each \sigma_{i,j} is minimized at \pm 1, but the joint density of \boldsymbol{\Sigma} is maximized (indeed infinite) when all of the correlations are \pm 1 (rendering it singular).
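As a sanity check on the marginal density, the sketch below evaluates it and integrates it numerically over (-1, 1). The constant used in the code is \Gamma\left(\frac{K}{2}\right) / \left(\sqrt{\pi}\,\Gamma\left(\frac{K - 1}{2}\right)\right), which is the value that makes the kernel \left(1 - \sigma_{i,j}^2\right)^{\frac{K - 3}{2}} integrate to one. The names `marginal_pdf` and `integrate_midpoint` and the grid size are assumptions made for this example.

```python
from math import exp, lgamma, log, pi


def marginal_pdf(sigma, K):
    """Marginal density of one correlation under LKJ with eta = 1/2."""
    # Log of Gamma(K/2) / (sqrt(pi) * Gamma((K - 1)/2)).
    log_const = lgamma(K / 2.0) - 0.5 * log(pi) - lgamma((K - 1) / 2.0)
    return exp(log_const) * (1.0 - sigma ** 2) ** ((K - 3) / 2.0)


def integrate_midpoint(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


for K in (3, 4, 5, 10):
    total = integrate_midpoint(lambda s: marginal_pdf(s, K), -1.0, 1.0)
    print(K, marginal_pdf(0.0, K), total)
```

For K = 3 the exponent is zero, so the density is flat; for larger K the height at zero grows, which is the marginal concentration around zero that the paradox relies on.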

Below is a plot for the K = 3 case (so each margin is actually uniform on (-1,1)), using lighter colors to represent higher values of the joint density. Near the corners of the parameter space there is more yellow and along the faces there are even some white points where the correlation matrix is almost singular, but the overwhelming majority of the points are red, indicating low density.

[Figure: points (\sigma_{1,2}, \sigma_{1,3}, \sigma_{2,3}) for K = 3 and \eta = \frac{1}{2}, colored by joint density]
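Points like those in such a plot can be generated without MCMC. The sketch below is not the code behind the original figure; it assumes the partial-correlation (C-vine) construction from the LKJ paper, which for K = 3 draws \sigma_{1,2} and \sigma_{1,3} from a Beta\left(\eta + \frac{1}{2}, \eta + \frac{1}{2}\right) rescaled to (-1, 1), draws the partial correlation of variables 2 and 3 given variable 1 from a rescaled Beta\left(\eta, \eta\right), and then converts that partial correlation into \sigma_{2,3}. The function name `draw_lkj3`, the seed, the sample size, and the cutoff of 0.25 are all arbitrary choices for illustration.

```python
import random
from math import sqrt


def draw_lkj3(eta, rng):
    """One draw (s12, s13, s23, det) for K = 3 via partial correlations."""
    def scaled_beta(shape):
        # Symmetric Beta(shape, shape) draw, rescaled from (0, 1) to (-1, 1).
        return 2.0 * rng.betavariate(shape, shape) - 1.0

    s12 = scaled_beta(eta + 0.5)
    s13 = scaled_beta(eta + 0.5)
    partial = scaled_beta(eta)          # correlation of 2 and 3 given 1
    scale = sqrt((1.0 - s12 ** 2) * (1.0 - s13 ** 2))
    s23 = s12 * s13 + partial * scale
    # The determinant factors into one term per (partial) correlation.
    det = (1.0 - s12 ** 2) * (1.0 - s13 ** 2) * (1.0 - partial ** 2)
    return s12, s13, s23, det


rng = random.Random(12345)              # fixed seed for reproducibility
draws = [draw_lkj3(0.5, rng) for _ in range(20000)]

# The joint kernel |Sigma|^(-1/2) is below 2 exactly when the determinant
# exceeds 0.25, so this share measures how common low-density points are.
share_low = sum(d[3] > 0.25 for d in draws) / len(draws)
print("share of draws with kernel below 2:", share_low)
```

Comparing determinants against a cutoff, rather than computing \left|\boldsymbol{\Sigma}\right|^{-1/2} itself, sidesteps division by a determinant that is numerically zero for nearly singular draws.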

This paradoxical result when K \geq 3 and \eta = \frac{1}{2} is a specific instance of a more general phenomenon: you cannot easily extrapolate your intuition (particularly about modes) from unidimensional spaces to “high” dimensional spaces, even when “high” is a single-digit number. Moreover, seeking a mode would yield some point far removed from the mean of the distribution, which can instead be estimated from draws obtained via NUTS. With correlation matrices, the positive semi-definiteness constraint plays an important role and induces high dependence between the margins. However, when we condition on data, that can also induce high dependence between parameters, even if none of them are correlation matrices. If so, the joint mode can be far from the posterior means and / or medians and a really bad point for representing the posterior distribution as a whole.