Speedups for Wishart model

Is there a way that can be used to define the Wishart density directly on the Cholesky factor? I keep forgetting to ask @bgoodri about this.

https://en.wikipedia.org/wiki/Wishart_distribution

X = L * A * A^T * L^T = L * A * (L * A)^T

And from:
https://en.wikipedia.org/wiki/Cholesky_decomposition


“The Cholesky decomposition is unique when A is positive definite; there is only one lower triangular matrix L with strictly positive diagonal entries such that A = LL*. However, the decomposition need not be unique when A is positive semidefinite.”

So, L * A (1st citation) gives us the Cholesky decomposition of X. (LA is triangular, and uniqueness, if sym.psd)

Uniqueness of Cholesky (with positive diagonals) is why we can use it for the unconstrained representation of positive-definite matrix (type cov_matrix in Stan).

My question is whether we can implement wishart_cholesky(eta, L_V) where L_V is Cholesky factor of a positive-definite matrix with efficient derivatives. Ideally, the outcome would itself be a Cholesky factor, but it could also be a whole positive-definite matrix. So that’d be

wishart_cholesky(cov_matrix X, real eta, cholesky_cov L_V)

wishart_cholesky(cholesky_cov L_X, real eta, cholesky_cov L_V)

To do that efficiently, we need to be able to express the Wishart in terms of Cholesky factors and efficiently compute derivatives.

Inverse Wishart would also be useful.

V is your scale matrix, and L is the lower cholesky decomposition matrix of V. The scale matrix is usually constant. It is the a-prior assumption to the “covariance”/DF factor. The eta in terms of LKJ would be a scale to the diagonal elements of V in wishart. eta in Stan is mostly also chosen a-priori.

The inverse-Wishart requires an inverse of the scale matrix. https://en.wikipedia.org/wiki/Inverse-Wishart_distribution
The problem was the barlett-decomposition and I ended up with an upper triangular matrix. I solved this case
by applying a permutation matrix to the scale matrix beforehand and then a rearrangement in Stan instead of applying the permutation matrix again, see also: https://math.stackexchange.com/questions/1372166/a-ul-factorization
I did this 1 1/2 years back, still have the code though, but this needs some checking. Before somebody start to laugh about this approach, I better not post.

I agree that the typical use case has eta and L_V as data (double values) and X as a parameter (stan::math::var—the autodiff type).

We can parameterize however is convenient (scale matrix or inverse scale matrix) analogously to how we parameterize the multivariate normal.

Also, if we have this compound:

data {
  cov_matrix[K, K] V;
  real<lower = 0> nu;
  ...
parameters {
  cov_matrix[K] Sigma;
model {
  Sigma ~ inv_wishart(nu, V);  // or V_inv if that's better
  y ~ multi_normal(mu, Sigma);

isn’t the posterior for Sigma also inverse Wishart? If so, can we just code an inverse Wishart for the posterior directly and avoid having to compute the inverse Wishart prior and multivariate normal likelihood?

Bob,

https://en.wikipedia.org/wiki/Inverse-Wishart_distribution
see subsection conjungate distribution. I hope you mean that.

There are some nice variants of IW here:
https://arxiv.org/pdf/1408.4050.pdf

Yes.

Rather than scaled inverse wishart, we tend to scale an LKJ prior on the correlation matrix. In practice, that’s done with a Cholesky factor for the correlation matrix, which is more efficient and numerically stable. There’s no conjugacy to exploit there.

What I’d like to be able to do is have everything done with Cholesky factors for the Wishart cases if possible. Something like

data {
  cov_matrix[K] Psi;
  ...
parameters {
  cholesky_factor_cov[K] L;

model {
  mu ~ multi_normal(...);
  L ~ cholesky_inv_wishart(nu, Psi);
  y ~ multi_normal_cholesky(mu, L);

Then, given that everything’s conjugate, it’d be nice to have an even more efficient expression for L and mu in the posterior that just involves a single Wishart term for L exploiting conjugacy and a single term for mu also exploiting conjugacy. I think that’d be possible. Not sure how to write the function down since I’m not that good with these distributions. But the point is that you don’t need to write the joint log density, you can write anything proportional to the posterior log density.

“is more efficient and numerically stable”, only for small dimensions.
I experienced a log(0) errors of Stan, using dimension around 50 for LKJ. See also: Ignacio Alvarez,
Bayesian Inference for a Covariance Matrix:
“In high dimensions (d > 10), the LKJ prior concentrates mass near zero correlations and therefore we do not use it here.” (2.4) link above

The Wishart/InvWishart, especially the Barlett decomp. of Wishart is really fast to sample.
I’m not against LKJ, its great for low dimensions and smaller models.

“What I’d like to be able to do is have everything done with Cholesky factors for the Wishart cases if possible. Something like”

I have to think about some time and read some papers.

The Normal Inverse Wishart Distribution is defined here:

https://math.stackexchange.com/questions/2469310/computing-the-normalisation-of-posterior-normal-inverse-wishart

details here:

http://www.kamperh.com/notes/kamper_bayesgmm13.pdf

For Stan, using the Cholesky factor is always more efficient and stable. If you declare a variable as a covariance matrix, as in

cov_matrix[K] Sigma;

then it is represented as a Cholesky factor (with log diagonals) under the hood in unconstrained parameter space. Then it’s multiplied by its transpose to get Sigma and the log Jacobian of the product and exponentiation is added to the target. Then it gets factored and the derivative of that is calculated.

For the reparameterization of the inverse Wishart distribution, how is the cholesky factor of the inverse Wishart distribution with scale V passed to multi_normal_cholesky? Assuming I use the reparameterization suggested by the Stan reference,

Would it be:

multi_normal_cholesky(mu, A_inv_L_inv);

Thank you.

I don’t think we have built-ins for any of that. Our multi-normal on the Cholesky scale takes a covariance matrix, not a precision matrix.

There may be a clever way to go from Cholesky factor of covariance to Cholesky factor of precision. It’d be better if we just had a multi-normal Cholesky precision distribution. You could write that one yourself, but there’s still a log determinant term that I also think we have not specialized to Cholesky factors.

@bgoodri and @anon75146577 and @rtrangucci know much more about this than me (so do a lot of other people, but the three listed above are actively working on building out our general matrix libs efficiently).

It got merged


although you said that you were going to generalize the testing framework in light of it. In any event, it is not exposed to the Stan language in any released version of Stan yet, but it is just algebra and index fiddling.

That still needs to be done, but in the interim, we can add the function. I assigned myself:

With all respect isn’t that goodmattg refering to the inverse of the Cholesky decomposition of the
Wishart distribution, also called inverse-Wishart? And isn’t the inverse of a triangular matrix easy solved
by backward substitution as recommended here?
https://stackoverflow.com/questions/25662643/what-is-the-way-to-invert-a-triangular-matrix-in-r

You don’t need much. I’m an admitted matrix amateur!

Apparently it is, hence Ben’s chol2inv. We’ll get that exposed in Stan for the next release so it’s a built-in.

Any idea when this will get exposed to Stan?

Btw, I’m getting this warning

and, yes, it’s exactly what I want in this case!

There was a longstanding issue for this that was literally a 3-minute fix.

However this will need docs before merging, so if anyone wants to write a few lines for the user docs, that would be great. I am slow with docs and quite busy right now, so would appreciate any help.

It can probably be added to the Cholesky decompose subsubsection in https://mc-stan.org/docs/2_25/functions-reference/linear-algebra-functions-and-solvers.html
Nothing fancy, just a short description.

1 Like

It was intended to be the same as chol2inv in R / LAPACK / everything else, except that in Stan Cholesky factors are lower triangular rather than upper triangular.