I’ve been reading the ADVI (automatic differentiation variational inference) paper by Kucukelbir et al. and trying to understand it. I think I understand everything except the following part:
The equation computes the gradient with respect to the lower-triangular matrix L (the Cholesky factor), which in turn defines the covariance matrix of the full-rank Gaussian (one of the variational parameters the algorithm optimizes, along with the mean of the Gaussian). The part I don’t understand is the inversion of the matrix L. Matrix inversion scales as O(n^3), where n is the number of parameters, yet the paper argues that the time complexity of the algorithm scales linearly with the number of parameters. There must be a very efficient way to compute the inverse of L in every iteration, but the paper does not specify it. Does anyone know how? Thank you.
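For reference, the term I am asking about is the one coming from the entropy of the full-rank Gaussian; written out in my own notation (so it may not match the paper symbol-for-symbol), it is

$$
\mathbb{H}\!\left[q\right] \;=\; \tfrac{1}{2}\log\det\!\left(2\pi e\, L L^{\top}\right)
\;=\; \sum_{k}\log L_{kk} \;+\; \text{const},
\qquad
\nabla_{L}\,\tfrac{1}{2}\log\det\!\left(L L^{\top}\right) \;=\; \left(L^{-1}\right)^{\top},
$$

and it is this $(L^{-1})^{\top}$ in the gradient update that looks like an $O(n^3)$ operation to me.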
EDIT: I looked into the code where the gradient of L is calculated and found the following line:
This suggests that only the diagonal entries of L are inverted and used for the gradient of L. Does anyone know why only the diagonal entries are inverted? Thank you.
By default, ADVI in Stan uses the "meanfield" approximation, which assumes an approximating distribution with independent Gaussian margins. It is also possible to specify "fullrank", which uses the full covariance matrix. See here
I do understand the difference between the mean-field model (independent parameter assumption) and the full-rank Gaussian model. As I wrote in the original post, I am only interested in the full-rank Gaussian model, and the code snippet I included is from the normal_fullrank.hpp file, which I assume implements the full-rank Gaussian approximation.
In the case of the mean-field approximation, the gradient of the entropy term is just 1, as follows:
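(My sketch, using the $\omega_k = \log \sigma_k$ parameterization that I believe the paper uses for the mean-field case:)

$$
\mathbb{H}\!\left[q\right] \;=\; \sum_{k=1}^{K}\left(\omega_k + \tfrac{1}{2}\log 2\pi e\right)
\quad\Longrightarrow\quad
\nabla_{\omega}\,\mathbb{H}\!\left[q\right] \;=\; \mathbf{1},
$$

so every coordinate of the entropy gradient is exactly 1 and nothing needs to be inverted at all.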
@yuling That’s exactly what I don’t understand. From the code (normal_fullrank.hpp), it seems that only the diagonal elements are inverted, but I don’t understand why.
As shown in the figure, the log determinant of (LL^T) term is where the L^-1 comes from. It occurs to me that since the determinant can be computed as the squared product of the diagonal elements of the Cholesky factor, when we take the derivative w.r.t. L we perhaps only consider the diagonal elements? Then it is strange that in the paper they actually wrote L^-1… Can anybody confirm my train of thought?
Then if we take the derivative w.r.t. L, the determinant is not sensitive to changes in any elements of L except the diagonal ones. Hence the derivative can be computed from only its diagonal elements?
This is my guess, so I obviously need someone to confirm this…
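To spell out the computation I have in mind (a rough sketch, treating only the lower-triangular entries of L as free parameters):

$$
\log\det\!\left(L L^{\top}\right) \;=\; 2\sum_{k}\log L_{kk}
\quad\Longrightarrow\quad
\frac{\partial}{\partial L_{ij}}\,\tfrac{1}{2}\log\det\!\left(L L^{\top}\right)
\;=\;
\begin{cases}
1/L_{ii}, & i = j,\\[2pt]
0, & i \neq j,
\end{cases}
$$

so per iteration the entropy part of the gradient only needs the reciprocals of the n diagonal entries, which is O(n) rather than O(n^3).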
Unfortunately, @yuling, L^{-T} is not quite right here, because L is not an arbitrary matrix: it lives only in the space of lower-triangular matrices. Furthermore, since L is triangular, det(LL^T) = (prod_i L_ii)^2, so the determinant depends only on the diagonal of L. When we perturb L, we are constrained by the fact that the upper triangle cannot move! Thus the derivative is just the reciprocals of the diagonal elements, and the code is correct.
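If it helps, here is a small stand-alone check (plain Eigen, written only for illustration; it is not Stan’s actual code in normal_fullrank.hpp) that projecting $(L^{-1})^{\top}$ onto the lower-triangular space leaves only the reciprocal diagonal of L:

```cpp
// Hypothetical illustration only (plain Eigen, not Stan's normal_fullrank.hpp).
// Claim: for lower-triangular L, the lower-triangular part of (L^{-1})^T is
// diag(1 / L_ii), so the entropy part of the gradient costs O(n), not O(n^3).
#include <iostream>
#include <Eigen/Dense>

int main() {
  const int n = 5;
  Eigen::MatrixXd L = Eigen::MatrixXd::Random(n, n);
  L.triangularView<Eigen::StrictlyUpper>().setZero();          // make L lower triangular
  L.diagonal() = (L.diagonal().array().abs() + 1.0).matrix();  // positive diagonal => invertible

  // The term as written in the paper: a full inverse, then a transpose.
  Eigen::MatrixXd Linv_T = L.inverse().transpose();            // upper triangular

  // Project onto the lower-triangular space, i.e. the only directions in
  // which L is allowed to be perturbed.
  Eigen::MatrixXd projected = Linv_T;
  projected.triangularView<Eigen::StrictlyUpper>().setZero();

  // Check 1: the projection has nothing off the diagonal.
  Eigen::VectorXd d = projected.diagonal();
  std::cout << "off-diagonal mass: "
            << (projected - Eigen::MatrixXd(d.asDiagonal())).norm() << "\n";

  // Check 2: its diagonal equals 1 / diag(L), the O(n) quantity in the code.
  std::cout << "diagonal mismatch: "
            << (d - L.diagonal().cwiseInverse()).norm() << "\n";
  return 0;
}
```

Both printed numbers come out (numerically) zero, which matches the statement above: the unconstrained $(L^{-1})^{\top}$ and the constrained derivative agree exactly on the only entries that are allowed to move.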