I’ve been reading the ADVI (automatic differentiation variational inference) paper by Kucukelbir et al. and trying to understand it. I think I understand everything except the following part:
The equation computes the gradient with respect to the lower-triangular matrix L (the Cholesky factor), which in turn defines the covariance matrix of the full-rank Gaussian (one of the variational parameters the algorithm optimizes, along with the mean of the Gaussian). The part I don’t understand is the inversion of the matrix L. Matrix inversion scales as O(n^3), where n is the number of parameters, yet the paper argues that the time complexity of the algorithm scales linearly with the number of parameters. There must be a very efficient way to compute the inverse of L in every iteration, but the paper does not specify it. Does anyone know how? Thank you.
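For reference, the term I am asking about is the one coming from the entropy of the full-rank Gaussian; written out in my own notation (so it may not match the paper symbol-for-symbol), it is

$$
\mathbb{H}\!\left[q\right] \;=\; \tfrac{1}{2}\log\det\!\left(2\pi e\, L L^{\top}\right)
\;=\; \sum_{k}\log L_{kk} \;+\; \text{const},
\qquad
\nabla_{L}\,\tfrac{1}{2}\log\det\!\left(L L^{\top}\right) \;=\; \left(L^{-1}\right)^{\top},
$$

and it is this $(L^{-1})^{\top}$ in the gradient update that looks like an $O(n^3)$ operation to me.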
EDIT: I looked into the code where the gradient of L is calculated and found the following line:
This suggests that only the diagonal entries of L are inverted and used for the gradient of L. Does anyone know why only the diagonal entries are inverted? Thank you.
By default, ADVI in Stan uses the "meanfield" approximation, which assumes an approximating distribution with independent Gaussian margins. It is also possible to specify "fullrank", which uses the full covariance matrix. See here
I do understand the difference between the mean-field model (independent parameter assumption) and the full-rank Gaussian model. As I wrote in the original post, I am only interested in the full-rank Gaussian model, and the code snippet I included is from the normal_fullrank.hpp file, which I assume implements the full-rank Gaussian approximation.
In the case of the mean-field approximation, the gradient of the entropy term is just 1, as follows:
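(My sketch, using the $\omega_k = \log \sigma_k$ parameterization that I believe the paper uses for the mean-field case:)

$$
\mathbb{H}\!\left[q\right] \;=\; \sum_{k=1}^{K}\left(\omega_k + \tfrac{1}{2}\log 2\pi e\right)
\quad\Longrightarrow\quad
\nabla_{\omega}\,\mathbb{H}\!\left[q\right] \;=\; \mathbf{1},
$$

so every coordinate of the entropy gradient is exactly 1 and nothing needs to be inverted at all.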
@yuling That’s exactly what I don’t understand. From the code (normal_fullrank.hpp), it seems that only the diagonal elements are inverted, but I don’t understand why.
As shown in the figure, the log determinant of (LL^T) term is where the L^-1 comes from. It occurs to me that since the determinant can be computed as the squared product of the diagonal elements of the Cholesky factor, when we take the derivative w.r.t. L we perhaps only consider the diagonal elements? Then it is strange that in the paper they actually wrote L^-1… Can anybody confirm my train of thought?
Then if we take the derivative w.r.t. L, the determinant is not sensitive to changes in any elements of L except the diagonal ones. Hence the derivative can be computed from only its diagonal elements?
This is my guess, so I obviously need someone to confirm this…
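To spell out the computation I have in mind (a rough sketch, treating only the lower-triangular entries of L as free parameters):

$$
\log\det\!\left(L L^{\top}\right) \;=\; 2\sum_{k}\log L_{kk}
\quad\Longrightarrow\quad
\frac{\partial}{\partial L_{ij}}\,\tfrac{1}{2}\log\det\!\left(L L^{\top}\right)
\;=\;
\begin{cases}
1/L_{ii}, & i = j,\\[2pt]
0, & i \neq j,
\end{cases}
$$

so per iteration the entropy part of the gradient only needs the reciprocals of the n diagonal entries, which is O(n) rather than O(n^3).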
Unfortunately, @yuling, L^{-T} is not quite right here, because L is not an arbitrary matrix: it lives only in the space of lower-triangular matrices. Furthermore, since L is triangular, det(LL^T) = (prod_i L_ii)^2, so the determinant depends only on the diagonal of L. When we perturb L, we are constrained by the fact that the upper triangle cannot move! Thus the derivative is just the reciprocals of the diagonal elements, and the code is correct.
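If it helps, here is a small stand-alone check (plain Eigen, written only for illustration; it is not Stan’s actual code in normal_fullrank.hpp) that projecting $(L^{-1})^{\top}$ onto the lower-triangular space leaves only the reciprocal diagonal of L:

```cpp
// Hypothetical illustration only (plain Eigen, not Stan's normal_fullrank.hpp).
// Claim: for lower-triangular L, the lower-triangular part of (L^{-1})^T is
// diag(1 / L_ii), so the entropy part of the gradient costs O(n), not O(n^3).
#include <iostream>
#include <Eigen/Dense>

int main() {
  const int n = 5;
  Eigen::MatrixXd L = Eigen::MatrixXd::Random(n, n);
  L.triangularView<Eigen::StrictlyUpper>().setZero();          // make L lower triangular
  L.diagonal() = (L.diagonal().array().abs() + 1.0).matrix();  // positive diagonal => invertible

  // The term as written in the paper: a full inverse, then a transpose.
  Eigen::MatrixXd Linv_T = L.inverse().transpose();            // upper triangular

  // Project onto the lower-triangular space, i.e. the only directions in
  // which L is allowed to be perturbed.
  Eigen::MatrixXd projected = Linv_T;
  projected.triangularView<Eigen::StrictlyUpper>().setZero();

  // Check 1: the projection has nothing off the diagonal.
  Eigen::VectorXd d = projected.diagonal();
  std::cout << "off-diagonal mass: "
            << (projected - Eigen::MatrixXd(d.asDiagonal())).norm() << "\n";

  // Check 2: its diagonal equals 1 / diag(L), the O(n) quantity in the code.
  std::cout << "diagonal mismatch: "
            << (d - L.diagonal().cwiseInverse()).norm() << "\n";
  return 0;
}
```

Both printed numbers come out (numerically) zero, which matches the statement above: the unconstrained $(L^{-1})^{\top}$ and the constrained derivative agree exactly on the only entries that are allowed to move.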