Hi all,

We’ve recently uploaded to arXiv a theoretical analysis of ADVI (although the title says black-box VI), which I believe is the first full formal convergence proof for any black-box VI-type algorithm that uses SGD:

[2305.15349] Black-Box Variational Inference Converges (arxiv.org).

Interestingly, the analysis reveals some unexpected properties of the covariance parameterizations we use in practice. In particular, if we use a non-linear transformation for the diagonal elements (as done for the mean-field parameterization in Stan), such as

L_{ii} = \exp\left(\ell_{ii}\right),

where \mathbf{L} = \mathrm{diag}\left(L_{11}, \ldots, L_{dd}\right) is the Cholesky factor of the variational approximation, we provably lose speed. One could have achieved an \mathcal{O}\left(1/T\right) convergence rate for nice posteriors, but instead only gets \mathcal{O}\left(1/\sqrt{T}\right). And if one does the same for *full-rank* parameterizations, as PyMC3 does (but Stan does not), the ELBO might not even be convex, even when the posterior is log-concave!
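For concreteness, here is a small sketch (my own illustration, not code from the paper) of the two ways of parameterizing the diagonal scales, drawing a sample via the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
mean = np.zeros(d)
ell = np.array([-1.0, 0.0, 1.0])   # unconstrained variational parameters
eps = rng.standard_normal(d)       # base noise for the reparameterization trick

# Non-linear parameterization (Stan's mean-field ADVI style):
# the diagonal scales are exp(ell), so positivity is built in.
scale_nonlinear = np.exp(ell)
z_nonlinear = mean + scale_nonlinear * eps

# Linear parameterization: optimize the diagonal entries of L directly.
# A negative entry is harmless, since N(mean, L L^T) depends only on L_ii^2.
scale_linear = ell
z_linear = mean + scale_linear * eps
```

The point is that the linear version needs no constraint at all, because the Gaussian density is unchanged if you flip the sign of a scale.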

In our experiments, we indeed observe that the mean-field parameterization without any `exp` or `softplus` transformation (for enforcing the scale to be positive) converges the fastest. To me, this is one of those rare occasions where optimization theory tells you precisely what happens in practice, which is unfortunately not that common.

Please let me know if you have any comments or questions.