Hi, I have a question regarding our previous discussions of variational inference and the future of Stan's ADVI implementation.
I have been continuing to implement and explore how Stan runs variational inference on very basic models (e.g. an n-dimensional mean-field Gaussian likelihood with a known diagonal covariance matrix and a mean-field Gaussian prior on the mean vector, fit with no data so that we would expect the posterior to equal the prior). Given the limited access to tuning parameters, I would have expected that vastly increasing the number of Monte Carlo samples used to compute the gradient at each iteration would make the posterior approximation consistently much more accurate than the default of 1 sample. I've found that this is not the case and was hoping to understand why.
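For concreteness, here is a numpy-only sketch of the toy setup I mean (the dimension, variational parameters, and draw counts are all values I picked for illustration, not anything from Stan):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # dimension (assumed for illustration)

# Monte Carlo ELBO for the toy model: no data, standard normal prior N(0, I),
# mean-field Gaussian q(z) = N(mu, diag(exp(log_sigma))^2), so the exact
# posterior equals the prior and the exact ELBO is -KL(q || prior).
def elbo_mc(mu, log_sigma, n_draws):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n_draws, d))
    z = mu + sigma * eps                                  # reparameterized draws
    log_p = -0.5 * np.sum(z ** 2, axis=1)                 # log prior up to a constant
    log_q = -0.5 * np.sum(eps ** 2, axis=1) - np.sum(log_sigma)
    return np.mean(log_p - log_q)                         # the 2*pi constants cancel

# q one standard deviation off in every coordinate: exact ELBO = -d/2 = -2.5,
# and the Monte Carlo estimate tightens around it as n_draws grows
for n in (1, 100, 10_000):
    print(n, elbo_mc(np.ones(d), np.zeros(d), n))
```

The point of the sketch is just that the ELBO estimate itself does concentrate with more draws; my question is about why the fitted approximation doesn't improve correspondingly.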
Previously, we discussed the inequality between the expectation of a non-linear function and the non-linear function of an expectation. What are the plans for the future of variational inference in Stan? What are your concerns about stability and use?
Thanks so much in advance for your time!
That is true even as far out as 250K samples used to evaluate the nested integral approximating the KL divergence (with the sign flipped) with a Monte Carlo estimate of the ELBO.
Adapting the step size may be a problem, though. I'd try that with a grid of step sizes.
Nothing at the moment. Stan doesn't really make long-term plans. It annoyed a large number of our devs when we tried to roll out a roadmap, because you can never get consensus.
We did add Pathfinder fairly recently.
Both ADVI and Pathfinder are notoriously unstable. Especially ADVI as we have it configured.
Beyond general instability, Pathfinder can stably (as in reproducibly) degenerate to an extremely over-concentrated distribution (I'm really wondering if that's a bug). The real problem with ADVI is the step-size adaptation and the reparameterization gradient; we'd like to replace the latter with the stick-the-landing gradient estimator for when we're not using 100K draws to evaluate the ELBO.
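For intuition on why stick-the-landing helps, here's a toy numpy sketch (my own illustration with assumed values, not Stan code): when the variational family contains the exact posterior, the path-derivative-only ("stick-the-landing") gradient estimator has zero variance at the optimum, while the standard reparameterization estimator does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000

# variational params set exactly to the posterior: q = N(0, I) = p
mu, sigma = np.zeros(d), np.ones(d)
eps = rng.standard_normal((n, d))
z = mu + sigma * eps

grad_log_p = -z                        # grad_z log p(z) for p = N(0, I)
grad_log_q = -(z - mu) / sigma ** 2    # grad_z log q(z; phi) with phi held fixed

# standard reparameterization gradient w.r.t. mu: for a Gaussian q, the two
# log q terms (path and score) cancel analytically, leaving grad_z log p(z)
grad_std = grad_log_p
# stick-the-landing: path derivative only, grad_z[log p(z) - log q(z; stop_grad(phi))]
grad_stl = grad_log_p - grad_log_q

print("std estimator variance:", grad_std.var())
print("STL estimator variance:", grad_stl.var())  # zero at the optimum
```

Away from the optimum the STL estimator is still unbiased but is no longer exactly zero-variance; the point is that its variance vanishes as q approaches the posterior, which is what lets you get away with few ELBO draws.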
Overall, I don’t see much future for normal approximations on the unconstrained scale. Normalizing flows are just so much better if you can afford a good GPU. So I think that’s going to be the future for VI. We have RealNVP normalizing flows that work better than NUTS in some cases (e.g., a hierarchical IRT-2PL model that’s only identified with centered priors).
I’d highly recommend the following two papers, which are ostensibly about normalizing flows for VI, but have a lot of general advice for variational inference.