Hi, I have a question regarding our previous discussions of variational inference and the future of Stan's ADVI implementation.
I have been continuing to implement and explore how Stan runs variational inference on very basic models (e.g. an n-dimensional mean-field Gaussian likelihood with a known diagonal covariance matrix and a mean-field Gaussian prior on the mean vector, fit with no data so that we would expect the posterior to equal the prior). Given the limited access to tuning parameters, I would have expected that vastly increasing the number of Monte Carlo samples used to compute the gradient at each iteration would make the posterior approximation consistently much more accurate than the default of 1 sample. I've found that this is not the case and was hoping to understand why.
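For concreteness, here is a numpy-only sketch of the toy setup I mean (the dimension, variational parameters, and draw counts are all values I picked for illustration, not anything from Stan):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5  # dimension (assumed for illustration)

# Monte Carlo ELBO for the toy model: no data, standard normal prior N(0, I),
# mean-field Gaussian q(z) = N(mu, diag(exp(log_sigma))^2), so the exact
# posterior equals the prior and the exact ELBO is -KL(q || prior).
def elbo_mc(mu, log_sigma, n_draws):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal((n_draws, d))
    z = mu + sigma * eps                                  # reparameterized draws
    log_p = -0.5 * np.sum(z ** 2, axis=1)                 # log prior up to a constant
    log_q = -0.5 * np.sum(eps ** 2, axis=1) - np.sum(log_sigma)
    return np.mean(log_p - log_q)                         # the 2*pi constants cancel

# q one standard deviation off in every coordinate: exact ELBO = -d/2 = -2.5,
# and the Monte Carlo estimate tightens around it as n_draws grows
for n in (1, 100, 10_000):
    print(n, elbo_mc(np.ones(d), np.zeros(d), n))
```

The point of the sketch is just that the ELBO estimate itself does concentrate with more draws; my question is about why the fitted approximation doesn't improve correspondingly.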
Previously, we discussed the inequality between the expectation of a non-linear function and the non-linear function of an expectation. What are the plans for the future of variational inference in Stan? What are your concerns about stability and use?
Thanks so much in advance for your time!
That is true even as far out as 250K samples used to evaluate the nested integral approximating the KL divergence (with the sign flipped) with a Monte Carlo estimate of the ELBO.
Adapting the step size may be a problem, though. I'd try that with a grid of step sizes.
Nothing at the moment. Stan doesn't really make long-term plans. It annoyed a large number of our devs when we tried to roll out a roadmap, because you can never get consensus.
We did add Pathfinder fairly recently.
Both ADVI and Pathfinder are notoriously unstable. Especially ADVI as we have it configured.
Beyond general instability, Pathfinder can stably (as in reproducibly) degenerate to an extremely over-concentrated distribution (I'm really wondering if that's a bug). The real problem with ADVI is the step-size adaptation and the reparameterization gradient; we'd like to replace the latter with the stick-the-landing gradient estimator for when we're not using 100K draws to evaluate the ELBO.
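For intuition on why stick-the-landing helps, here's a toy numpy sketch (my own illustration with assumed values, not Stan code): when the variational family contains the exact posterior, the path-derivative-only ("stick-the-landing") gradient estimator has zero variance at the optimum, while the standard reparameterization estimator does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000

# variational params set exactly to the posterior: q = N(0, I) = p
mu, sigma = np.zeros(d), np.ones(d)
eps = rng.standard_normal((n, d))
z = mu + sigma * eps

grad_log_p = -z                        # grad_z log p(z) for p = N(0, I)
grad_log_q = -(z - mu) / sigma ** 2    # grad_z log q(z; phi) with phi held fixed

# standard reparameterization gradient w.r.t. mu: for a Gaussian q, the two
# log q terms (path and score) cancel analytically, leaving grad_z log p(z)
grad_std = grad_log_p
# stick-the-landing: path derivative only, grad_z[log p(z) - log q(z; stop_grad(phi))]
grad_stl = grad_log_p - grad_log_q

print("std estimator variance:", grad_std.var())
print("STL estimator variance:", grad_stl.var())  # zero at the optimum
```

Away from the optimum the STL estimator is still unbiased but is no longer exactly zero-variance; the point is that its variance vanishes as q approaches the posterior, which is what lets you get away with few ELBO draws.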
Overall, I don’t see much future for normal approximations on the unconstrained scale. Normalizing flows are just so much better if you can afford a good GPU. So I think that’s going to be the future for VI. We have RealNVP normalizing flows that work better than NUTS in some cases (e.g., a hierarchical IRT-2PL model that’s only identified with centered priors).
I’d highly recommend the following two papers, which are ostensibly about normalizing flows for VI, but have a lot of general advice for variational inference.