Variational Bayes results seem sensible, but vary - What to change?

I have a model that works nicely with NUTS (but is a bit slow), and I am now running it with variational Bayes, mostly because it's a pretty simple model and I hope VB would be a lot faster, which would be really useful for this particular use case. I am getting sensible answers in the posterior samples, but I noticed that with different random number seeds I get more variation in e.g. median estimates than I would have expected from Monte Carlo error alone (NUTS gives me more stable results). Thus, I suspect the algorithm did not truly converge, or did not converge to the same (local?) optimum each time.

So, some of my ideas were:

  1. I could change some parameters of VB (much like one often ends up setting adapt_delta to a higher value than the default and it just improves things). However, I am not sure which parameters it would be logical to change or make more stringent first. ELBO samples? Tolerance? Something else?

  2. Just run VB several times (given that full NUTS takes several minutes and VB a few seconds, that could be an option) and average the outputs (or pick some based on some criterion of better fit?).

Does anyone have experience with this and recommendations on what to try?

For background on the problems with the current implementation and potential improvement see These improvements are not yet implemented in Stan, but you can see which parameters to vary and how that could affect the results. Also the paper discusses why averaging of variational parameters is sensible.
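To make the effect of such parameters concrete, here is a minimal toy sketch in plain Python. It is not Stan's ADVI implementation. It fits a normal approximation to a made-up 1-D target, N(2, 0.5^2), by fixed-step stochastic gradient ascent on the ELBO, and compares the seed-to-seed spread of the fitted mean for 1 versus 20 gradient draws per step (the knob Stan's ADVI calls `grad_samples`), and for the last iterate versus the average of the iterates over the second half of the run. The target, step size, iteration count and the function name `fit_vi` are all assumptions for illustration, and iterate averaging is only one possible reading of "averaging of variational parameters".

```python
import math
import random
import statistics

# Toy target (an assumption for illustration): posterior N(2, 0.5^2).
TARGET_MEAN, TARGET_SD = 2.0, 0.5


def dlogp(z):
    """Gradient of the target log density with respect to z."""
    return -(z - TARGET_MEAN) / TARGET_SD**2


def fit_vi(seed, grad_samples, iters=2000, step=0.01):
    """Fit q = N(m, exp(w)^2) by stochastic gradient ascent on the ELBO.

    Returns (last iterate of m, average of m over the second half of the run).
    """
    rng = random.Random(seed)
    m, w = 0.0, 0.0
    tail = []
    for t in range(iters):
        s = math.exp(w)
        gm = gw = 0.0
        for _ in range(grad_samples):
            eps = rng.gauss(0.0, 1.0)
            g = dlogp(m + s * eps)  # reparameterised draw z = m + s * eps
            gm += g                 # contribution to dELBO/dm
            gw += g * eps * s       # contribution to dELBO/dw (without entropy)
        m += step * gm / grad_samples
        w += step * (gw / grad_samples + 1.0)  # + 1.0 is the entropy gradient
        if t >= iters // 2:
            tail.append(m)
    return m, statistics.fmean(tail)


seeds = list(range(20))
runs_1 = [fit_vi(sd, grad_samples=1) for sd in seeds]
runs_20 = [fit_vi(sd, grad_samples=20) for sd in seeds]

# Seed-to-seed standard deviation of the fitted mean under each setting.
spread_last_1 = statistics.stdev([r[0] for r in runs_1])
spread_avg_1 = statistics.stdev([r[1] for r in runs_1])
spread_last_20 = statistics.stdev([r[0] for r in runs_20])

print("last iterate, 1 gradient draw:      ", spread_last_1)
print("averaged iterates, 1 gradient draw: ", spread_avg_1)
print("last iterate, 20 gradient draws:    ", spread_last_20)
```

In this toy setting both more gradient draws and iterate averaging should shrink the spread across seeds. How much of that carries over to a real model will depend on the model and on Stan's adaptive step size, so treat it as intuition rather than a prediction.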


Hi @avehtari
Are the samples drawn by variational inference taken after the optimization or during the optimization?

Is it possible to get the learned parameters of the approximating distribution?

During the stochastic optimization, draws from the approximation are used to compute the gradient and the ELBO, but they are not stored. After the stochastic optimization, draws from the approximation are stored, but the parameters of the approximation are not currently stored.
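Since only the draws are stored, one possible workaround is to estimate the parameters of the approximation from those draws. The sketch below assumes a mean-field normal approximation on the unconstrained scale, so a positive parameter is taken to the log scale before computing the mean and standard deviation. The helper name `meanfield_params` is hypothetical, and the draws are simulated stand-ins for real output rather than actual Stan results.

```python
import math
import random
import statistics


def meanfield_params(draws, to_unconstrained=None):
    """Estimate (mean, sd) of a normal approximation from stored draws.

    to_unconstrained: optional function mapping each draw to the scale on
    which the approximation is assumed normal (e.g. math.log for a
    positive parameter).
    """
    if to_unconstrained is not None:
        draws = [to_unconstrained(x) for x in draws]
    return statistics.fmean(draws), statistics.stdev(draws)


# Simulated stand-ins for stored draws (assumed values, not real output):
# mu is unconstrained, sigma is positive and normal on the log scale.
rng = random.Random(1)
mu_draws = [rng.gauss(1.5, 0.3) for _ in range(5000)]
sigma_draws = [math.exp(rng.gauss(-0.2, 0.4)) for _ in range(5000)]

mu_hat = meanfield_params(mu_draws)
log_sigma_hat = meanfield_params(sigma_draws, to_unconstrained=math.log)

print("mu approximation (mean, sd):        ", mu_hat)
print("log(sigma) approximation (mean, sd):", log_sigma_hat)
```

The estimates carry Monte Carlo error from the finite number of stored draws, and with a full-rank approximation the covariances would need to be estimated as well.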

Thank you so much!

One more question. I know that for hierarchical models it is better to reparameterize for HMC. But for VI, since we sample directly from the approximating distribution, my understanding is that reparameterization is not needed, or helps only a little. Am I right?

Reparameterization is even more important for VI: the approximating distribution is (often) a normal distribution, so we want the true posterior to be close to normal.
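A small sketch of why this matters, using a funnel-shaped hierarchical prior as a stand-in (an assumption for illustration, not a posterior from a real model): with log_tau ~ N(0, 1) and theta | log_tau ~ N(0, exp(log_tau)), the centered coordinate theta is far from normal, while the non-centered coordinate theta_raw = theta / exp(log_tau) is exactly standard normal. Kurtosis is used here as a crude check of normality (a normal distribution has kurtosis 3).

```python
import math
import random
import statistics


def kurtosis(xs):
    """Sample kurtosis m4 / m2^2; equals 3 for a normal distribution."""
    m = statistics.fmean(xs)
    m2 = statistics.fmean([(x - m) ** 2 for x in xs])
    m4 = statistics.fmean([(x - m) ** 4 for x in xs])
    return m4 / m2**2


rng = random.Random(0)
n = 20000

# Non-centered coordinates: both are independent standard normals.
log_tau = [rng.gauss(0.0, 1.0) for _ in range(n)]
theta_raw = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Centered coordinate: theta = tau * theta_raw, with tau = exp(log_tau).
theta = [math.exp(lt) * z for lt, z in zip(log_tau, theta_raw)]

k_centered = kurtosis(theta)
k_noncentered = kurtosis(theta_raw)

print("kurtosis of theta (centered):        ", k_centered)
print("kurtosis of theta_raw (non-centered):", k_noncentered)
```

A normal approximation can match theta_raw well but not the heavy-tailed theta. With strongly informative data the picture can change, so which parameterization is closer to normal is worth checking for the model at hand.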

Thank you very much!