Hello,
I am using variational Bayes more and more, as it gives me good results for a fraction of the execution time (I often develop tools for third parties, who are generally really impatient).
For a hierarchical model of mine I have a proportion parameter (0–1) as one of the final layers. As expected, I get a bias in the mean and CI when the mode of the proportion parameter gets close to 0 or 1 (I assume because the posterior gets more and more skewed).
The plot shows 3 separate runs (1, 2, 3); each dot is the mean (above) or CI (below) of an element of an array of simplexes.
I have not tried to model the bias yet, although it appears to have a sigmoidal shape. If I observe that the pattern is similar across test data sets, I could model it, allowing the use of VB for this model.
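As a toy illustration of the kind of correction I have in mind, here is a sketch assuming the bias is roughly linear on the logit scale (purely synthetic numbers standing in for my data, and the 0.8 shrinkage factor is made up for the example):

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log1p(-p)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic stand-in: pretend VB shrinks estimates linearly on the
# logit scale, which looks sigmoidal on the probability scale.
true_p = np.linspace(0.02, 0.98, 25)        # "MCMC" reference values
vb_p = expit(0.8 * logit(true_p) - 0.1)     # hypothetical biased VB means

# Calibrate on test data where the MCMC reference is available...
slope, intercept = np.polyfit(logit(vb_p), logit(true_p), 1)

# ...then map new VB estimates back onto the probability scale.
corrected = expit(slope * logit(vb_p) + intercept)
```

Whether a single (slope, intercept) pair transfers across data sets is exactly what I would need to check.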
What do you think? Is it something that is heavily discouraged, or could it be “fair enough”?
P.S. Should I instead model the proportions as a sum-to-zero unbounded array that I can transform to a simplex?
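Something like this transform, I mean (a plain numpy sketch of the softmax mapping; in the actual model it would be a transformed parameter, and one could either pin the last element to 0 or softly constrain the sum to 0 for identifiability):

```python
import numpy as np

def softmax_simplex(z):
    """Map an unbounded vector z to a point on the simplex by appending
    an implicit 0 (pinning the last element) and applying softmax."""
    z = np.append(z, 0.0)           # fix the last element for identifiability
    e = np.exp(z - z.max())         # subtract max for numerical stability
    return e / e.sum()

p = softmax_simplex(np.array([0.5, -1.2, 2.0]))
```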
Thanks.
Could you tell a bit more about what kind of models and amount of data you have (no need to go into specific details of the data if that is confidential) and the timing differences? What ADVI options do you use? Have you used the ADVI diagnostics described in Yes, but Did It Work?: Evaluating Variational Inference? So far the Stan team doesn't know many examples where VB would be useful compared to MCMC or a normal approximation at the mode, so it would be great if we can learn more.
This behavior can also be explained by VB underestimating the posterior variance. Are you using meanfield or fullrank? Meanfield ignores posterior correlations and tends to underestimate the posterior variance more.
You could try an importance sampling correction, but based on the figures I guess your VB approximation is too far from the true posterior for it to work well.
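For illustration, here is a minimal sketch of a self-normalised importance sampling correction (plain IS, without the Pareto smoothing step from the paper; a synthetic Gamma target stands in for a skewed posterior, and the off-centre Gaussian stands in for the VB approximation):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Skewed "true posterior": Gamma(shape=3, rate=1), true mean = 3.
def log_p(x):
    out = np.full_like(x, -np.inf)
    pos = x > 0
    out[pos] = 2.0 * np.log(x[pos]) - x[pos] - math.lgamma(3.0)
    return out

# Gaussian stand-in for the VB approximation, off-centre at 2.5.
mu, sigma = 2.5, 2.0
draws = rng.normal(mu, sigma, size=50_000)
log_q = -0.5 * ((draws - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

# Self-normalised importance weights: w ∝ p(x) / q(x).
log_w = log_p(draws) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()

naive_mean = draws.mean()           # biased: close to the approximation's mean
is_mean = (w * draws).sum()         # corrected towards the true mean of 3
```

When the approximation is far from the posterior, the weights become extremely skewed and this estimate breaks down, which is what a large Pareto k diagnoses.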
Hello Aki,
sorry for the delay, I was focused on other things for a few days.
The models are based on negative binomials; one of them goes beyond inferring the means and uses them for signal deconvolution.

(1) modelling of negative binomial gene transcript abundance: https://github.com/stemangiola/ppcSeq

(2) deconvolution algorithm for mixed transcriptional profiles (from mixed cell types): https://github.com/stemangiola/ARMET/tree/fullbayesian
For both I get complaints from VB, with diagnostics such as:
Chain 1: 5500 2543634.749 0.029 0.008
Chain 1: 5600 2540532.215 0.020 0.007
Chain 1: 5700 2537719.434 0.015 0.007
Chain 1: 5800 2535108.193 0.011 0.006
Chain 1: 5900 2532798.265 0.009 0.006
Chain 1: 6000 2530597.443 0.008 0.006
Chain 1: 6100 2528624.935 0.007 0.005
Chain 1: 6200 2526805.279 0.007 0.005 MEDIAN ELBO CONVERGED
Chain 1:
Chain 1: Drawing a sample of size 500 from the approximate posterior...
Chain 1: COMPLETED.
Warning: Pareto k diagnostic value is 56.88.
However, for (1) I get the same final results as with NUTS (in terms of post-processing of the results, not necessarily identical posterior distributions), but 7 times faster. For (2) I get the differences plotted at the beginning of this thread, which would need some sort of correction or remodelling, for example using the softmax of unbounded reals (soft-constrained to sum to 0) instead of straight proportions.
On the contrary, what I see from the plots above is that VB overestimates the standard deviation of the simplex (0–1) parameters and, as expected, does not model parameters close to 0 or 1 well. Am I missing something?
I am using the default meanfield.
- The Pareto k estimate is 56.88, i.e. the ADVI approximation is far from the true posterior and importance sampling will not help.
- It seems ADVI is stopping early. The default tolerance is too high, as demonstrated in the Did It Work paper. You can try a smaller tolerance and more iterations. It's possible you will get a smaller Pareto k, but it's also likely that you will lose the speed benefit.
But you had to run MCMC to know that, so there is no speed benefit. Of course, if you have many similar datasets, you may hope that ADVI is still good for those datasets for which you didn't run MCMC. I know this approach is used by some when they need to do a daily update of the model given new data.
You can think of it as an ad hoc dimensionality reduction to 1D, with the probabilistic model then calibrated separately. This can sometimes be fine if you only care about predictions, but it makes it more difficult to understand the model and how to improve it.
You are right that this could be explained by overestimation of the sd, but it could also be caused by the extra regularization from ADVI stopping early. Did you compare the posterior marginals?
I would heavily discourage using ADVI when it fails; use working MCMC instead, as it makes your life easier afterwards. If you have a specific application with repeated similar datasets, and you spend enough effort to check that your ad hoc model combined with post-calibration produces the needed accuracy from one dataset to the next, then it might be “fair enough”.