I am using variational Bayes more and more, as it gives me good results in a fraction of the execution time (I often develop tools for third parties, who are generally really impatient).
For one of my hierarchical models I have a proportion parameter (0-1) in one of the final layers. As expected, I get a bias in the mean and CI when the mode of the proportion parameter gets close to 0 or 1 (I assume because the posterior gets more and more skewed).
The plot shows 3 separate runs (1, 2, 3); each dot is the mean (above) or CI (below) of one element of an array of simplexes.
I have not tried to model the bias yet, although it appears to have a sigmoidal shape. If I observe that the pattern is similar across test data sets, I could model it, which would allow the use of VB for this model.
What do you think? Is it something that is heavily discouraged, or could it be “fair enough”?
P.S. Should I instead model the proportions as a sum-to-0 unbounded array that I can transform to a simplex?
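To make the P.S. concrete, here is a minimal sketch of the idea in Python (not Stan code, and not taken from the models in this thread): an unbounded vector constrained to sum to zero is mapped to a simplex with softmax. The values in `raw` are hypothetical and only there for illustration.

```python
import numpy as np

def softmax(x):
    """Map an unbounded vector to a simplex (entries in (0, 1), summing to 1)."""
    z = x - np.max(x)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical unconstrained values, centred so that they sum to zero.
raw = np.array([3.0, 0.5, -1.0, -2.5])
raw = raw - raw.mean()

props = softmax(raw)

# Softmax is invariant to adding a constant to every element, which is why
# some constraint (such as sum-to-zero) is needed for identifiability.
shifted = softmax(raw + 10.0)

print(props, props.sum())
print(np.allclose(props, shifted))
```

Proportions close to 0 or 1 then correspond to large negative or positive unbounded values, so the approximation is fitted on a scale without hard boundaries.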
Could you tell a bit more about what kind of models and how much data you have (no need to go into specific details of the data if that is confidential), and about the timing differences? Which ADVI options do you use? Have you used the ADVI diagnostics described in “Yes, but Did It Work?: Evaluating Variational Inference”? So far the Stan team doesn’t know many examples where VB would be useful compared to MCMC or a normal approximation at the mode, so it would be great if we could learn more.
This behavior can also be explained by VB underestimating the posterior variance. Are you using meanfield or fullrank? Meanfield ignores posterior correlations and tends to underestimate the posterior variance more.
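As a toy illustration of why meanfield tends to underestimate variance (an assumption-laden sketch, not the model from this thread): for a Gaussian target with precision matrix Lambda, the factorized Gaussian that minimizes KL(q || p) is known to have variances 1 / Lambda_ii, which are never larger than the true marginal variances. The correlation `rho` below is a made-up value.

```python
import numpy as np

rho = 0.9  # hypothetical posterior correlation between two parameters
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])   # true posterior covariance (toy example)
Lambda = np.linalg.inv(Sigma)    # precision matrix

true_var = np.diag(Sigma)               # true marginal variances
meanfield_var = 1.0 / np.diag(Lambda)   # optimal factorized-Gaussian variances

print("true marginal variance:", true_var)
print("meanfield variance:    ", meanfield_var)  # equals 1 - rho**2 here
```

The stronger the correlation, the larger the gap. A real posterior is not Gaussian and ADVI works on the unconstrained scale, so this only shows the direction of the effect.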
You could try an importance sampling correction, but based on the figures I guess your VB approximation is too far from the true posterior for that to work well.
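For readers unfamiliar with the idea, a minimal sketch of a plain self-normalized importance sampling correction on a toy problem follows. It is not the Pareto-smoothed version used for the diagnostic in the paper, and the target, the approximation, and `q_sd` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
q_sd = 0.8  # hypothetical approximation that is too narrow (target sd is 1)

# Draws from the approximation q = N(0, q_sd^2).
x = rng.normal(0.0, q_sd, size=n)

# Log densities up to additive constants (constants cancel on normalisation).
log_p = -0.5 * x**2                          # target: N(0, 1)
log_q = -0.5 * (x / q_sd)**2 - np.log(q_sd)  # approximation
log_w = log_p - log_q

# Self-normalized importance weights, stabilised by subtracting the max.
w = np.exp(log_w - log_w.max())
w /= w.sum()

raw_m2 = np.mean(x**2)     # second moment under the approximation
is_m2 = np.sum(w * x**2)   # importance-weighted second moment
ess = 1.0 / np.sum(w**2)   # effective sample size of the weighted draws

print(raw_m2, is_m2, ess)
```

When the approximation is far from the target, a handful of draws carry almost all the weight and the effective sample size collapses, which is the situation a very large Pareto k points to.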
Sorry for the delay, I was focused on other things for a few days.
The models are based on negative binomials; one of them goes further and uses the inferred means for signal deconvolution.
1. modelling of negative binomial gene transcript abundance https://github.com/stemangiola/ppcSeq
2. deconvolution algorithm for mixed transcriptional profiles (from mixed cell types) https://github.com/stemangiola/ARMET/tree/full-bayesian
For both I get complaints from VB in the diagnostics, such as:
Chain 1: 5500 -2543634.749 0.029 0.008
Chain 1: 5600 -2540532.215 0.020 0.007
Chain 1: 5700 -2537719.434 0.015 0.007
Chain 1: 5800 -2535108.193 0.011 0.006
Chain 1: 5900 -2532798.265 0.009 0.006
Chain 1: 6000 -2530597.443 0.008 0.006
Chain 1: 6100 -2528624.935 0.007 0.005
Chain 1: 6200 -2526805.279 0.007 0.005 MEDIAN ELBO CONVERGED
Chain 1: Drawing a sample of size 500 from the approximate posterior...
Chain 1: COMPLETED.
Warning: Pareto k diagnostic value is 56.88.
However, for (1) I get the same final results as with NUTS (in terms of post-processing of the results, not necessarily identical posterior distributions), but 7 times faster. For (2) I get the difference plotted at the beginning of this thread; that is without any correction or remodelling, such as using softmax of unbounded reals (soft-constrained to sum to 0) instead of straight proportions.
On the contrary, what I see from the plots above is that VB apparently overestimates the standard deviation of the simplex (0-1) parameters and (as expected) does not model well the parameters close to 0 or 1. Am I missing something?
I am using the default, meanfield.
- The Pareto-k estimate is 56.88 (the paper treats values above 0.7 as bad), i.e. the ADVI approximation is far from the true posterior and importance sampling will not help.
- It seems ADVI is stopping early. The default tolerance is too high, as demonstrated in the “Did It Work?” paper. You can try a smaller tolerance and more iterations. It’s possible you will get a smaller Pareto k, but it’s also likely that you will lose the speed benefit.
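The early-stopping point can be checked with simple arithmetic on the ELBO values printed in the log above. The sketch below is not Stan's convergence rule (which works on a running mean and median of relative changes); it only computes the change between consecutive logged evaluations.

```python
# Iterations and ELBO values copied from the log output earlier in the thread.
iters = [5500, 5600, 5700, 5800, 5900, 6000, 6100, 6200]
elbo = [-2543634.749, -2540532.215, -2537719.434, -2535108.193,
        -2532798.265, -2530597.443, -2528624.935, -2526805.279]

# Absolute and relative improvement between consecutive logged evaluations.
abs_change = [b - a for a, b in zip(elbo, elbo[1:])]
rel_change = [abs((b - a) / a) for a, b in zip(elbo, elbo[1:])]

for it, d, r in zip(iters[1:], abs_change, rel_change):
    print(f"iter {it}: ELBO change {d:+.1f} (relative {r:.5f})")
```

Because the ELBO is in the millions, a relative change below the tolerance still corresponds to an improvement of well over a thousand units per 100 iterations at the point where the run was declared converged.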
But you had to run MCMC to know that, so there is no speed benefit. Of course, if you have many similar datasets, you may hope that ADVI is also good for those datasets for which you didn’t run MCMC. I know this approach is used by some when they need to update the model daily given new data.
You can think of this as an ad-hoc dimensionality reduction to 1D, followed by separately fitting a probabilistic model that gives good calibration. This can sometimes be fine if you care only about predictions, but it makes it more difficult to understand the model and how to improve it.
You are right that this could be explained by overestimating the sd, but it could also be caused by extra regularization due to early stopping of ADVI. Did you compare the posterior marginals?
I would heavily discourage using ADVI when it fails, and recommend using working MCMC instead, as it makes your life easier afterwards. If you have a specific application with repeated similar datasets and spend enough effort checking that, from one dataset to another, your ad-hoc model combined with post-calibration produces the needed accuracy, then it might be “fair enough”.