Hello, I wanted to investigate how well the ADVI posterior matches the NUTS samples for a simple correlated Gaussian. However I found that ADVI mean-field couldn’t approximate the posterior at all and ADVI full-rank generated a reasonable looking correlated posterior but not the same one as NUTS. I am surprised that ADVI didn’t manage such a simple example but also suspicious I’m doing something wrong! In any case I thought I’d post my code here to see if anyone could correct it or explain what is going on. All the code and Rmarkdown output is here: https://gist.github.com/JohnReid/d76159ec36e86e960658c61689baeae2
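In case it saves a click, the calls are essentially the following (just a sketch using rstan; the file name and the 0.9 correlation are placeholders here, the gist has the exact code, data and seeds):

    library(rstan)

    # Placeholder names: the gist has the real file and settings.
    Sigma <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
    sm <- stan_model("correlated_gaussian.stan")

    fit_nuts <- sampling(sm, data = list(Sigma = Sigma), chains = 4, iter = 2000)
    fit_mf   <- vb(sm, data = list(Sigma = Sigma), algorithm = "meanfield")
    fit_fr   <- vb(sm, data = list(Sigma = Sigma), algorithm = "fullrank")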
Any help appreciated!
John.
Nope. This is likely true. We’re working on robustifying ADVI. But variational inference in general is not a particularly good tool for doing full posterior inference (it has a whole pile of known problems). It might get you a good point estimate sometimes, though.
I’m not sure why people are surprised by this. For mean-field Gaussian, your approximating family is a product of Gaussians on the two axes, which, for example, can’t approximate a narrow Gaussian concentrated around the line y=x.
For the full-rank one, I’d expect it to be in the correct place, but the covariance matrix to be too “concentrated”. This is because the KL divergence is an asymmetric measure of “distance” between two probability distributions, and in the direction that it is used for VI, it penalises approximations that are too diffuse far more fiercely than approximations that are too concentrated. This leads to VB methods systematically underestimating posterior variation.
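To spell the asymmetry out: VI minimises

$$\mathrm{KL}(q \,\|\, p) = \mathbb{E}_{q}\!\left[\log q(\theta) - \log p(\theta \mid y)\right],$$

which blows up wherever $q$ puts appreciable mass in regions where the posterior $p$ has essentially none, but is quite forgiving of $q$ ignoring regions where $p$ does have mass. So the optimum tends to sit inside the posterior rather than cover it.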
tl;dr: VB doesn’t really work, but might get you a central point quickly. Sometimes.
Somebody shoot me, I’ve turned into @betanalpha. [although I was of this opinion long before I knew him]
OK, what I meant was that it was surprising that it couldn’t get anywhere near the right posterior, not that it wasn’t exactly the correct posterior. I expected the marginal posteriors for $y_1$ and $y_2$ to be more or less correct but there to be no correlation in the joint posterior. What I got was all sorts of weird and wonderful posteriors (depending on the seed), none of which were anywhere near the NUTS samples.
The full-rank approximation looks good, but when compared to the NUTS samples the correlation is incorrect, which I’m afraid to say also surprised me. It didn’t look like an underestimation of variation. I expected the full-rank algorithm to fit the posterior well, as there are no constrained variables, so a full-rank Gaussian in the transformed parameter space would map exactly to one in the parameter space.
That might be true if you did an exact minimisation, but the algorithm that ADVI uses (plus the fact that it’s not exactly the most stable implementation ever) makes this fairly unsurprising.
But essentially, that the ADVI approximation to the posterior isn’t anywhere near the NUTS posterior is not surprising. ADVI is an ok implementation of a flawed idea that’s been marketed well.
@anon75146577 sad to hear that’s your view. I guess this means that automatic, fast and reliable Bayesian inference for big data or more complex models is going to be hard to come by for a while yet.
I had high hopes for ADVI for models I couldn’t fit in a reasonable amount of time with NUTS, but alas even simple models (eg the Rats one I posted on the old forums) don’t work well.
Stochastic optimisation with minibatching seems like a more fruitful effort, even if it only gives point estimates, but neither stochastic optimisation nor minibatching is supported in Stan yet, so I may end up writing my own code.
For what my 2c is worth, I would rather have a stochastic optimiser in Stan that works than an ADVI algorithm where I don’t know whether it works.
Well, it depends on what you mean by “works”.
ADVI itself isn’t great and we are genuinely working to make it better. But there’s a natural barrier (the VI approximation) that is likely to prevent full posterior exploration. These problems have been known for years (I am saying nothing new, novel, or insightful, and nothing that every single person who’s ever touched VI shouldn’t know already).
For complex models, it’s unlikely that this barrier will be fixed any time soon (the key here is complex).
For simple models with a lot of data, you typically end up with a posterior that looks like a very concentrated gaussian, so these sorts of methods can work. Even if they misstate the uncertainty, the point estimate may be enough. But I have serious doubts in most cases about simple models being fitted to enormous data sets. Basically, I doubt people are thinking about the problems with the data gathering mechanism in these cases. I usually expect models to get more complex as you get more data, rather than staying the same.
As for stochastic optimisation, that’s not that far from ADVI. It’s optimistic to expect VI to give much more than a point estimate anyway, so it’s really just a specific optimisation method.
If you want optimisation, Stan has BFGS which works quite well on big data (there was an example somewhere at some point of some big company [maybe facebook] using it to do something). The problem with optimisation methods is that for a hierarchical model you really do not want the joint mode. You want conditional modes. At present, Stan doesn’t let you do that, but work is underway to relax those restrictions.
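For example, from R it’s just (a sketch; the model file and data list are placeholders):

    library(rstan)

    sm  <- stan_model("my_model.stan")   # placeholder model
    opt <- optimizing(sm, data = my_data, algorithm = "LBFGS", hessian = TRUE)

    opt$par      # point estimate (the joint mode)
    opt$hessian  # curvature at the mode, on the unconstrained scale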
(Incidentally, I strongly agree that a well-deployed point estimator is better than nothing, and in a lot of these cases as good as you can get with a reasonable computational budget. In my field [spatial statistics], we call it “pragmatic Bayes”, where you build a Bayesian model and compute something from it to the best of your ability under your particular constraints.)
This paper, for instance, gives impressive results for a Bayesian posterior approximation using SGD, and is better than ADVI: https://arxiv.org/abs/1704.04289
Yeah. But it’s easy to show impressive results in a paper. The good thing about ADVI is that people can actually interrogate it themselves. I believe this is the idea with Edward - make a framework where it’s easier for people to try out new algorithms.
I think minimal standards for this sort of research are:
- Plot out your scope realistically (ie situations where it works for sure, situations where it doesn’t work for sure, situations where you’re not sure)
- Have an implementation that other people can use to check these claims (and generally explore the algorithm)
- Devise some sort of “after the fact” checking procedure to indicate if your algorithm worked on the problem at hand.
That this isn’t the minimum standard for the field is frankly puzzling. We are not supposed to be selling used cars.
This is bang on what I’m after. Although I used to be in research I’m now in business consulting. My usual constraints are computing budget, but more importantly time. I have a fixed amount of time to solve a problem. My preference is for something that works reliably, even if approximate, over something exact that I can’t do within the time frame for solving the problem (typically weeks to a small number of months).
With AWS I can throw a lot of computing power at the problem, but even that doesn’t always help with the long model warmup times. Even with eg 36 CPUs, I’m still constrained by how fast the first few hundred iterations are.
I’m not fastidious about having exact inference intervals (although of course that’s nice). I regularly build hierarchical models, and as you say you do not want the joint mode.
The thing that I love about Stan is that it lets me build and test models really fast - orders of magnitude faster and more reliably than if I had to build the code in R. It’s succinct and reliable.
If I had something that gave me the approximate posterior mode and a rough estimate of the uncertainty, whilst integrating out the hierarchical parameters (even approximately), that would be fantastic.
The other thing with BFGS is that you can get stuck in local optima. NUTS doesn’t really suffer from this problem as much. I know that restarts (or starts from different initial values) can help here, but SGD deals with this automatically.
I don’t want to seem overly negative here - I know the huge amount of work that has gone into Stan and am grateful because of the impact it has let me have… I’m just always hopeful for the next big leap =)
SGD will also have problems with multimodality (the noise isn’t like a full annealing thing: it will bump you out of small modes, but if there is a well-separated one it won’t find it).
And there’s no problem with negativity. Stan doesn’t do this sort of model well yet. People are working on it (just like we’re working to make ADVI as good as it can be).
And feel free to roll up your sleeves and help if you’ve got an idea. There’s some behind-the-scenes work going on to work out what is needed for a contributed module, to reduce the strain on the core devs (it’s hard to integrate other people’s code unless they help!).
Just to be clear, this is the model:

    data {
      cov_matrix[2] Sigma;
    }
    transformed data {
      cholesky_factor_cov[2] L;
      L = cholesky_decompose(Sigma);
    }
    parameters {
      vector[2] y;
    }
    model {
      y ~ multi_normal_cholesky(rep_vector(0, 2), L);
    }
and here are the plots of the posterior samples:
ADVI mean-field
ADVI full-rank against NUTS
Those really do look broken. The rest of my comments assume “working” VI algorithms. Dan also provided many excellent comments.
Yes! Currently you can choose only two from “automatic, fast and reliable” for big data and generic models (and even then the “reliable” is very difficult).
For any VI using KL(q||p), if you use mean-field and the true posterior has correlation, you should expect the VI marginal posteriors to be close to the true conditional distributions, not the true marginals!
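To make that concrete for the bivariate example above: if the target is $\mathrm{N}(0, \Sigma)$ with unit marginal variances and correlation $\rho$, the optimal factorized Gaussian under $\mathrm{KL}(q\|p)$ matches the diagonal of the precision matrix, so each component has variance

$$\operatorname{Var}_q(y_i) = \big[(\Sigma^{-1})_{ii}\big]^{-1} = 1 - \rho^2 = \operatorname{Var}(y_1 \mid y_2),$$

i.e. the conditional variance, not the marginal variance of 1 (this is the standard factorized-Gaussian example, e.g. in Bishop’s PRML chapter on variational inference).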
It’s easier when you can choose to show successful model+data examples, while most Stan users are stuck with their specific data and analysis problem. We could fix ADVI easily in Stan by using the method described in this paper: Data Set Selection | Semantic Scholar
Thanks for the laugh, Aki. Unfortunately, what I’ve seen more often is that people run a bunch of models in order to do model selection and they don’t select the ones that don’t run. I think the folk theorem might help there, but in practice models that don’t run end up pruning whole branches of possible models that might fit better.
My 2c on ADVI: I’ve used it successfully for only one thing, finding OK starting points for NUTS to reduce the time that it takes to do warmup. Typically, if I start from a random ADVI draw, NUTS takes 5 to 20 fewer iterations to get itself into the typical set. When each iteration is potentially 10 to 20 minutes at treedepth 15 or whatever, that can be a big deal.
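For what it’s worth, a minimal sketch of that workflow with rstan, using the toy model above so the only parameter is y (for a real model you’d build the init list from all of its parameters; file name and data are placeholders):

    library(rstan)

    sm   <- stan_model("correlated_gaussian.stan")        # placeholder file name
    data <- list(Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))

    # Cheap ADVI run, then one random ADVI draw per chain as the NUTS inits.
    fit_vb  <- vb(sm, data = data, output_samples = 1000)
    y_draws <- rstan::extract(fit_vb)$y
    inits   <- lapply(1:4, function(chain)
      list(y = y_draws[sample(nrow(y_draws), 1), ]))

    fit <- sampling(sm, data = data, chains = 4, init = inits)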
I think you’re most likely to see the next big leap when we add GPU matrix operations and parallel likelihood calculations over the next year or so.
Also, it looks like there is a “variational HMC” that attempts to improve the computational efficiency of HMC sampling with large datasets while doing a better job than vanilla VI:
https://projecteuclid.org/download/pdfview_1/euclid.ba/1500948232
Also, there is a new subsampling approach that tries to deal with issues raised previously about this strategy:
Whether these are snake oil or not, I’ll let the experts decide (or maybe the market, as the saying goes…).
I remember reading that “energy preserving HMC” paper when it came out. I think it’s “as good as you can get” along that research direction without giving anything up (ie while staying unbiased). Requiring you keep the correct target distribution means you’re forced into a pseudomarginal algorithm, which is always quite slow.
I personally prefer just giving up on that and biasing your inference (inexact MCMC is great!). But it’s a classic speed/accuracy tradeoff.
I’ve not read the other one.
Considering that stochastic optimization / minimizing the KL divergence has asymptotic properties, would we expect ADVI to work better as N increases and the posterior tends towards normality?