I have been playing with VI for a while using a simple logistic regression test case. What I find is that NUTS always gives very accurate results as long as you have enough input data. With VI, on the other hand, you get good results on the intercept and the larger coefficients but an overestimate of the smaller coefficients. The same is actually true with NUTS if insufficient data is given.
I have been wondering whether there is some kind of reparameterization trick that would fix this small-coefficient problem, but I have found nothing out there. Is inaccuracy simply an unavoidable fact when using VI?
Here are a few points related to various things you mentioned:
I think the biggest problem with VI as it currently exists is the lack of reliable diagnostics (you have to fit NUTS, as you did, to be able to assess the reliability of the approximation). There is research being done on this front, e.g., https://arxiv.org/abs/1802.02538, with implementations for Stan underway (or already available).
Especially, but not only, when you’re doing mean-field ADVI, one thing that can help is to work with a (possibly scaled) QR decomposition of the matrix of predictors rather than the predictor variables themselves. For more on that see http://mc-stan.org/rstanarm/reference/QR-argument.html and http://mc-stan.org/users/documentation/case-studies/qr_regression.html. Those describe slightly different implementations (rstanarm does additional scaling) and don’t mention VI, but the approach can be applied there too.
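To make that reparameterization concrete, here is a minimal numpy sketch of the scaled thin QR decomposition described in the case study. The predictor matrix `X` and the coefficient vector `theta` are made-up stand-ins; the point is the round trip `X @ beta == Q_star @ theta`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))  # hypothetical predictor matrix

n = X.shape[0]
Q, R = np.linalg.qr(X)             # thin QR: X = Q R
# scale as in the case study so Q_star's columns have a unit-ish scale
Q_star = Q * np.sqrt(n - 1)
R_star = R / np.sqrt(n - 1)

# Fit the model on Q_star with coefficients theta, then recover
# beta = R_star^{-1} theta. Check the round trip on an arbitrary theta:
theta = np.array([0.3, -1.2, 0.7])
beta = np.linalg.solve(R_star, theta)
assert np.allclose(X @ beta, Q_star @ theta)
```

Because the columns of `Q_star` are orthogonal, the posterior over `theta` is much closer to independent across dimensions, which is exactly the situation a mean-field approximation assumes.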
It’s not just bias in point estimates that matters, but also how well the approximation estimates uncertainties, if you want to be able to treat the approximation as a posterior distribution you can average over.
Usually VI is pretty good at finding the mode, tending to prioritize that over calibrating uncertainty and tails, but it sounds like that is not the case here. Here’s my attempt at a geometric narrative. The likelihood contributions in a logistic regression problem look like soft half-planes, so when they are multiplied by a Gaussian prior you get posteriors resembling a (soft-)truncated Gaussian. If the truncated domain is triangular but the mode sits in a corner, enough mass might get pushed toward the center of the truncated domain to make VI place itself there instead.
Can you get overestimation with only two parameters? It might be cool to visualize the unnormalized posterior contours vs the mean-field solution. Otherwise you could try the same with a pair plot.
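As a starting point for that kind of plot, here is a hedged sketch that evaluates the unnormalized log posterior of a two-parameter logistic regression on a grid. The simulated data, the "true" intercept 0.5 and slope 0.1, and the N(0, 5) priors are all made-up choices for illustration; the grid `Z` can then be handed to a contour plotter:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.standard_normal(n)
# hypothetical ground truth: intercept 0.5, small slope 0.1
y = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + 0.1 * x)))

def log_post(alpha, beta):
    """Unnormalized log posterior: Bernoulli-logit likelihood + N(0, 5) priors."""
    eta = alpha + beta * x
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - (alpha**2 + beta**2) / (2.0 * 5.0**2)

grid = np.linspace(-2.0, 2.0, 100)
A, B = np.meshgrid(grid, grid)
Z = np.vectorize(log_post)(A, B)
# e.g. with matplotlib: plt.contour(A, B, Z - Z.max(), levels=20)
```

Overlaying the mean-field Gaussian (an ellipse centered at the ADVI means with axis-aligned standard deviations) on those contours should make any displacement of the approximation away from the mode visible directly.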
I am looking over all the ideas contributed here. If I understand correctly, the QR decomposition comes in when there is correlation among the predictor variables. However, in my test case, all dimensions of the test data are generated independently, so I am not foreseeing much mileage from QR.
I will attempt some plotting and see if I find something interesting to post.
VI tries to find the posterior mean, not the mode. These are often far away from one another when the posterior is asymmetric.
Exactly—that’s intentional and a consequence of trying to return the mean.
If you used the forward-oriented KL(p||q), then sure, but that is almost never what is meant (colloquially) when VI is discussed. The KL(q||p) variational objective definitely tends to focus on posterior modes rather than means. Just consider the classic two-component Gaussian mixture: a Gaussian q-model will almost invariably end up capturing just one of the two modes.
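That mode-seeking behavior is easy to check numerically. Below is a sketch (not any particular VI implementation) fitting a Gaussian q to a hypothetical 0.5·N(−3, 1) + 0.5·N(3, 1) mixture by Monte Carlo minimization of KL(q||p) with fixed reparameterization noise:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
eps = rng.standard_normal(2000)  # fixed noise for a reparameterized KL estimate

def log_p(x):
    # two-component mixture 0.5 N(-3, 1) + 0.5 N(3, 1); its overall mean is 0
    return np.logaddexp(norm.logpdf(x, -3, 1), norm.logpdf(x, 3, 1)) + np.log(0.5)

def reverse_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    x = mu + sigma * eps  # samples from q = N(mu, sigma^2)
    return np.mean(norm.logpdf(x, mu, sigma) - log_p(x))

# start slightly right of center; the fit collapses onto the +3 mode
res = minimize(reverse_kl, x0=[0.5, 0.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

The fitted mean lands near +3 with a standard deviation near 1, i.e., q matches one component and ignores the true posterior mean of 0 entirely.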
That’s what we’ve always meant by “variational Bayes” (as opposed to other variational methods, which use KL differently).
I wasn’t even thinking about multimodality. Once that enters the picture, we can no longer even perform reasonable approximate Bayesian inference. The VI solutions are terrible, concentrating around a single mode precisely because of the orientation of the KL divergence used. EP, another variational method, would go the other way; there’s a great picture of this in Bishop’s book for a correlated bivariate normal, but the idea’s the same. Basically, even approximate Bayes is intractable with combinatorial multimodality. Whatever people do in practice (Gibbs sampling the discrete parameters, marginalizing discrete parameters for HMC, variational inference, etc.) may be useful for data exploration, for feature extraction with something like LDA, or for prediction in some settings, but it’s not going to be anything like the answer you’d get from full Bayesian inference.
I was thinking more about unimodal but asymmetric posteriors. Say, a beta(10, 15): there the VI solution is going to be closer to the posterior mean. There’s another good diagram of this in Bishop’s book.
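For what it’s worth, the beta(10, 15) case can be checked with the same reverse-KL recipe (again a Monte Carlo sketch, not any particular VI implementation). The mean is 10/25 = 0.4 and the mode is 9/23 ≈ 0.391, and the fitted Gaussian center lands in that neighborhood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import beta, norm

a, b = 10, 15
post_mean = a / (a + b)            # 0.4
post_mode = (a - 1) / (a + b - 2)  # 9/23, about 0.391

rng = np.random.default_rng(0)
eps = rng.standard_normal(2000)

def reverse_kl(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    x = mu + sigma * eps
    # samples outside (0, 1) have log density -inf, so the objective blows up there
    return np.mean(norm.logpdf(x, mu, sigma) - beta.logpdf(x, a, b))

res = minimize(reverse_kl, x0=[0.4, np.log(0.05)], method="Nelder-Mead")
mu_hat = res.x[0]
```

Since the mean and mode differ by less than 0.01 here, the fit can’t cleanly separate the two; the point is just that for a mildly skewed unimodal posterior the reverse-KL Gaussian sits near both, unlike in the multimodal case.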