Question on variational inference

techniques

#1

i Have been playing with VI for a while using a simple logistic regression test case. What I find is you can get very accurate results always using NUTS as long as you have enough input data. For VI on the other hand you get good results on the intercept and larger coefficients but an overestimate on smaller coefficients. The same is true actually with NUTS if insufficient data is given.

I have been wondering if there is some kind of reparamterization trick that will fix this small coefficient problem, but found nothing out there. Is inaccuracy simply an unavoidable fact when using VI?


#2

Here are a few points related to various things you mentioned:

  • I think the biggest problem with VI as it currently exists is the lack of reliable diagnostics (you have to fit NUTS, like you did, to be able to asses the reliability of the approximation). There is research being done on this front, e.g., https://arxiv.org/abs/1802.02538 and implementations for Stan underway soon (or already).

  • Especially but not only when you’re doing meanfield advi, one thing that can help is to work with a (possibly scaled) QR decomposition of the matrix of predictors rather than the predictor variables themselves. For more on that see http://mc-stan.org/rstanarm/reference/QR-argument.html and http://mc-stan.org/users/documentation/case-studies/qr_regression.html. Those describe slightly different implementations (rstanarm does additional scaling) and don’t mention VI, but it can be applied there too.

  • It’s not just bias in point estimates that matters but also how well the approximation is able to estimate uncertainties if you want to be able to treat the approximation as a posterior distribution you can average over


#3

Usually VI is pretty good at finding the mode, tending to prioritize that over uncertainty calibration/tails, but it sounds like that is not the case here. Here’s my swing at a geometric narrative. The likelihoods of the logistic regression problem are similar to half-planes, so when multiplied by a Gaussian prior you get posteriors that are not dissimilar to a (soft-)truncated Gaussian, by my estimate. If the truncated domain is triangular, but the mode is in the corner, enough mass might get pushed to the center of the truncated domain to make VI place itself there instead.
Can you get overestimation with only two parameters? It might be cool to visualize the unnormalized posterior contours vs the mean-field solution. Otherwise you could try the same with a pair plot.


#4

I am looking over all the ideas contributed here. If I understand correctly, QR decomposition comes in when there is correlation among the predictor variables. However, in my test case, all dimensions of the test data are independently generated. So I am foreseeing much mileage from QR.

I will attempt some plotting and see if I find something interesting to post.