Variational Bayes versus MAP for prediction

Kevin_Van_Horn · May 16, 2017, 4:52pm

I understand that Stan’s variational inference is an approximation, is experimental, etc., and it’s recommended that you use the MCMC sampler for final inferences. But sometimes you need to create predictive models on a regular basis and have constraints on the amount of computational resources you can use. In those situations one often falls back to using a MAP point estimate.

It seems likely to me that, notwithstanding its imperfections, running variational inference and then constructing a posterior predictive distribution is still better than using a MAP point estimate for prediction. “Better” in this context means that the cross-entropy tends to be lower, that is, H(p, qv) < H(p, qm), where

p is the posterior predictive distribution obtained from an exact computation of the posterior,
qv is the posterior predictive distribution obtained using variational inference, and
qm is the predictive distribution using a MAP point estimate for the model parameters.
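To make the comparison concrete, here is a minimal sketch on a toy Beta-Bernoulli model, where the exact posterior predictive is available in closed form. Everything in it is an assumption for illustration: the data (4 successes in 5 trials), the uniform prior, and the use of a Gaussian approximation on the logit scale, centered at the mode, as a stand-in for a variational fit (it is not Stan's ADVI). It says nothing about how the comparison comes out in realistic models.

```python
import math

def cross_entropy(p1, q1):
    """H(p, q) for two Bernoulli predictive distributions given P(y=1)."""
    return -(p1 * math.log(q1) + (1.0 - p1) * math.log(1.0 - q1))

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Assumed toy data: k successes in n trials, Beta(1, 1) prior on theta.
k, n = 4, 5
a, b = 1 + k, 1 + (n - k)          # exact posterior is Beta(a, b)

# p: the exact posterior predictive P(y=1) is the posterior mean.
p = a / (a + b)

# qm: plug in the MAP estimate (posterior mode on the constrained scale).
qm = (a - 1) / (a + b - 2)

# qv: Gaussian approximation in u = logit(theta).
# With the Jacobian, the log density in u is a*log(theta) + b*log(1 - theta),
# so the mode is theta = a / (a + b) and the curvature there is
# (a + b) * theta * (1 - theta).
theta_hat = a / (a + b)
mu = math.log(theta_hat / (1.0 - theta_hat))
var = 1.0 / ((a + b) * theta_hat * (1.0 - theta_hat))
sd = math.sqrt(var)

# Predictive P(y=1) = E[sigmoid(u)] under N(mu, var), by a midpoint rule.
steps = 4000
lo, hi = mu - 8.0 * sd, mu + 8.0 * sd
h = (hi - lo) / steps
qv = 0.0
for i in range(steps):
    u = lo + (i + 0.5) * h
    dens = math.exp(-0.5 * ((u - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    qv += sigmoid(u) * dens * h

h_pp = cross_entropy(p, p)    # lower bound: the entropy of p
h_pv = cross_entropy(p, qv)
h_pm = cross_entropy(p, qm)
print(f"H(p,p)={h_pp:.4f}  H(p,qv)={h_pv:.4f}  H(p,qm)={h_pm:.4f}")
```

In this particular toy setup the Gaussian-based predictive ends up closer to the exact one than the MAP plug-in does. Note that the MAP estimate depends on the parameterization, so the gap would change if the mode were taken on the logit scale instead.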

Are there any theoretical or empirical results that could confirm or refute my supposition?

Bob_Carpenter · May 17, 2017, 7:16pm

Not that I know of, but Andrew’s recruiting willing participants to try to evaluate just this question. We’ll have max marginal likelihood plus (importance adjusted?) Laplace approximations as one contender.

The main problem we’ve had with ADVI is convergence or just getting the wrong answer (not wrong in that the algorithm’s buggy but wrong in that the ADVI mean isn’t very close to the actual posterior mean as measured in true posterior standard deviations). Andrew et al. are finding that it helps enormously to have everything on the unit scale. They’re also finding that when the hierarchical parameters are wrong, the posterior predictive distribution can still be quite reasonable.

The other issue is uncertainty quantification. With MLE/MML and Laplace, you just use the inverse Hessian as estimated posterior covariance. In mean-field ADVI, the posterior covariance is assumed to be diagonal; we’ve had a hard time estimating the dense form.
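The covariance point can be illustrated on a toy target, a zero-mean correlated 2D Gaussian, where both approximations have closed forms. This sketch assumes the idealized optimum of a factorized Gaussian under KL(q || p), whose variances are the reciprocals of the diagonal of the precision matrix; it does not run Stan's stochastic ADVI, and the correlation value is made up.

```python
# Assumed toy target: 2D Gaussian with unit variances and correlation rho.
rho = 0.9
cov = [[1.0, rho], [rho, 1.0]]

# Precision matrix (the Hessian of the negative log density), via a 2x2 inverse.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
prec = [[cov[1][1] / det, -cov[0][1] / det],
        [-cov[1][0] / det, cov[0][0] / det]]

# Laplace: the covariance estimate is the inverse Hessian at the mode,
# which recovers the dense covariance exactly for a Gaussian target.
pdet = prec[0][0] * prec[1][1] - prec[0][1] * prec[1][0]
laplace_cov = [[prec[1][1] / pdet, -prec[0][1] / pdet],
               [-prec[1][0] / pdet, prec[0][0] / pdet]]

# Mean-field Gaussian minimizing KL(q || p): diagonal covariance with
# variances 1 / prec[i][i], so the correlation is lost.
meanfield_var = [1.0 / prec[0][0], 1.0 / prec[1][1]]

print("Laplace covariance:", laplace_cov)
print("Mean-field variances:", meanfield_var)   # each equals 1 - rho**2
print("True marginal variances:", [cov[0][0], cov[1][1]])
```

With strong correlation, the factorized approximation reports marginal variances well below the true ones, which is one way the diagonal assumption can distort uncertainty estimates.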

pswpswpsw · June 29, 2018, 9:24pm

Hi Bob,

It is good to know someone is finally evaluating the difference between Laplace and ADVI!

I have been asked this question for a long time and have not been able to provide any evidence.

Regarding your UQ point, I would like to note that the inverse Hessian is painful to compute in high dimensions for neural networks. People usually make an assumption that lets them approximate it using the Jacobian. On the other hand, full-rank ADVI may help here. It would be interesting to see what the results look like.

Bob_Carpenter · July 2, 2018, 5:37am

@yuling is doing the evaluation here; an arXiv paper just came out. It wasn’t explicitly Laplace vs. ADVI. Our ADVI can be very similar to Laplace because the ADVI approximation is a multivariate normal.

CMM2020 · December 4, 2019, 8:26pm

If we use full rank for the variational q(), is there any way to extract the Hessian, as one can with the optim function in R? On the vb function's documentation page, it appears that the values returned for a vb fit are the same as those returned by the HMC approach (the stan function).

Bob_Carpenter · December 7, 2019, 4:03am

No idea—@bgoodri will know.

The Hessian at the solution will be used for generating unconstrained draws, which will then be transformed back to the constrained scale.
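As a rough sketch of that procedure, assuming a single positive parameter with a log transform (both the mode and the Hessian value below are made up, and this is not Stan's implementation): draws are taken from a normal centered at the unconstrained mode with variance equal to the inverse Hessian, then mapped back with `exp`.

```python
import math
import random

random.seed(1)

# Hypothetical values on the unconstrained (log) scale.
mode_unc = math.log(2.0)   # mode of the unconstrained density
hessian = 25.0             # second derivative of the negative log density there

# Normal approximation: the variance is the inverse Hessian.
sd_unc = math.sqrt(1.0 / hessian)

# Draw on the unconstrained scale, then transform back to the constrained scale.
draws_unc = [random.gauss(mode_unc, sd_unc) for _ in range(1000)]
draws_con = [math.exp(u) for u in draws_unc]

mean_con = sum(draws_con) / len(draws_con)
print(f"min draw = {min(draws_con):.3f}, mean draw = {mean_con:.3f}")
```

Because the transform is applied after sampling, every constrained draw is positive by construction, even though the normal approximation itself has unbounded support.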
