I understand that Stan’s variational inference is an approximation, is experimental, etc., and it’s recommended that you use the MCMC sampler for final inferences. But sometimes you need to create predictive models on a regular basis and have constraints on the amount of computational resources you can use. In those situations one often falls back to using a MAP point estimate.
It seems likely to me that, notwithstanding its imperfections, running variational inference and then constructing a posterior predictive distribution is still better than using a MAP point estimate for prediction. “Better” in this context means that the cross-entropy tends to be lower, that is, H(p, q_v) < H(p, q_m), where
- p is the posterior predictive distribution obtained from an exact computation of the posterior,
- q_v is the posterior predictive distribution obtained using variational inference, and
- q_m is the predictive distribution using a MAP point estimate for the model parameters.
Are there any theoretical or empirical results that could confirm or refute my supposition?
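To make the question concrete, here is roughly the comparison I have in mind, sketched in R with rstan on a made-up normal model (the model, data, and settings are all just for illustration; a long MCMC run stands in for the exact posterior predictive p):

```r
library(rstan)

model_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  y ~ normal(mu, sigma);
}
"
sm <- stan_model(model_code = model_code)

set.seed(1)
y <- rnorm(50, 2, 1)
data_list <- list(N = length(y), y = y)

# Stand-in for p: posterior predictive draws from a long MCMC run
fit_mcmc <- sampling(sm, data = data_list, iter = 4000, chains = 4)
dp <- extract(fit_mcmc)
y_tilde <- rnorm(length(dp$mu), dp$mu, dp$sigma)[1:2000]

# q_v: predictive density from ADVI draws (Monte Carlo mixture over draws)
fit_vb <- vb(sm, data = data_list, output_samples = 4000)
dv <- extract(fit_vb)
log_qv <- sapply(y_tilde, function(yt) log(mean(dnorm(yt, dv$mu, dv$sigma))))

# q_m: plug-in predictive density at the MAP point estimate
fit_map <- optimizing(sm, data = data_list)
log_qm <- dnorm(y_tilde, fit_map$par["mu"], fit_map$par["sigma"], log = TRUE)

# Monte Carlo estimates of the cross-entropies (lower is better)
c(H_p_qv = -mean(log_qv), H_p_qm = -mean(log_qm))
```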
Not that I know of, but Andrew’s recruiting willing participants to try to evaluate just this question. We’ll have max marginal likelihood plus (importance adjusted?) Laplace approximations as one contender.
The main problem we’ve had with ADVI is convergence or just getting the wrong answer (not wrong in that the algorithm’s buggy but wrong in that the ADVI mean isn’t very close to the actual posterior mean as measured in true posterior standard deviations). Andrew et al. are finding that it helps enormously to have everything on the unit scale. They’re also finding that when the hierarchical parameters are wrong, the posterior predictive distribution can still be quite reasonable.
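For concreteness, one way to act on the unit-scale advice with the toy model sketched above (reusing sm and y from there; the rescaling itself is just my illustration, not a recipe from Andrew's group):

```r
# Standardize the data so the parameters ADVI sees are roughly unit scale,
# then map the draws back to the original scale of y.
y_mean <- mean(y); y_sd <- sd(y)
data_unit <- list(N = length(y), y = (y - y_mean) / y_sd)

fit_vb_unit <- vb(sm, data = data_unit, output_samples = 4000)
d <- extract(fit_vb_unit)

# Back-transform: if (y - y_mean) / y_sd ~ normal(mu, sigma),
# then y ~ normal(y_mean + y_sd * mu, y_sd * sigma).
mu_draws    <- y_mean + y_sd * d$mu
sigma_draws <- y_sd * d$sigma
```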
The other issue is uncertainty quantification. With MLE/MML and Laplace, you just use the inverse Hessian as estimated posterior covariance. In mean-field ADVI, the posterior covariance is assumed to be diagonal; we’ve had a hard time estimating the dense form.
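Concretely, for the toy model upthread (reusing sm and data_list), the two covariance estimates look like this; note the Hessian from optimizing is on the unconstrained scale, so the juxtaposition is only illustrative:

```r
# Laplace-style covariance: negative inverse Hessian of the log posterior at
# the mode (unconstrained parameterization, so the second coordinate is
# effectively log(sigma)).
fit_map <- optimizing(sm, data = data_list, hessian = TRUE)
laplace_cov <- solve(-fit_map$hessian)

# Mean-field ADVI: covariance estimated from the draws; the off-diagonal
# entries should be near zero because the mean-field family is diagonal.
fit_mf <- vb(sm, data = data_list, algorithm = "meanfield")
advi_cov <- cov(as.matrix(fit_mf)[, c("mu", "sigma")])

laplace_cov
advi_cov
```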
Hi Bob,
It is good to know someone is finally evaluating the difference between Laplace and ADVI!
I have been asked about this question for a long time and have not been able to provide any evidence.
For your UQ point, I would note that the inverse Hessian is painful to compute in high dimensions, e.g., for neural networks; people usually make simplifying assumptions and approximate it via the Jacobian. On the other hand, full-rank ADVI could help here. It will be interesting to see what the results look like.
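For the full-rank point, a tiny continuation of the sketch upthread (again reusing sm and data_list): full-rank ADVI fits a dense multivariate normal, so its draws give a dense covariance estimate.

```r
# Full-rank ADVI: the Gaussian approximation has a dense covariance,
# unlike the diagonal mean-field family.
fit_fr <- vb(sm, data = data_list, algorithm = "fullrank")
cov(as.matrix(fit_fr)[, c("mu", "sigma")])  # dense covariance estimate from the draws
```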
@yuling is doing the evaluation here; there was just an arXiv paper. It wasn't explicitly Laplace vs. ADVI, but our ADVI can be very similar to Laplace because the ADVI approximation is a multivariate normal.
If we use full rank for the variational q(), is there any way to extract the Hessian, like one can with the optim function in R? On the vb function's help page, it appears that the values returned for a vb object are the same as when one uses the HMC approach (the stan function).
No idea—@bgoodri will know.
The Hessian at the solution will be used for generating unconstrained draws, which will then be transformed back to the constrained scale.
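In rstan that looks roughly like the following, continuing the toy model from upthread (importance_resampling is optional; I include it only because importance-adjusted Laplace was mentioned above):

```r
# With draws > 0, optimizing() uses the mode and the Hessian to draw from a
# normal approximation on the unconstrained scale and returns the draws
# back-transformed to the constrained scale.
fit_lap <- optimizing(sm, data = data_list,
                      hessian = TRUE, draws = 2000,
                      importance_resampling = TRUE)

fit_lap$hessian            # Hessian at the mode (unconstrained scale)
head(fit_lap$theta_tilde)  # approximate posterior draws, constrained scale
```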