I understand that Stan’s variational inference is approximate and still experimental, and that the MCMC sampler is recommended for final inferences. But sometimes you need to create predictive models on a regular basis and have constraints on the amount of computational resources you can use. In those situations one often falls back to using a MAP point estimate.
It seems likely to me that, notwithstanding its imperfections, running variational inference and then constructing a posterior predictive distribution is still better than using a MAP point estimate for prediction. “Better” in this context means that the cross-entropy tends to be lower, that is, H(p, q_v) < H(p, q_m), where
- p is the posterior predictive distribution obtained from an exact computation of the posterior,
- q_v is the posterior predictive distribution obtained using variational inference, and
- q_m is the predictive distribution using a MAP point estimate for the model parameters.
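To make the criterion concrete, here is a minimal sketch of the comparison in a toy case where everything is available in closed form: a normal model with known variance, unknown mean, and a flat prior. It is not a result about Stan's ADVI. The posterior mean value, the sample size, and the `vi_shrink` factor (standing in for a variational approximation that recovers only part of the posterior variance) are assumptions made purely for illustration. Cross-entropy is taken as H(p, q) = -E_p[log q(y)].

```python
import math

def gaussian_cross_entropy(mu_p, var_p, mu_q, var_q):
    """H(p, q) = -E_p[log q(y)] in nats, for univariate Gaussians p and q."""
    return 0.5 * math.log(2 * math.pi * var_q) + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q)

# Toy setup (all values hypothetical): y ~ Normal(mu, sigma2), flat prior on mu.
sigma2 = 1.0            # known observation variance
n = 5                   # number of observations
post_mean = 0.3         # sample mean; equals the posterior mean and the MAP here
post_var = sigma2 / n   # exact posterior variance of mu

# Assumption: the variational fit gets the mean right but captures only
# half of the posterior variance. This is an illustrative stand-in, not
# a property of any particular VI algorithm.
vi_shrink = 0.5

# Each predictive distribution is Gaussian, stored as (mean, variance).
p = (post_mean, sigma2 + post_var)                # exact posterior predictive
q_v = (post_mean, sigma2 + vi_shrink * post_var)  # predictive from the VI stand-in
q_m = (post_mean, sigma2)                         # MAP plug-in predictive

h_pp = gaussian_cross_entropy(*p, *p)    # entropy of p, the lower bound
h_pv = gaussian_cross_entropy(*p, *q_v)
h_pm = gaussian_cross_entropy(*p, *q_m)

print(f"H(p, p)   = {h_pp:.4f}")
print(f"H(p, q_v) = {h_pv:.4f}")
print(f"H(p, q_m) = {h_pm:.4f}")
```

In this toy case the plug-in predictive is too narrow, and any approximation that keeps the correct mean and recovers some of the missing parameter uncertainty (without overshooting the exact variance) has lower cross-entropy than the plug-in. Whether that carries over to non-Gaussian posteriors or to a variational fit whose mean is off is exactly the open question.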
Are there any theoretical or empirical results that could confirm or refute my supposition?