Fast Black-box Variational Inference through Stochastic Trust-Region Optimization
Jeffrey Regier, Michael I. Jordan, Jon McAuliffe
(Submitted on 7 Jun 2017 (v1), last revised 5 Nov 2017 (this version, v2))
We introduce TrustVI, a fast second-order algorithm for black-box variational inference based on trust-region optimization and the reparameterization trick. At each iteration, TrustVI proposes and assesses a step based on minibatches of draws from the variational distribution. The algorithm provably converges to a stationary point. We implemented TrustVI in the Stan framework and compared it to two alternatives: Automatic Differentiation Variational Inference (ADVI) and Hessian-free Stochastic Gradient Variational Inference (HFSGVI). The former is based on stochastic first-order optimization. The latter uses second-order information, but lacks convergence guarantees. TrustVI typically converged at least one order of magnitude faster than ADVI, demonstrating the value of stochastic second-order information. TrustVI often found substantially better variational distributions than HFSGVI, demonstrating that our convergence theory can matter in practice.
Quality of optimal points. For 126 of the 183 models (69%), on sets of five runs, the median optimal
values found by ADVI and TrustVI did not differ substantively. For 51 models (28%), TrustVI found
better optimal values than ADVI. For 6 models (3%), ADVI found better optimal values than TrustVI.
Hard to tell if that was sarcastic or not, so I’ll provide a straight reply. If ADVI and TrustVI have different objectives, they’ll reach different optima, one of which may be a better estimate than the other.
Dan Simpson and Michael Betancourt had some comments on this when it first came out, but they were in a private email thread the summary of which is that this isn’t too surprising. Dan pointed out trust region methods will help condition Newton steps, so it’s not too surprising from an ill-conditioned optimization perspective. The real problem is what we do with VI estimates given that they’re so far off in most cases where the (unconstrained) posterior isn’t approximately Gaussian. Michael also points out that all the proofs rely on Gaussian i.i.d. assumptions, which won’t necessarily be satisfied in the kinds of smaller data problems we deal with regularly—that’s the theory behind what we saw in practice. Michael also pointed out that his SoftAbs RHMC (see the arXiv paper) is itself like a trust-region method.