This came up during a recent discussion and I didn’t have a good answer.

With colleagues, I’ve been working (on and off, granted) on an integrated Laplace approximation. Given a Latent Gaussian model, with hyperparameter, \phi, latent Gaussian variable, \theta, and data, y, we compute a Laplace approximation of the log conditional distribution \log p(\theta \mid y, \phi).

In principle, we could use another approximation technique. For example variational inference using as our variational family a Gaussian with a non-diagonal covariance matrix. This however seems like it might be a more difficult optimization problem, since we need to optimize for the mean *and* the covariance matrix of the approximating Gaussian. This means solving for \mathcal O(n^2), rather than \mathcal O(n) unknowns. To run Newton’s method, I also need to calculate Hessians of the ELBO with respect to all terms; alternatively, I could use L-BFGS which is only gradient based.

What about differentiating \log p(\theta \mid y, \phi)? I would do this by applying the implicit function theorem to the ELBO, again for \mathcal O(n^2) terms, rather than \mathcal O(n). The only potential benefit I see is that I might be able to do this without computing third-order derivatives but I’d have to work out the details to confirm this.

So, the Laplace approximation allows me to compute a dense Gaussian approximation by solving an optimization problem over only estimates of the mean, while the covariance is obtained separately by computing the curvature. In contrast, VI poses an optimization problem over a larger space of unknowns.

Have there been experiments comparing the two? Happy to hear further thoughts on the subject.