Technically, our optimizer finds penalized maximum likelihood estimates.
They’re not MAP estimates because we drop the Jacobian adjustment for the
implicit priors on the transformed (unconstrained) parameters, so the
optimum isn’t the mode of the density on either scale’s posterior. We’ll
probably be adding a proper MAP estimator soon.
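To make the Jacobian point concrete, here’s a minimal sketch in Python (the LogNormal toy model and all the names are mine, not Stan code): a positive parameter sigma is optimized on the unconstrained scale u = log(sigma), and dropping vs. keeping the log-Jacobian term changes which mode you find.

```python
import numpy as np
from scipy import optimize, stats

# Toy: sigma > 0 with a LogNormal(0, 1) prior, optimized on the
# unconstrained scale u = log(sigma).

def neg_log_density(u, include_jacobian):
    sigma = np.exp(u)
    lp = stats.lognorm.logpdf(sigma, s=1.0)  # log p(sigma)
    if include_jacobian:
        lp += u  # log |d sigma / d u| = u for sigma = exp(u)
    return -lp

# Penalized MLE (Jacobian dropped): mode of p(sigma) on the sigma scale.
u_pmle = optimize.minimize_scalar(lambda u: neg_log_density(u, False)).x
# Jacobian kept: mode of the density of u, pushed back through exp.
u_map = optimize.minimize_scalar(lambda u: neg_log_density(u, True)).x

print(np.exp(u_pmle))  # ~ exp(-1) ~= 0.368, the mode in sigma
print(np.exp(u_map))   # ~ 1.0, i.e., exp of the mode in u
```

Same model, same optimizer, two different answers, which is why the result without the Jacobian isn’t a MAP estimate on the unconstrained scale.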
Andrew likes to think of all these methods as approximate Bayes.
The penalized max likelihood estimate gives you the center of a Laplace
approximation, from which you can gauge uncertainty in the same way as
with variational inference; the difference is that the approximating
normal is centered on the mode rather than on an approximate mean.
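Here’s a minimal sketch of the Laplace idea in Python (the Gamma toy posterior is my choice for illustration): find the mode, measure the curvature there, and use a normal with that mean and inverse-curvature variance.

```python
import numpy as np
from scipy import optimize

# Toy posterior: theta ~ Gamma(shape=3, rate=1),
# so log p(theta) = 2*log(theta) - theta + const.

def neg_log_post(theta):
    return -(2.0 * np.log(theta) - theta)

# 1) Find the mode (where a penalized-MLE-style optimizer would land).
res = optimize.minimize_scalar(neg_log_post, bounds=(1e-6, 50.0),
                               method="bounded")
mode = res.x

# 2) Curvature at the mode via a finite-difference second derivative.
h = 1e-4
hess = (neg_log_post(mode + h) - 2 * neg_log_post(mode)
        + neg_log_post(mode - h)) / h**2

# 3) Laplace approximation: Normal(mode, 1/hess).
laplace_sd = np.sqrt(1.0 / hess)
print(mode)        # ~ 2.0 (the mode; the true posterior mean is 3.0)
print(laplace_sd)  # ~ sqrt(2) ~= 1.41 (true posterior sd is sqrt(3))
```

The gap between the mode (2.0) and the mean (3.0) in this skewed example is exactly the mode-vs.-mean distinction above.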
There are lots of cases where optimization won’t work in theory
(because there’s no MLE, as in a typical hierarchical regression
model, where the density grows without bound as the hierarchical
variance shrinks to zero) but variational inference would work
(because there is a posterior mean).
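A quick numerical sketch of why the hierarchical MLE doesn’t exist (the tiny normal-normal model and data here are made up for illustration): put every group effect at the population mean and shrink the group-level scale, and the joint density diverges.

```python
import numpy as np
from scipy import stats

# Hierarchical normal model: y_j ~ N(theta_j, 1), theta_j ~ N(mu, tau).
# Set theta_j = mu and let tau -> 0: the N(mu, tau) factors blow up,
# so the joint density over (mu, tau, theta) has no finite maximum.

y = np.array([-1.0, 0.5, 2.0])  # made-up group observations

def joint_log_density(mu, tau):
    theta = np.full_like(y, mu)  # put every group effect at mu
    return (stats.norm.logpdf(y, theta, 1.0).sum()
            + stats.norm.logpdf(theta, mu, tau).sum())

for tau in [1.0, 0.1, 0.01]:
    print(tau, joint_log_density(0.5, tau))  # grows as tau shrinks
```

The posterior mean is still perfectly well defined here, which is why a mean-seeking method like variational inference can succeed where mode-seeking optimization cannot.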
I tend to think of “stochastic” methods as meaning ones that
stream over data, like stochastic gradient descent (which can
be deterministic if you don’t randomize mini-batches).
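For concreteness, here’s SGD in that streaming-over-mini-batches sense, sketched in Python on a made-up least-squares problem; note the pass over fixed-order batches is fully deterministic because nothing is shuffled.

```python
import numpy as np

# SGD for least squares, streaming over fixed mini-batches.
# With a fixed batch order (no shuffling), the updates are deterministic.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
w_true = np.array([1.5, -2.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(2)
lr, batch = 0.05, 50
for epoch in range(20):
    for i in range(0, len(y), batch):      # fixed order: deterministic
        Xb, yb = X[i:i + batch], y[i:i + batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # mini-batch gradient
        w -= lr * grad

print(w)  # ~ [1.5, -2.0]
```

Each update only ever touches one mini-batch, which is what makes the method suitable for data that arrive as a stream.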
I’m not sure what you mean by needing stochastic optimization
methods. The ADVI paper also talks about using stochastic
variational inference (in the streaming data sense), hence my
confusion about what we’re talking about.
I didn’t go into details, but that BB method you cite sounds
like it has the same motivation as the L-BFGS method we currently
use, which also approximates inverse Hessian-vector products using
gradients without computing full Hessians.
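The gradient-only inverse-Hessian-vector product in L-BFGS is the standard two-loop recursion; here’s a sketch in Python (the quadratic test problem and iterate history are made up) that applies the approximate inverse Hessian to a vector using only stored differences of points and gradients.

```python
import numpy as np

# L-BFGS two-loop recursion: apply an approximate inverse Hessian to a
# vector using only stored pairs s_k = x_{k+1} - x_k and
# y_k = grad_{k+1} - grad_k, never a full Hessian.

def lbfgs_apply_inv_hessian(v, s_list, y_list):
    q = v.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):  # newest first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    # Initial Hessian scaling from the most recent pair.
    s, y = s_list[-1], y_list[-1]
    q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        beta = rho * (y @ q)
        q += (alpha - beta) * s
    return q  # ~= H^{-1} v

# Sanity check on a quadratic f(x) = 0.5 x' A x, where y_k = A s_k.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
      np.array([1.0, 1.0]), np.array([0.5, -0.5])]
s_list = [xs[i + 1] - xs[i] for i in range(3)]
y_list = [grad(xs[i + 1]) - grad(xs[i]) for i in range(3)]
g = np.array([1.0, 2.0])
print(lbfgs_apply_inv_hessian(g, s_list, y_list))  # close to solve(A, g)
```

The recursion costs O(m n) for m stored pairs in n dimensions, versus O(n^2) to even store a Hessian, which is the whole point of the quasi-Newton construction.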
If comments are truly noise, we just ignore them.
- Bob