I’m moving this to Discourse because that’s where we’re having
dev discussions (cc-ed Dustin in case he’s not on Discourse).
The first step is to make sure that GMO works well enough that
we want to include it in Stan. Do you think you’re done with
that part? How robust is it relative to ADVI and MCMC across the
hundreds of test models we have?
Once you’re convinced that the algorithm’s evolved enough to
go into Stan, the first thing Michael suggested was to design
what the GMO API is going to look like from a client perspective.
The current services are in:
You can also look at something like the RStan interface and
think about it from that perspective.
The move to C++ came up because Andrew kept bringing up that
things were slow due to having to go through RStan’s
I/O. But I’m not sure that’s right, and the problem may be deeper,
in that you really want only the derivatives you need. The
thing to test there would be derivative speed for a model with
data already loaded, for all of the parameters versus just the global
parameters (with local parameters fixed as data, so it still
needs two models). That’ll give you an idea of how much speed
there is to be gained.
Then we need second-order autodiff if you need Hessians. We
could use finite diffs, but that’s slow. And we have zero people
willing to work on second- and higher-order autodiff. At some
point, algorithms like GMO and RHMC are going to require finishing
the second- (and third-) order autodiff.
A key point is that we haven’t finalized the algorithm such that we’re confident it will work on a specific class of models, converge with good diagnostics, or quantify its inaccuracy. The motivation for implementing it in Stan is experimentation rather than applied use. That is, to better understand GMO, we need more access to Stan, not only for faster computation but also for some of its internals.
As an example of the latter point, we cannot do ADVI for the inner optimization well, because a key part of GMO is warm starts (providing good inits for later iterations), and ADVI does not let us specify the standard-deviation initialization of the normal variational approximation. For now, I think we can hack at these things, with the clear intention that the hacks will never go into Stan itself; they’re only for temporary experimentation to see whether we need such features in the future.
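To make the warm-start point concrete, here’s a toy sketch of why carrying the previous inner solution forward as the next initialization pays off. The quadratic inner objective and all names (`inner_optimize`, `outer_loop`, `phi_path`) are made up for illustration, not Stan or GMO API:

```python
import numpy as np

def inner_optimize(phi, init, lr=0.2, tol=1e-8, max_iter=10_000):
    """Toy inner step: gradient descent on 0.5 * (alpha - phi)^2,
    whose mode is exactly phi. Returns (mode, iteration count)."""
    alpha, steps = init, 0
    while abs(alpha - phi) > tol and steps < max_iter:
        alpha -= lr * (alpha - phi)
        steps += 1
    return alpha, steps

def outer_loop(phi_path, alpha0, warm=True):
    """Run the inner step along a path of outer-parameter values,
    optionally warm-starting each inner solve at the previous mode."""
    alpha, total = alpha0, 0
    for phi in phi_path:
        init = alpha if warm else alpha0
        alpha, steps = inner_optimize(phi, init)
        total += steps
    return total

phi_path = np.linspace(0.0, 1.0, 20)
warm_cost = outer_loop(phi_path, alpha0=10.0, warm=True)
cold_cost = outer_loop(phi_path, alpha0=10.0, warm=False)
```

Because successive outer iterates are close, each warm-started inner solve starts near its mode and the total iteration count drops relative to cold starts; that’s the effect we lose if the inner approximation can’t be initialized.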
As you mention, an immediate hurdle in just moving from GMO’s R code to C++ is second-order autodiff. I imagine we can use finite diffs for now; this would not be any slower than it currently is in RStan (unless I’m mistaken).
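For reference, a central finite-difference Hessian is only a few lines; the slowness comes from the O(n^2) function evaluations it needs. This is my own sketch, not a Stan Math function:

```python
import numpy as np

def fd_hessian(f, x, eps=1e-4):
    """Central finite-difference Hessian of a scalar function f at x.
    Costs 4 * n^2 evaluations of f, hence slow for large n."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp = x.copy(); xpp[i] += eps; xpp[j] += eps
            xpm = x.copy(); xpm[i] += eps; xpm[j] -= eps
            xmp = x.copy(); xmp[i] -= eps; xmp[j] += eps
            xmm = x.copy(); xmm[i] -= eps; xmm[j] -= eps
            H[i, j] = (f(xpp) - f(xpm) - f(xmp) + f(xmm)) / (4 * eps**2)
    return H

# Check on a quadratic: f(x) = -0.5 x' A x has Hessian exactly -A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda x: -0.5 * x @ A @ x
H = fd_hessian(f, np.zeros(2))
```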
If you just want to use core Stan for testing, do this on a branch that you don’t expect to merge. I’ve implemented a couple of wacky ideas that way that never made it into Stan itself, and it skips having to stick to the developer process for the prototyping round.
Hi, a couple more things:
There are several GMO algorithms. The one we’ve been focusing on recently is Laplace-based: use a deterministic mode-finder to get the mode of the local parameters, then compute the second-derivative matrix there and draw from the normal approximation.
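That inner step can be sketched in a few lines: find the mode with a deterministic optimizer, take the Hessian of the negative log posterior at the mode, and sample from the implied Gaussian. A self-contained toy version (a Gaussian local posterior, so Newton finds the mode exactly; the names and the toy model are mine):

```python
import numpy as np

def newton_mode(grad, hess, alpha0, iters=25):
    """Deterministic mode-finder: plain Newton iteration."""
    alpha = alpha0
    for _ in range(iters):
        alpha = alpha - np.linalg.solve(hess(alpha), grad(alpha))
    return alpha

def laplace_draws(grad, hess, alpha0, n_draws, rng):
    """Laplace approximation: mode plus Gaussian with covariance H^{-1},
    where H is the Hessian of the negative log posterior at the mode."""
    mode = newton_mode(grad, hess, alpha0)
    cov = np.linalg.inv(hess(mode))
    return mode, rng.multivariate_normal(mode, cov, size=n_draws)

# Toy local posterior: alpha | phi ~ N(phi, P^{-1}) with precision P.
phi = np.array([1.0, -2.0])
P = np.array([[2.0, 0.3], [0.3, 1.0]])
grad = lambda a: P @ (a - phi)   # gradient of the negative log posterior
hess = lambda a: P               # constant Hessian for a Gaussian
mode, draws = laplace_draws(grad, hess, np.zeros(2), 4000,
                            np.random.default_rng(0))
```

In real GMO the mode-finder would be BFGS and the Hessian would come from second-order autodiff (or finite diffs), but the structure is the same.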
I don’t quite understand why we’d use ADVI for the inner optimization. Isn’t BFGS just fine for that?
Yes, we’re doing experimentation, but I’d like GMO ready for applied use as soon as possible. The use case I’m thinking of is hierarchical linear or logistic regression for problems large enough that NUTS takes hours or more.
Bob asks, “How robust is it relative to ADVI and MCMC across the hundreds of test models we have?” We can’t really run GMO on those hundreds of models without a lot of work, because GMO as currently set up requires writing multiple Stan programs to compute the different optimizations. If GMO were programmed in Stan, the way ADVI is, then we could give it a try.
In any case, GMO is not yet a mature algorithm because even if it ran reasonably well on all these examples, it’s not auto-tuned.
re:2. The inner optimization is always done in service of the outer optimization. That is, during the inner step, we need a good approximation of the local posterior p(\alpha | \phi, y) so that during the outer step, the stochastic gradients have low variance. (Recall the importance ratio between the approximation and the local posterior.)
It would be nice to see how other inference methods would play a role during that inner step. For example, we might expect in certain settings that ADVI or NUTS would be more expensive than L-BFGS for the inner step but would require fewer outer steps, speeding up the algorithm overall.
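Concretely, the inner approximation q enters the outer step through self-normalized importance weights proportional to p(\alpha | \phi, y) / q(\alpha); the closer q is to the local posterior, the lower the weight variance. A 1-D toy sketch of that estimator (the densities and names here are illustrative, not GMO code):

```python
import numpy as np

def snis(g, log_p, log_q, draws):
    """Self-normalized importance sampling: estimate E_p[g(alpha)]
    from draws ~ q, weighted by the importance ratio p/q."""
    logw = log_p(draws) - log_q(draws)
    w = np.exp(logw - logw.max())   # stabilize before exponentiating
    w /= w.sum()                    # self-normalize
    return np.sum(w * g(draws))

rng = np.random.default_rng(0)
# Toy target p = N(1, 1); approximation q = N(0, 2).
# Unnormalized log densities (constants cancel after self-normalization).
log_p = lambda a: -0.5 * (a - 1.0) ** 2
log_q = lambda a: -0.25 * a ** 2
draws = rng.normal(0.0, np.sqrt(2.0), size=20_000)
est = snis(lambda a: a, log_p, log_q, draws)  # targets E_p[alpha] = 1
```

The variance of those weights is exactly what a better inner approximation (from ADVI, NUTS, or a tighter Laplace fit) would reduce.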
Yes, I think what we need is a fast and easy-to-understand core algorithm (Laplace with BFGS and some clear adaptation rules); then other versions of GMO can be improvements on it.
But this is the hurdle, because implementing it so that it can easily run all of those models written as Stan programs will require lots of work. Either we put the burden of proof on those proposing the algorithm or on the core developers. I obviously favor the former.