I was looking more into ADVI (and taking a shot at breaking it up into more sensible pieces…) as part of the output refactor. It struck me that I can’t tell from the ADVI paper why they rely on stochastic gradient ascent (and as an aside, why we don’t just implement stochastic gradient ascent as a separate algorithm since it’s already in Stan). Anybody know the history or general ADVI background here?

LBFGS requires accurate deterministic gradients, but in ADVI only stochastic gradients are available. Stochastic gradient ascent with decreasing step size is simple to implement. There are alternative step size adjustment algorithms some which have been used a lot in ML and some of of those might be better for ADVI. There are couple papers in using stochastic quasi-Newton methods with line-search etc.AFAIK these have not yet been used outside of these papers, and require a bit more in implementation, but might be useful in the future.

Thanks that’s very helpful. I must’ve missed this in the algorithm just b/c it calls the same gradient calc as everything else. I’ll have to spend more time with it.

In any case it sounds like the optimization step might be nice to abstract out of the advi implementation.

The gradients in ADVI are not gradients with respect to the log target density but rather the expected lower bound (ELBO) of the variational objective. The ELBO can be written as an integral which in ADVI is approximated with a Monte Carlo estimator. The gradient of that integral is then approximated as the average gradient of the integrand evaluated at the Monte Carlo samples, which for ADVI is an unbiased estimator of the true gradients. Hence by construction ADVI has access to only stochastic gradients for the internal optimization.

This highlights one of the huge pains the ADVI saga has been – ADVI is not some general approach with various degrees of freedom to be optimized but rather a very particular implementation with choices of those degrees of freedom, like how to differentiate the ELBO, built in to the definition of the algorithm itself. There are all kinds of alternative implementations for these small choices that might be far more robust, especially with some of the newer technologies in the autodiff library, but they have not been evaluated and would yield a completely different algorithm, at least in the machine learning sense.

Regardless, stochastic gradient methods can be extremely fragile and require careful hand-tuning in order to yield reasonable results, especially for nontrivial objectives. It would be a poor choice to expose to users for the same reason we never exposed random walk Metropolis, ensemble samplers, and the like.

This is helpful, thanks.

I think this is also why these procedures are harder to break down into reasonable modules: there are just a lot of somewhat arbitrary pieces. Not that it can’t be cleaned up but I can see the gains are limited.

From what I hear from talking to Matt Hoffman and Alp Kucukelbir (authors of NUTS and ADVI), there’s likely to be a lot of people working on the optimization angle of ADVI going forward.

It’d be nice to abstract out the optimizer here, but it’d also be super useful to abstract L-BFGS out of our optimizer, and also to disentangle all of the algorithms from a template parameter dependency on our model class.

The latter’s going to be a high priority going forward so that we can improve compile time. The first step’s going to be a usable abstract base class for models without any templated methods (so that they can be virtual). Of course, none of this will fly if there’s an appreciable slowdown from the virutal function calls, so we will profile carefully before doing this.

This was also Breck’s and my experience building a lot of large-scale (millions of training data points) sparse L1-regularlized logistic regressions using SGD. The official Robbins-Monro learning rate schemes are way too slow, but more aggressive approaches require more tuning. It’s why I always take any SGD results with a grain of salt, as they’re often dependent on lots of grad-student hours tuning.

Alp (author of ADVI) was suggesting we look into the Nesterov momentum accelerations for the optimization components. @betanalpha was skeptical when I mentioned it to him, as it’s likely to make any stochastic errors worse like L-BFGS (I don’t have any experience with either of these).

I have an ADVI re-write that I should push to the repo as an example of what I’m thinking, haven’t had the time to.

Good thing about the new ADVI diagnostic is that, if optimization fails and the result is bad, the diagnostic will recommend running MCMC instead. If ADVI optimization doesn’t take much time and it works sometimes (and we actually know the more likely cases, e.g., n\gg p and p not too big) it can be useful.

That’s good to hear. I need to understand the diagnostics better.