ADVI / Stochastic quasi-Newton methods

JulianK · September 17, 2017, 5:30am

Hi everyone,

In this thread I note the following comment from @Bob_Carpenter:

This suggests to me that stochastic Adagrad is having trouble discovering the right length scale for the parameters. A good optimisation algorithm shouldn’t be troubled by this.

How hard would it be to implement something like a stochastic quasi-Newton method, e.g., as described in this paper? It describes a stochastic version of the BFGS algorithm whose cost doesn’t grow quadratically with the number of parameters.

By approximately learning the Hessian matrix this should deal with the issue of poor parameter scaling. It shouldn’t require any more calculation than Adagrad requires either - just the gradient at each step.
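To make the "no quadratic cost" point concrete, here is a minimal sketch of the standard deterministic L-BFGS two-loop recursion. This is not the stochastic variant from the linked paper, and the function name `lbfgs_direction` and the toy matrix `A` are made up for illustration. It approximates the inverse-Hessian-vector product from a short history of parameter differences `s` and gradient differences `y`, so memory and work are linear in the number of parameters per stored pair.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Approximate (inverse Hessian) @ grad from stored (s, y) pairs, oldest first."""
    q = np.asarray(grad, dtype=float).copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    # First loop: walk the history from the newest pair to the oldest.
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q -= alpha * y
    # Scale by a scalar initial inverse-Hessian guess taken from the newest pair.
    s_new, y_new = s_list[-1], y_list[-1]
    r = (np.dot(s_new, y_new) / np.dot(y_new, y_new)) * q
    # Second loop: walk the history from the oldest pair to the newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

# Toy, badly scaled quadratic f(x) = 0.5 * x @ A @ x (hypothetical example).
A = np.diag([1.0, 100.0, 10000.0])
rng = np.random.default_rng(0)
s_list = [rng.normal(size=3) for _ in range(2)]
y_list = [A @ s for s in s_list]  # exact gradient differences for a quadratic

# The approximation satisfies the secant condition for the newest pair: H @ y = s.
direction = lbfgs_direction(y_list[-1], s_list, y_list)
```

With noisy gradients the differences `y` are noisy too, which is the part the stochastic variants have to handle; this sketch does not address that.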

The results in that paper suggest that it not only converges faster but also reaches a much better solution.

Thoughts?

Julian

bgoodri · September 17, 2017, 6:16am

@yuling tried a trust region algorithm, but it didn’t help much.

Bob_Carpenter · September 19, 2017, 11:30pm

A better optimization algorithm might help and we encourage people to experiment to make it better.

It’s not just the length scale for the parameters; it’s also correlations and varying curvature/scale around the posterior. What we want is something that dynamically accounts for that, like Riemannian HMC, but that’s even worse than quadratic.

I think we also need to understand how the posterior geometry relates to the Gaussian approximation (which is itself often only a diagonal).

joh4n · September 20, 2017, 6:47am

Using natural gradients would provide the desired effect, at least in my understanding. It should be a bigger improvement compared to a different optimizer, although both should help.
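For reference, a sketch of what a natural-gradient step looks like, assuming the mean-field Gaussian family is parameterized by a mean and standard deviation $(\mu_i, \sigma_i)$ per coordinate (an assumption for illustration; the symbols $\lambda$, $\mathcal{L}$ and $F$ are not from the thread). The natural gradient rescales the ordinary gradient of the objective $\mathcal{L}$ by the inverse Fisher information $F$ of the variational family:

$$
\tilde\nabla_\lambda \mathcal{L} = F(\lambda)^{-1}\,\nabla_\lambda \mathcal{L},
\qquad
F(\mu_i, \sigma_i) =
\begin{pmatrix} 1/\sigma_i^2 & 0 \\ 0 & 2/\sigma_i^2 \end{pmatrix}
$$

$$
\tilde\nabla_{\mu_i} \mathcal{L} = \sigma_i^2\,\frac{\partial \mathcal{L}}{\partial \mu_i},
\qquad
\tilde\nabla_{\sigma_i} \mathcal{L} = \frac{\sigma_i^2}{2}\,\frac{\partial \mathcal{L}}{\partial \sigma_i}
$$

Under this parameterization the step in each coordinate scales with the current width of the approximation rather than with a fixed step size.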

Bob_Carpenter · September 20, 2017, 5:26pm

If by that you mean following the latent Riemannian manifold defined by the Hessian (conditioned), then yes, it will. We’re still testing the higher-order autodiff and figuring out how to build it, but hopefully that will be done sooner rather than later. The only problem is the cubic cost in the parameters to compute the curvature information. So we may need to look at sparse or block diagonal or other speedups.

joh4n · September 22, 2017, 9:54pm

Yes, I believe that we are talking about the same thing. I guess it is a trade-off between accuracy and speed (as in most cases). One big problem right now is the optimizer’s inability to move “long” distances, so if the model is not designed to be at unit scale, one might run into problems.

It would be nice to be able to control the initialization of the covariance matrix (fullrank) or the standard deviations (meanfield). It is hard (impossible?) to make sure that all the parameters of the covariance matrix are close to unit scale for fullrank. If one could initialize those values as well, it would be a lot easier to get it to work, although that would move away from an automated/black-box approach.
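The “long distances” problem can be illustrated with a minimal sketch of textbook Adagrad on a deterministic 1-D objective. This is an assumption-laden toy: Stan’s ADVI uses its own adaptive step-size sequence and noisy gradients, and the function name `adagrad_1d` and the settings below are made up for illustration. Each Adagrad step is at most `eta` in magnitude, so a parameter whose optimum sits far from the starting point in absolute terms is barely reached, while the same problem at unit scale is fine.

```python
import numpy as np

def adagrad_1d(target, eta=0.1, n_steps=200, eps=1e-8):
    """Minimize 0.5 * (x - target)**2 from x = 0 with plain Adagrad."""
    x = 0.0
    grad_sq_sum = 0.0
    for _ in range(n_steps):
        grad = x - target
        grad_sq_sum += grad ** 2
        # Per-step movement is eta * |grad| / sqrt(sum of squared grads) <= eta.
        x -= eta * grad / (np.sqrt(grad_sq_sum) + eps)
    return x

# Same objective at two scales (hypothetical values).
x_unit = adagrad_1d(target=1.0)
x_large = adagrad_1d(target=1000.0)

rel_err_unit = abs(x_unit - 1.0) / 1.0
rel_err_large = abs(x_large - 1000.0) / 1000.0
print(f"relative error at unit scale:  {rel_err_unit:.4f}")
print(f"relative error at scale 1000: {rel_err_large:.4f}")
```

Rescaling the model or starting closer to the optimum both shrink the distance that has to be covered, which is why being able to set the initial values would help.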
