This suggests to me that stochastic Adagrad is having trouble discovering the right length scale for the parameters. A good optimisation algorithm shouldn’t be troubled by this.
How hard would it be to implement something like a stochastic quasi-Newton method, e.g. as described in this paper? It describes a stochastic version of the BFGS algorithm whose cost doesn't scale quadratically with the number of parameters.
By approximately learning the Hessian, this should deal with the issue of poor parameter scaling. It shouldn't require any more computation than Adagrad does either - just the gradient at each step.
The results in that paper suggest that it not only converges faster but also reaches a much better solution.
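For concreteness, here is a rough Python sketch of what a limited-memory stochastic quasi-Newton loop looks like (my own illustration, not the exact algorithm from the paper, which adds safeguards against gradient noise that I've omitted). The two-loop recursion keeps per-step cost and memory linear in the number of parameters, and the hypothetical `grad_fn` stands in for a minibatch gradient:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: approximate (inverse Hessian) @ grad from stored (s, y) pairs."""
    q = grad.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        alpha = (s @ q) / (y @ s)
        q -= alpha * y
        alphas.append(alpha)
    if s_list:                                             # initial inverse-Hessian scaling
        s, y = s_list[-1], y_list[-1]
        q *= (s @ y) / (y @ y)
    for (s, y), alpha in zip(zip(s_list, y_list), reversed(alphas)):   # oldest pair first
        beta = (y @ q) / (y @ s)
        q += (alpha - beta) * s
    return q

def stochastic_lbfgs(grad_fn, x0, n_steps=1000, step_size=0.1, memory=10, curv_tol=1e-4):
    """Naive stochastic limited-memory quasi-Newton loop.
    grad_fn(x) is assumed to return a noisy (minibatch) gradient."""
    x = np.asarray(x0, dtype=float).copy()
    s_list, y_list = [], []
    x_prev = g_prev = None
    for _ in range(n_steps):
        g = grad_fn(x)
        if g_prev is not None:
            s, y = x - x_prev, g - g_prev
            if s @ y > curv_tol * (s @ s):        # keep only pairs with positive curvature
                s_list.append(s); y_list.append(y)
                if len(s_list) > memory:
                    s_list.pop(0); y_list.pop(0)
        direction = lbfgs_direction(g, s_list, y_list)
        x_prev, g_prev = x.copy(), g
        x = x - step_size * direction
    return x
```

The point is that both the memory and the per-step work stay O(memory × dimension), so it avoids the quadratic cost of maintaining a full inverse-Hessian approximation.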
A better optimization algorithm might well help, and we encourage people to experiment with improving it.
It's not just the length scale of the parameters; it's also the correlations between them and the varying curvature/scale across the posterior. What we want is something that accounts for that dynamically, like Riemannian HMC, but that scales even worse than quadratically.
I think we also need to understand how the posterior geometry relates to the Gaussian approximation (which itself often has only a diagonal covariance).
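To make the mean-field point concrete: for a correlated Gaussian posterior, the KL(q||p)-optimal factorized Gaussian matches the diagonal of the precision matrix rather than of the covariance, so it systematically underestimates the marginal variances. A quick numpy check of that standard result:

```python
import numpy as np

rho = 0.9
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # correlated "posterior" covariance
Lambda = np.linalg.inv(Sigma)                # precision matrix
mf_var = 1.0 / np.diag(Lambda)               # variances of the KL(q||p)-optimal diagonal Gaussian
print(mf_var)   # [0.19 0.19] -- far below the true marginal variances of 1.0
```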
Using natural gradients would provide the desired effect, at least in my understanding. It should be a bigger improvement than just swapping in a different optimizer, although both should help.
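For what it's worth, here is a small sketch (my own illustration, not Stan code) of a natural-gradient step for a mean-field Gaussian q = N(mu, diag(sigma^2)) parameterized by (mu, log_sigma), assuming we are minimizing the negative ELBO. Because the Fisher information of this q is diagonal and known in closed form, the natural gradient is just a coordinate-wise rescaling and costs no more than the plain gradient:

```python
import numpy as np

def natural_gradient_step(mu, log_sigma, grad_mu, grad_log_sigma, lr=0.1):
    """One natural-gradient step for q = N(mu, diag(sigma^2)), parameters (mu, log_sigma).
    The Fisher information of q is diagonal: F_mu = 1/sigma^2 and F_log_sigma = 2,
    so multiplying by F^{-1} just rescales each coordinate of the gradient."""
    sigma2 = np.exp(2.0 * log_sigma)
    mu_new = mu - lr * sigma2 * grad_mu                     # F^{-1} for mu is sigma^2
    log_sigma_new = log_sigma - lr * 0.5 * grad_log_sigma   # F^{-1} for log_sigma is 1/2
    return mu_new, log_sigma_new
```

Notice how the update for mu is automatically scaled by sigma^2, which is exactly the "right length scale" behaviour the plain optimizer is missing.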
If by that you mean following the latent Riemannian manifold defined by the (conditioned) Hessian, then yes, it will. We're still testing the higher-order autodiff and figuring out how to build it, but hopefully that will be done sooner rather than later. The only problem is the cubic cost in the number of parameters to compute the curvature information, so we may need to look at sparse or block-diagonal approximations or other speedups.
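As a sketch of the block-diagonal idea (illustrative only, not how it would actually be wired into Stan), one would factorize each small block of the Hessian separately instead of the full d x d matrix, so the cost is the sum of the block-size cubes rather than d^3:

```python
import numpy as np

def block_newton_direction(grad, blocks, hessian_blocks, jitter=1e-6):
    """Preconditioned direction from a block-diagonal Hessian approximation.
    blocks: list of index arrays partitioning the parameters;
    hessian_blocks[i]: the Hessian restricted to blocks[i]."""
    direction = np.zeros_like(grad, dtype=float)
    for idx, H in zip(blocks, hessian_blocks):
        H_reg = H + jitter * np.eye(len(idx))      # regularize each block before solving
        direction[idx] = np.linalg.solve(H_reg, grad[idx])
    return direction
```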
Yes, I believe we are talking about the same thing. I guess it is a trade-off between accuracy and speed (as in most cases). One big problem right now is the inability of the optimizer to move "long" distances, so if the model is not designed to be at unit scale one might run into problems.
It would be nice to have the additional ability to control the initialization of the covariance matrix / standard deviations of the fullrank/meanfield approximations, as it is hard (impossible?) to make sure that all the parameters of the covariance matrix are close to unit scale for fullrank. If one could initialize those values as well, it would be a lot easier to get it to work, although that would move away from an automated/black-box approach.
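Just to illustrate what I mean (a hypothetical interface, not an existing option), something along these lines for the meanfield case, where the user can pass rough locations and scales to seed the variational parameters instead of always starting at mu = 0, sigma = 1:

```python
import numpy as np

def init_meanfield(param_scales, param_locs=None):
    """Hypothetical: initialize (mu, log_sigma) of the mean-field approximation from
    rough user-supplied locations and scales, so the optimizer doesn't have to travel
    "long" distances when the model is not at unit scale."""
    scales = np.asarray(param_scales, dtype=float)
    mu = np.zeros_like(scales) if param_locs is None else np.asarray(param_locs, dtype=float)
    log_sigma = np.log(scales)          # start sigma at the user's guess instead of 1
    return mu, log_sigma

# e.g. a parameter expected to sit around 1e4 with a spread of roughly 1e3:
mu0, log_sigma0 = init_meanfield(param_scales=[1e3], param_locs=[1e4])
```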