This suggests to me that stochastic Adagrad is having trouble discovering the right length scale for the parameters. A good optimisation algorithm shouldn’t be troubled by this.
How hard would it be to implement something like a stochastic quasi-Newton method, eg as described in this paper? This describes a stochastic version of the BFGS algorithm that doesn’t suffer quadratically with the number of parameters.
By approximately learning the Hessian matrix this should deal with the issue of poor parameter scaling. It shouldn’t require any more calculation than Adagrad requires either - just the gradient at each step.
The results in that paper suggest that not only does it give faster convergence but that it gets to a much better solution.