ADVI / Rats example / Adagrad

JulianK · September 12, 2017, 3:11am

Curiosity drove me back to look at the Rats example in the context of ADVI, which I posted about on the old Google Groups discussion list.

I can get reasonable results from ADVI for RATS if I start with init=0 and eta=10 but not eta=1.

The higher the value of eta I use, the faster the convergence, provided vb() runs at all (otherwise I get a message about the problem being too ill-conditioned).

This drove me back to look at the Kucukelbir et al. ADVI paper.

Limited memory Adagrad / length of memory

In section E, “setting a step size for ADVI”, I note that the per-parameter step size for ADVI is derived based on a sliding window containing the sum of the squared gradients for the last 10 iterations.
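To make the quantity in question concrete, here is a minimal Python sketch of a per-parameter step size built from a sliding window of squared gradients, as described above. It is not Stan's implementation: the class name `WindowedAdagrad`, the damping constant `tau`, and the exact form `eta / (tau + sqrt(s))` are assumptions for illustration, and any iteration-dependent decay factor is left out.

```python
from collections import deque

import numpy as np


class WindowedAdagrad:
    """Illustrative limited-memory Adagrad step size (hypothetical, not Stan's code)."""

    def __init__(self, eta=1.0, window=10, tau=1.0):
        self.eta = eta    # overall scale, playing the role of eta in the post
        self.tau = tau    # assumed damping term that keeps the denominator positive
        # Only the most recent `window` squared gradients are remembered.
        self.history = deque(maxlen=window)

    def step_size(self, grad):
        """Record a gradient and return the per-parameter step sizes."""
        grad = np.asarray(grad, dtype=float)
        self.history.append(grad ** 2)
        # Sum of squared gradients over the window, one entry per parameter.
        s = np.sum(np.stack(list(self.history)), axis=0)
        return self.eta / (self.tau + np.sqrt(s))


opt = WindowedAdagrad(eta=10.0, window=10)
for _ in range(15):
    rho = opt.step_size([1.0, 2.0])
print(rho)
```

With a window of 10, gradients older than ten iterations no longer influence the step size, so the denominator tracks only the recent gradient scale.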

How was this figure of “10” chosen? Some of the poor ADVI convergence seen in different models could be caused by this quantity being too noisy because of the short, finite history. Did you ever consider trying a longer window?
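As a rough illustration of the noise concern, the sketch below compares how much a windowed average of squared gradients fluctuates for a window of 10 versus 100. The gradients here are synthetic independent standard normal draws, which is purely an assumption; real ADVI gradients are neither stationary nor independent, so this only shows the general effect of window length. The function name `window_estimate_spread` is hypothetical.

```python
import numpy as np


def window_estimate_spread(window, n_iter=20000, seed=0):
    """Relative spread (std / mean) of a sliding-window mean of squared gradients.

    The gradients are synthetic N(0, 1) draws, an assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal(n_iter)
    # Moving average of g^2 over the last `window` iterations.
    kernel = np.ones(window) / window
    running_mean = np.convolve(g ** 2, kernel, mode="valid")
    return running_mean.std() / running_mean.mean()


spread_10 = window_estimate_spread(10)
spread_100 = window_estimate_spread(100)
print(f"window=10:  relative spread {spread_10:.3f}")
print(f"window=100: relative spread {spread_100:.3f}")
```

For independent standard normal gradients, the relative spread of a window of length W is about sqrt(2 / W), so a longer window gives a steadier denominator at the cost of reacting more slowly to changes in gradient scale.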

Just something to consider…

Bob_Carpenter · September 17, 2017, 4:20pm

Thanks for the input. I really don’t know the answer. Is the 10 a default that’s configurable? Is a different number more robust but maybe slower? If so, we’d go for more robustness in general.

Alp and Dustin aren’t really working on Stan any more, so I don’t know what their original motivations were. Andrew Gelman and Yuling Yao are looking at improving ADVI, but I don’t know where they’re at with it. I think their preliminary results were that scale mattered a lot.
