ADVI / Rats example / Adagrad

Curiosity drove me back to look at the Rats example in the context of ADVI that I posted on the old Google Groups discussion list.

I can get reasonable results from ADVI for Rats if I start with init=0 and eta=10, but not with eta=1.

The higher the value of eta I use, the faster the convergence, as long as vb() runs at all (otherwise I get a message about the problem being too ill-conditioned).
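For what it's worth, eta acts as a pure global scale in the paper's step-size rule: every per-parameter step is proportional to it. Here's a minimal sketch of my reading of that rule, rho_i = eta * i^(-1/2+eps) / (tau + sqrt(s)); the function name and the tau/eps defaults are my assumptions, not Stan's code:

```python
import numpy as np

def advi_step_size(i, s, eta, tau=1.0, eps=1e-16):
    """Per-parameter step size in the form the ADVI paper describes:
    rho_i = eta * i^(-1/2 + eps) / (tau + sqrt(s)),
    where s holds the accumulated squared-gradient history per parameter."""
    return eta * i ** (-0.5 + eps) / (tau + np.sqrt(s))

# Hypothetical accumulated squared gradients for two parameters,
# one well-scaled and one with a large gradient history.
s = np.array([1.0, 100.0])
for eta in (1.0, 10.0):
    print(eta, advi_step_size(1, s, eta))
```

So multiplying eta by 10 multiplies every step by 10, and when the accumulated squared gradients are large only a large eta gives steps big enough to move — consistent with what I'm seeing.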

This drove me back to look at the Kucukelbir et al. ADVI paper.

Limited memory Adagrad / length of memory

In section E, “setting a step size for ADVI”, I note that the per-parameter step size is derived from a sliding window holding the sum of the squared gradients over the last 10 iterations.

How was this figure of “10” chosen? Some of the poor ADVI convergence seen across different models could be due to this quantity being too noisy with such a short history. Did you ever consider trying a longer window?
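To make the question concrete, here's a toy version of the update I'm describing — plain gradient descent where the step size divides by the square root of a sliding-window sum of squared gradients. This is just a sketch for experimenting with the window length, not Stan's implementation:

```python
import numpy as np
from collections import deque

def limited_memory_adagrad(grad_fn, x0, eta=1.0, window=10, tau=1.0, n_iter=500):
    """Gradient descent with a windowed Adagrad step size:
    rho = eta / (tau + sqrt(sum of the last `window` squared gradients))."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    sq_grads = deque(maxlen=window)  # sliding window of squared gradients
    for _ in range(n_iter):
        g = grad_fn(x)
        sq_grads.append(g ** 2)
        s = sum(sq_grads)               # windowed sum of squared gradients
        rho = eta / (tau + np.sqrt(s))  # per-parameter step size
        x = x - rho * g
    return x

# Toy quadratic: f(x) = x^2, gradient 2x, minimum at 0.
x = limited_memory_adagrad(lambda x: 2.0 * x, [5.0], eta=1.0, window=10)
```

Re-running with, say, window=100 gives a smoother denominator at the cost of reacting more slowly when the gradient scale changes — which is exactly the noise-versus-adaptivity trade-off I'm asking about.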

Just something to consider…

Thanks for the input. I really don’t know the answer. Is the 10 a default that’s configurable? Is a different number more robust but maybe slower? If so, we’d go for more robustness in general.

Alp and Dustin aren’t really working on Stan any more, so I don’t know what their original motivations were. Andrew Gelman and Yuling Yao are looking at improving ADVI, but I don’t know where they’re at with it. I think their preliminary results were that scale mattered a lot.