Curiosity drove me back to what I posted about ADVI on the old Google Groups discussion list.
I can get reasonable results from ADVI for RATS if I start with init=0 and eta=10, but not eta=1.
The higher the value of eta I use, the faster the convergence I get, if vb() runs (otherwise I get a message about the problem being too ill-conditioned).
This sent me back to the Kucukelbir et al. ADVI paper.
Limited memory Adagrad / length of memory
In section E, “setting a step size for ADVI”, I note that the per-parameter step size for ADVI is derived from a sliding window containing the sum of the squared gradients over the last 10 iterations.
How was this figure of “10” chosen? Some of the poor ADVI convergence seen in different models could be caused by this quantity being too noisy because of the finite history. Did you ever consider trying a longer window?
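To make the question concrete, here is a rough Python sketch of a windowed Adagrad step size as I read it from the paper. It is not Stan's implementation: the form eta / sqrt(tau + s), the value of tau, and the names WindowedAdagrad and step_size_spread are all my own assumptions for illustration. It feeds in unit-variance Gaussian noise as a stand-in gradient and reports how much the step size jitters, relative to its mean, for a given window length.

```python
import math
import random
from collections import deque


class WindowedAdagrad:
    """Per-parameter step sizes from the last `window` squared gradients.

    Assumed form (not taken from Stan's source): eta / sqrt(tau + s),
    where s is the sum of squared gradients currently in the window.
    """

    def __init__(self, n_params, eta=1.0, window=10, tau=1.0):
        self.eta = eta
        self.tau = tau
        # One bounded history per parameter; old entries drop off automatically.
        self.history = [deque(maxlen=window) for _ in range(n_params)]

    def step_sizes(self, grad):
        sizes = []
        for hist, g in zip(self.history, grad):
            hist.append(g * g)
            s = sum(hist)
            sizes.append(self.eta / math.sqrt(self.tau + s))
        return sizes


def step_size_spread(window, n_iter=2000, seed=0):
    """Relative spread (std / mean) of the step size under pure-noise gradients."""
    rng = random.Random(seed)
    opt = WindowedAdagrad(n_params=1, window=window)
    sizes = [opt.step_sizes([rng.gauss(0.0, 1.0)])[0] for _ in range(n_iter)]
    tail = sizes[window:]  # skip the iterations where the window is still filling
    mean = sum(tail) / len(tail)
    var = sum((x - mean) ** 2 for x in tail) / len(tail)
    return math.sqrt(var) / mean


if __name__ == "__main__":
    for w in (10, 100):
        print(w, step_size_spread(w))
```

I compare the relative spread rather than the raw step size because, with a plain sum, a longer window also shrinks the step size overall, so eta would need rescaling to make a fair comparison of convergence speed.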
Just something to consider…