Does stepsize selection also need improvement in the current ADVI implementation?
The Robust VI paper’s motivation is that ∆ELBO is too noisy and not scale-optimized to be used as a stopping rule (@avehtari 's more detailed summary). I thought that even though the updated version of the stopping rule is implemented, the same problem could remain in setting the stepsize (eta in the code). We might be trying to fix the ‘noisy objective function’ problem (mostly this part) with a fixed stepsize that is itself the result of an operation on that noisy objective function.
I have thought of two possibilities:
> More generally, sometimes the objective estimates are too noisy relative to the chosen step size η.

Improvement not needed:
Since the relative scale is the problem (the quote from the paper above), optimizing only the stopping rule given a fixed step size might be enough?

Improvement needed:
If so, could I ask for opinions on how to improve it? I guess the Monte Carlo standard error suggested in the paper cannot be applied here, as the stepsize, unlike the solution \lambda_{t}, does not form a Markov chain (or does it?).
Please let me know if I am missing something, thanks!
How does the current stepsize selection work anyway?
It heuristically decreases the step size until no divergence is observed. I have linked the corresponding code in the post above; it is the last operation over the noisy objective part.
Weird, I had a look at that code.
It looks like what it’s doing is, from a short list of possible step sizes (https://github.com/stan-dev/stan/blob/develop/src/stan/variational/advi.hpp#L176):
`double eta_sequence[eta_sequence_size] = {100, 10, 1, 0.1, 0.01};`
picking the one that gets the best ELBO after some short amount of ADVI running time?
Is that how you’re reading it too? What other options do people use for setting the stepsize?
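That is how I read the loop too. A minimal sketch of that selection logic (the candidate sequence is the one from advi.hpp; `elbo_after_short_run` is a hypothetical stand-in for actually running ADVI for a few iterations, not Stan's code):

```cpp
#include <array>
#include <cmath>
#include <limits>

// Hypothetical stand-in: run ADVI briefly with this eta and return the
// ELBO reached. Here a smooth proxy that happens to peak near eta = 1.
double elbo_after_short_run(double eta) {
  return -std::pow(std::log10(eta), 2);
}

double pick_eta() {
  // The candidate sequence from advi.hpp.
  const std::array<double, 5> eta_sequence = {100, 10, 1, 0.1, 0.01};
  double best_eta = eta_sequence[0];
  double best_elbo = -std::numeric_limits<double>::infinity();
  for (double eta : eta_sequence) {
    double elbo = elbo_after_short_run(eta);
    // Keep the eta with the best finite ELBO; non-finite runs (divergence)
    // are skipped.
    if (std::isfinite(elbo) && elbo > best_elbo) {
      best_elbo = elbo;
      best_eta = eta;
    }
  }
  return best_eta;
}
```

So the adaptation is just a grid search over five values, scored by a short noisy run.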
You are right, I got confused by the adaptation part, where the stepsize is decreased while computing the ELBO for each eta. After all, eta is not the stepsize itself; according to appendix E of the ADVI paper, the stepsize is
\rho_{k}^{(i)}=\frac{\eta}{\tau+\sqrt{s_{k}^{(i)}}}
So, is it a correct understanding that, since the adaptation run is short, a smaller eta could result in a smaller ELBO simply because the optimizer has not yet reached the optimum?
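For concreteness, here is a sketch of that per-coordinate update: \rho_k = \eta / (\tau + \sqrt{s_k}), as quoted above. The value \tau = 1 and the decayed running average for s_k are my reading of the paper's appendix, not necessarily Stan's exact implementation:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-coordinate step rho_k = eta / (tau + sqrt(s_k)), following the
// formula quoted from appendix E of the ADVI paper. tau = 1 and the
// decay constant alpha are assumptions from my reading of the paper.
struct AdagradLike {
  double eta, tau, alpha;
  std::vector<double> s;  // decayed sum of squared gradients, per coordinate

  AdagradLike(double eta_, std::size_t dim, double tau_ = 1.0,
              double alpha_ = 0.1)
      : eta(eta_), tau(tau_), alpha(alpha_), s(dim, 0.0) {}

  // Fold the new gradient into s and return the step for each coordinate.
  std::vector<double> step(const std::vector<double>& grad) {
    std::vector<double> rho(grad.size());
    for (std::size_t k = 0; k < grad.size(); ++k) {
      s[k] = alpha * grad[k] * grad[k] + (1 - alpha) * s[k];
      rho[k] = eta / (tau + std::sqrt(s[k]));
    }
    return rho;
  }
};
```

Note that eta only sets the overall scale; the denominator keeps adapting per coordinate, which is why a short adaptation run with a small eta can stall before reaching a good ELBO.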
Some ideal conditions for a stepsize:
- minimizes an approximation (Taylor quadratic or higher order) of the function at the current point
- proven convergence
- sufficient decrease
- curvature condition, etc.

For more stepsize settings, see Chapter 3 of Nocedal & Wright.
The ADVI paper says adaGrad, which has proven convergence properties, is used with a modification: they limited the number of historic gradients used to compute the step size.
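That modification matters: in classic adaGrad the denominator accumulates every squared gradient, so the step can only shrink, whereas a decayed (limited-memory) variant lets the step recover after a stretch of small gradients. A scalar sketch of the contrast (my illustration, not Stan's code):

```cpp
#include <cmath>

// Classic adaGrad: s accumulates all squared gradients, so the step
// eta / sqrt(s) is monotonically non-increasing.
double adagrad_step(double& s, double g, double eta) {
  s += g * g;
  return eta / std::sqrt(s);
}

// Decayed variant (in the spirit of the ADVI paper's modification):
// old gradients are discounted, so the step can grow again after a
// stretch of small gradients.
double decayed_step(double& s, double g, double eta, double alpha) {
  s = alpha * g * g + (1 - alpha) * s;
  return eta / std::sqrt(s);
}
```

With noisy stochastic gradients, the forgetting keeps one early large gradient estimate from permanently throttling the step size.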
Yeah, the reason you don’t want a small step size is that the optimization takes longer, so I think that was the intention.
With ADVI the gradients and the ELBO itself are super noisy, and that changes which algorithms are useful. It’s hard to figure out convergence-like things because the ELBO is so noisy, and I think that’s doubly true for the gradients (so I don’t think curvature conditions would be easy to check either).