Does ADVI stepsize choice have the same problem as its stopping rule?

Does setting the stepsize also need improvement in the current ADVI implementation?

The Robust VI paper’s motivation is that ∆ELBO is too noisy, and on a scale that varies across models, to be used as a stopping rule (see @avehtari 's more detailed summary).

I thought that even though the updated stopping rule is implemented, the same problem could remain in setting the stepsize (eta in the code). We might be trying to fix the ‘noisy objective function’ problem (mostly this part) with a fixed stepsize that is itself the result of an operation on that same noisy objective function.

I have thought of two possibilities:

  1. Improvement not needed:
    Since the relative scale is the problem, as the paper puts it (“More generally, sometimes the objective estimates are too noisy relative to the chosen step size η”), optimizing the stopping rule given the fixed step size might be enough?

  2. Improvement needed:
    If so, could I ask for opinions on how to improve it? I guess the Monte Carlo standard error suggested in the paper cannot be applied here, since the stepsize, unlike the iterates \lambda_{t}, does not form a Markov chain (or does it?). See the sketch after this list for the kind of MCSE estimate I mean.
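
For concreteness, here is a generic batch-means construction of such an MCSE over the trace of one coordinate of \lambda_{t}; this is a sketch of the general technique, not necessarily the paper’s exact estimator.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generic batch-means estimate of the Monte Carlo standard error of the
// mean of a (possibly autocorrelated) scalar trace, e.g. one coordinate
// of the iterate history lambda_t. A sketch of the technique, not the
// paper's estimator.
double batch_means_mcse(const std::vector<double>& trace,
                        std::size_t n_batches = 20) {
  const std::size_t batch_size = trace.size() / n_batches;
  if (batch_size == 0) return 0.0;  // trace too short to batch
  std::vector<double> batch_means(n_batches, 0.0);
  for (std::size_t b = 0; b < n_batches; ++b) {
    for (std::size_t i = 0; i < batch_size; ++i)
      batch_means[b] += trace[b * batch_size + i];
    batch_means[b] /= batch_size;
  }
  double grand_mean = 0.0;
  for (double m : batch_means) grand_mean += m;
  grand_mean /= n_batches;
  double var = 0.0;  // sample variance of the batch means
  for (double m : batch_means) var += (m - grand_mean) * (m - grand_mean);
  var /= n_batches - 1;
  return std::sqrt(var / n_batches);  // MCSE of the overall mean
}
```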

Please let me know if I am missing something, thanks!


How does the current stepsize selection work anyway?


It heuristically decreases the step size until no divergence is observed. I have linked the corresponding code in the post above (the last ‘operation on the noisy objective function’ part).


Weird, I had a look at that code.

It looks like what it’s doing is picking, from a short list of possible step sizes (https://github.com/stan-dev/stan/blob/develop/src/stan/variational/advi.hpp#L176):

double eta_sequence[eta_sequence_size] = {100, 10, 1, 0.1, 0.01};

the one that gets the best ELBO after running ADVI for a short time?
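
In pseudocode, my reading is something like the sketch below; this is just how I read it, not the actual implementation, and `trial_run` is a hypothetical stand-in for the short trial optimization.

```cpp
#include <array>
#include <cmath>
#include <functional>
#include <limits>

// Sketch of the adaptation loop as I read advi.hpp (not the actual
// implementation). `trial_run` is a hypothetical stand-in that runs a
// fixed, small number of ADVI iterations with scale factor eta and
// returns the final ELBO estimate (NaN or -inf on divergence).
double adapt_eta(const std::function<double(double)>& trial_run) {
  const std::array<double, 5> eta_sequence = {100, 10, 1, 0.1, 0.01};
  double best_eta = eta_sequence.back();
  double best_elbo = -std::numeric_limits<double>::infinity();
  for (double eta : eta_sequence) {
    const double elbo = trial_run(eta);
    if (std::isfinite(elbo) && elbo > best_elbo) {  // divergent runs are skipped
      best_elbo = elbo;
      best_eta = eta;
    }
  }
  return best_eta;
}
```

Note that under this reading the winning eta is itself selected from noisy trial ELBOs, which is exactly the concern in the original post.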

Is that how you’re reading this too? What are the other options people use for setting stepsize?

You are right; I got confused by this adaptation part, where the stepsize is decreased while computing the ELBO for each eta. After all, eta is not the stepsize itself; according to Appendix E of the ADVI paper, the stepsize is
\rho_{k}^{(i)}=\frac{\eta \, k^{-1/2+\epsilon}}{\tau+\sqrt{s_{k}^{(i)}}}

So, is it a correct understanding that, since each trial run is short, a smaller eta could result in a smaller ELBO simply because the optimizer has not yet reached the optimum?

Some ideal conditions for a stepsize:

  • minimizes a local approximation (Taylor quadratic or higher order) of the objective at the current point
  • provable convergence
  • sufficient decrease (the Armijo condition; see the sketch after this list)
  • curvature condition, etc.

For more on stepsize selection, see Chapter 3 of Nocedal and Wright, Numerical Optimization.
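
As an illustration of the sufficient-decrease condition, a classical backtracking line search looks roughly like this; it is the textbook construction, not something ADVI does (and each test of the condition would require another evaluation of an objective that, here, is noisy):

```cpp
#include <functional>

// Textbook backtracking line search enforcing the sufficient-decrease
// (Armijo) condition: shrink alpha until
//   f(x + alpha * d) <= f(x) + c * alpha * (f'(x) * d).
// One-dimensional for simplicity; a generic sketch, not ADVI code.
double backtracking_stepsize(const std::function<double(double)>& f,
                             double x, double grad, double direction,
                             double alpha = 1.0, double shrink = 0.5,
                             double c = 1e-4) {
  const double fx = f(x);
  const double slope = grad * direction;  // directional derivative at x
  while (f(x + alpha * direction) > fx + c * alpha * slope)
    alpha *= shrink;
  return alpha;
}
```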

The ADVI paper says AdaGrad, which has proven convergence properties, is used with a modification: they limit the number of historic gradients used to compute the step size, via an exponentially weighted average of squared gradients.
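
Roughly, that adaptive sequence could be implemented like the sketch below; this is my paraphrase of the formula quoted earlier, not the Stan source, and the constants are only illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the adaptive step-size sequence quoted above (my paraphrase,
// not the Stan source). The exponentially weighted average of squared
// gradients is the modification that limits how many historic gradients
// influence the step size, unlike plain AdaGrad's full running sum.
struct AdaptiveStepsize {
  double eta;              // scale factor chosen by the eta search heuristic
  double tau = 1.0;        // damping constant
  double alpha = 0.1;      // weight on the newest squared gradient
  double epsilon = 1e-16;  // small perturbation of the 1/sqrt(k) decay
  std::vector<double> s;   // per-coordinate squared-gradient average
                           // (zero-initialized here for simplicity; the
                           // paper seeds it with the first squared gradient)

  // Step size rho_k^{(i)} for iteration k >= 1 and coordinate i, given
  // the current gradient component.
  double rho(std::size_t k, std::size_t i, double grad) {
    if (s.size() <= i) s.resize(i + 1, 0.0);
    s[i] = alpha * grad * grad + (1.0 - alpha) * s[i];
    return eta * std::pow(static_cast<double>(k), -0.5 + epsilon)
           / (tau + std::sqrt(s[i]));
  }
};
```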

Yeah, the reason you don’t want a small stepsize is that the optimization takes longer, so I think this was the intention.

With ADVI, the gradients and the ELBO itself are super noisy, and that changes which algorithms are useful. It’s hard to figure out convergence-like things because the ELBO is so noisy, and I think that’s doubly true for the gradients (so I don’t think the curvature conditions would be easy to check either).
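
To make “super noisy” concrete, here is a toy example, unrelated to Stan’s code: repeated K-draw Monte Carlo ELBO estimates for a Gaussian approximation q = N(m, s^2) of a standard normal target. The spread across repeats is the noise any stopping or stepsize rule has to contend with.

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Toy illustration, unrelated to Stan's code: how noisy is a K-draw Monte
// Carlo ELBO estimate? Target p = N(0, 1), approximation q = N(m, s^2).
// The entropy term is exact; E_q[log p(theta)] is estimated from K draws.
int main() {
  const double pi = 3.14159265358979323846;
  std::mt19937 rng(1234);
  std::normal_distribution<double> std_normal(0.0, 1.0);
  const double m = 0.5, s = 1.2;
  const double entropy = 0.5 * std::log(2.0 * pi * std::exp(1.0) * s * s);
  for (int n_draws : {1, 100}) {
    std::vector<double> elbos;
    for (int rep = 0; rep < 1000; ++rep) {  // 1000 independent estimates
      double sum = 0.0;
      for (int k = 0; k < n_draws; ++k) {
        const double theta = m + s * std_normal(rng);              // theta ~ q
        sum += -0.5 * theta * theta - 0.5 * std::log(2.0 * pi);    // log p
      }
      elbos.push_back(sum / n_draws + entropy);
    }
    double mean = 0.0;
    for (double e : elbos) mean += e;
    mean /= elbos.size();
    double var = 0.0;
    for (double e : elbos) var += (e - mean) * (e - mean);
    var /= elbos.size() - 1;
    std::printf("K = %3d draws: ELBO mean %+.3f, sd %.3f\n",
                n_draws, mean, std::sqrt(var));
  }
  return 0;
}
```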
