Does ADVI stepsize choice have the same problem as its stopping rule?

Does setting the stepsize also need improvement in the current ADVI implementation?

The Robust VI paper’s motivation is that ∆ELBO is too noisy, and on a scale that varies across models, to be used as a stopping rule (see @avehtari 's more detailed summary).

I thought that even though the updated stopping rule is implemented, the same problem could remain in setting the stepsize (eta in the code). We might be trying to fix the ‘noisy objective function’ problem (mostly this part) with a fixed stepsize that is itself the result of an operation on that same noisy objective function.

I have thought of two possibilities:

  1. Improvement not needed:
    Since the relative scale is the problem, as the paper puts it (“More generally, sometimes the objective estimates are too noisy relative to the chosen step size η”), optimizing the stopping rule given the fixed step size might be enough?

  2. Improvement needed:
    If so, could I ask for opinions on how to improve it? I guess the Monte Carlo standard error suggested in the paper cannot be applied here, since the stepsize, unlike the iterates \lambda_{t}, does not form a Markov chain (or does it?). See the sketch after this list for the kind of MCSE estimate I mean.
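
For concreteness, here is a generic batch-means construction of such an MCSE over the trace of one coordinate of \lambda_{t}; this is a sketch of the general technique, not necessarily the paper’s exact estimator.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Generic batch-means estimate of the Monte Carlo standard error of the
// mean of a (possibly autocorrelated) scalar trace, e.g. one coordinate
// of the iterate history lambda_t. A sketch of the technique, not the
// paper's estimator.
double batch_means_mcse(const std::vector<double>& trace,
                        std::size_t n_batches = 20) {
  const std::size_t batch_size = trace.size() / n_batches;
  if (batch_size == 0) return 0.0;  // trace too short to batch
  std::vector<double> batch_means(n_batches, 0.0);
  for (std::size_t b = 0; b < n_batches; ++b) {
    for (std::size_t i = 0; i < batch_size; ++i)
      batch_means[b] += trace[b * batch_size + i];
    batch_means[b] /= batch_size;
  }
  double grand_mean = 0.0;
  for (double m : batch_means) grand_mean += m;
  grand_mean /= n_batches;
  double var = 0.0;  // sample variance of the batch means
  for (double m : batch_means) var += (m - grand_mean) * (m - grand_mean);
  var /= n_batches - 1;
  return std::sqrt(var / n_batches);  // MCSE of the overall mean
}
```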

Please let me know if I am missing something, thanks!


How does the current stepsize selection work anyway?


It heuristically decreases the step size until no divergence is observed. I have linked the corresponding code in the post above (the last ‘operation on the noisy objective function’ part).


Weird, I had a look at that code.

It looks like what it’s doing is picking, from a short list of possible step sizes (https://github.com/stan-dev/stan/blob/develop/src/stan/variational/advi.hpp#L176):

double eta_sequence[eta_sequence_size] = {100, 10, 1, 0.1, 0.01};

the one that gets the best ELBO after running ADVI for a short time?
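
In pseudocode, my reading is something like the sketch below; this is just how I read it, not the actual implementation, and `trial_run` is a hypothetical stand-in for the short trial optimization.

```cpp
#include <array>
#include <cmath>
#include <functional>
#include <limits>

// Sketch of the adaptation loop as I read advi.hpp (not the actual
// implementation). `trial_run` is a hypothetical stand-in that runs a
// fixed, small number of ADVI iterations with scale factor eta and
// returns the final ELBO estimate (NaN or -inf on divergence).
double adapt_eta(const std::function<double(double)>& trial_run) {
  const std::array<double, 5> eta_sequence = {100, 10, 1, 0.1, 0.01};
  double best_eta = eta_sequence.back();
  double best_elbo = -std::numeric_limits<double>::infinity();
  for (double eta : eta_sequence) {
    const double elbo = trial_run(eta);
    if (std::isfinite(elbo) && elbo > best_elbo) {  // divergent runs are skipped
      best_elbo = elbo;
      best_eta = eta;
    }
  }
  return best_eta;
}
```

Note that under this reading the winning eta is itself selected from noisy trial ELBOs, which is exactly the concern in the original post.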

Is that how you’re reading this too? What are the other options people use for setting stepsize?

You are right; I got confused by this adaptation part, where the stepsize is decreased while computing the ELBO for each eta. After all, eta is not the stepsize itself; according to Appendix E of the ADVI paper, the stepsize is
\rho_{k}^{(i)}=\frac{\eta \, k^{-1/2+\epsilon}}{\tau+\sqrt{s_{k}^{(i)}}}

So, is it a correct understanding that, since each trial run is short, a smaller eta could result in a smaller ELBO simply because the optimizer has not yet reached the optimum?

Some ideal conditions for a stepsize:

  • minimizes a local approximation (Taylor quadratic or higher order) of the objective at the current point
  • provable convergence
  • sufficient decrease (the Armijo condition; see the sketch after this list)
  • curvature condition, etc.

For more on stepsize selection, see Chapter 3 of Nocedal and Wright, Numerical Optimization.
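
As an illustration of the sufficient-decrease condition, a classical backtracking line search looks roughly like this; it is the textbook construction, not something ADVI does (and each test of the condition would require another evaluation of an objective that, here, is noisy):

```cpp
#include <functional>

// Textbook backtracking line search enforcing the sufficient-decrease
// (Armijo) condition: shrink alpha until
//   f(x + alpha * d) <= f(x) + c * alpha * (f'(x) * d).
// One-dimensional for simplicity; a generic sketch, not ADVI code.
double backtracking_stepsize(const std::function<double(double)>& f,
                             double x, double grad, double direction,
                             double alpha = 1.0, double shrink = 0.5,
                             double c = 1e-4) {
  const double fx = f(x);
  const double slope = grad * direction;  // directional derivative at x
  while (f(x + alpha * direction) > fx + c * alpha * slope)
    alpha *= shrink;
  return alpha;
}
```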

The ADVI paper says AdaGrad, which has proven convergence properties, is used with a modification: they limit the number of historic gradients used to compute the step size, via an exponentially weighted average of squared gradients.
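
Roughly, that adaptive sequence could be implemented like the sketch below; this is my paraphrase of the formula quoted earlier, not the Stan source, and the constants are only illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the adaptive step-size sequence quoted above (my paraphrase,
// not the Stan source). The exponentially weighted average of squared
// gradients is the modification that limits how many historic gradients
// influence the step size, unlike plain AdaGrad's full running sum.
struct AdaptiveStepsize {
  double eta;              // scale factor chosen by the eta search heuristic
  double tau = 1.0;        // damping constant
  double alpha = 0.1;      // weight on the newest squared gradient
  double epsilon = 1e-16;  // small perturbation of the 1/sqrt(k) decay
  std::vector<double> s;   // per-coordinate squared-gradient average
                           // (zero-initialized here for simplicity; the
                           // paper seeds it with the first squared gradient)

  // Step size rho_k^{(i)} for iteration k >= 1 and coordinate i, given
  // the current gradient component.
  double rho(std::size_t k, std::size_t i, double grad) {
    if (s.size() <= i) s.resize(i + 1, 0.0);
    s[i] = alpha * grad * grad + (1.0 - alpha) * s[i];
    return eta * std::pow(static_cast<double>(k), -0.5 + epsilon)
           / (tau + std::sqrt(s[i]));
  }
};
```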

Yeah, the reason you don’t want a small stepsize is that the optimization takes longer, so I think this was the intention.

With ADVI, the gradients and the ELBO itself are super noisy, and that changes which algorithms are useful. It’s hard to figure out convergence-like things because the ELBO is so noisy, and I think that’s doubly true for the gradients (so I don’t think the curvature conditions would be easy to check either).
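
To make “super noisy” concrete, here is a toy example, unrelated to Stan’s code: repeated K-draw Monte Carlo ELBO estimates for a Gaussian approximation q = N(m, s^2) of a standard normal target. The spread across repeats is the noise any stopping or stepsize rule has to contend with.

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Toy illustration, unrelated to Stan's code: how noisy is a K-draw Monte
// Carlo ELBO estimate? Target p = N(0, 1), approximation q = N(m, s^2).
// The entropy term is exact; E_q[log p(theta)] is estimated from K draws.
int main() {
  const double pi = 3.14159265358979323846;
  std::mt19937 rng(1234);
  std::normal_distribution<double> std_normal(0.0, 1.0);
  const double m = 0.5, s = 1.2;
  const double entropy = 0.5 * std::log(2.0 * pi * std::exp(1.0) * s * s);
  for (int n_draws : {1, 100}) {
    std::vector<double> elbos;
    for (int rep = 0; rep < 1000; ++rep) {  // 1000 independent estimates
      double sum = 0.0;
      for (int k = 0; k < n_draws; ++k) {
        const double theta = m + s * std_normal(rng);              // theta ~ q
        sum += -0.5 * theta * theta - 0.5 * std::log(2.0 * pi);    // log p
      }
      elbos.push_back(sum / n_draws + entropy);
    }
    double mean = 0.0;
    for (double e : elbos) mean += e;
    mean /= elbos.size();
    double var = 0.0;
    for (double e : elbos) var += (e - mean) * (e - mean);
    var /= elbos.size() - 1;
    std::printf("K = %3d draws: ELBO mean %+.3f, sd %.3f\n",
                n_draws, mean, std::sqrt(var));
  }
  return 0;
}
```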
