ADVI: Too many dropped evaluations even for well behaved models

The current ADVI implementation can sometimes (stochastically) fail to fit even the simplest models. I think this might indicate that some of the default algorithm parameter values are suspect.

Context: I am helping @hyunji.moon to build an example of using SBC with ADVI for the SBC package, which means we need to do a lot of ADVI runs. Some of the runs stochastically result in an error instead of giving us output, even for a simple model.

Here’s a simple reproducible example with the latest cmdstanr + CmdStan 2.27.0:

library(cmdstanr)
tn <- "
data {
  int N;
  vector[N] y;
}

parameters {
  real loc;
  real<lower = 0> scale;
}

model {
  loc ~ normal(0, 1);
  scale ~ lognormal(0, 1);
  y ~ normal(loc, scale);
}
"
simple_mod <- cmdstan_model(stan_file = write_stan_file(tn))

set.seed(48752)
loc <- 1
scale <- 1

data <- list(
  N = 20,
  y = rnorm(20, loc, scale)
)

n_errors <- 0
sink("all_outputs.txt", type = "output")  # divert the verbose fitting output to a file
for(i in 1:100) {
  fit <- simple_mod$variational(data = data)
  # a non-zero return code means the ADVI run failed
  if(all(fit$return_codes() != 0)) {
    n_errors <- n_errors + 1
  }
}
sink(NULL)
  
cat("Total errors: ", n_errors)

I get 4 errors. The exact number of errors varies slightly with the seed, but there are almost always some. All of the error messages read:

Chain 1 stan::variational::normal_meanfield::calc_grad: 
The number of dropped evaluations has reached its maximum amount (10). 
Your model may be either severely ill-conditioned or misspecified.

This doesn’t appear to be documented anywhere, but looking at the code, the number of dropped evaluations that triggers the message depends on the grad_samples parameter. And indeed, when I add grad_samples = 20 to the $variational call, I quite reliably get no errors.
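For completeness, this is the call that avoids the errors for me (same model and data as above):

fit <- simple_mod$variational(data = data, grad_samples = 20)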

To reliably get no error across 1000 fits, I need to ramp up grad_samples even higher…

I don’t understand the internals of ADVI very well, but if even such a simple model has a non-negligible (a few percent) chance of failing on default settings, maybe the defaults should be made more conservative? Or is there a downside other than performance to setting grad_samples larger? Or is there something else one should do to make ADVI results a bit more stable?

I’ve noted two previous mentions of the error: Stan::variational::normal_meanfield::calc_grad - can be falsely driven by tranformed parameters? and
"stan::variational::normal_meanfield::calc_grad: The number of dropped evaluations has reached its maximum amount (10)." Is 10 a reasonable number?. In both cases the recommendation was to make the model better behaved; I however think that in the case I present here, the model is about as well behaved as you can get.


Thanks @martinmodrak for starting this thread.
I’d like to add some information relevant to improving ADVI:

  1. Returned fits are NULL, which contradicts their name (fit); the fastest fix (without addressing the underlying problem) would be to restart the fitting once the maximum number of dropped evaluations is reached (a sketch of such a retry wrapper follows below).
  2. Examples of ADVI failure that @andrewgelman shared, based on tests from 2017. Unzip the files from here: http://www.stat.columbia.edu/~gelman/temp/advi_test.zip to get a small directory, then run the file ADVI_text_explore.R. Some of the models that failed are extremely simple, for example a logistic regression with two coefficients.

I haven’t tested Andrew’s examples yet, and it would be very helpful if anyone could help run SBC following this vignette.
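Regarding point 1, a minimal user-side version of the restart idea could look like this (just a sketch built on the cmdstanr calls from the original post; the function name and max_tries are made up):

fit_advi_with_retry <- function(mod, data, max_tries = 5, ...) {
  for (attempt in seq_len(max_tries)) {
    fit <- mod$variational(data = data, ...)
    # a return code of 0 means the run finished without error
    if (all(fit$return_codes() == 0)) return(fit)
  }
  stop("ADVI failed in all ", max_tries, " attempts")
}

fit <- fit_advi_with_retry(simple_mod, data)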

Does the robust VI implementation work better?

cmdstanr::install_cmdstan(release_url = "https://github.com/Dashadower/cmdstan/releases/download/rvi4_v2/cmdstan-rvi_faso_upgrade_2.tar.gz")

Internally, “dropped evaluations” equals the number of draws in which an exception occurred. From debugging the ADVI code, there are two cases where an exception can occur:

  1. An exception occurs during stan::model::gradient.
  2. stan::math::check_finite throws an exception after the gradient has been successfully calculated, meaning NaN or Inf values are present.

Given these two cases, I ran your example code with some debugging code tacked onto Stan and investigated the parameter and gradient draws (note that the parameter draws have been inverse transformed):

MU VECTOR: 
25.3639, 10.4316,  
OMEGA VECTOR: 
-33.1603, 11.6076,  
Parameter names and draw, grad values: 
loc, scale, grad_0, grad_1,  
25.3639, inf, -25.3639, -nan, EXCEPTION RAISED  
25.3639, 0, EXCEPTION RAISED  
25.3639, 0, EXCEPTION RAISED  
25.3639, 0, EXCEPTION RAISED  
25.3639, inf, -25.3639, -nan, EXCEPTION RAISED  
25.3639, inf, -25.3639, -nan, EXCEPTION RAISED  
25.3639, 0, EXCEPTION RAISED  
25.3639, 0, EXCEPTION RAISED  
25.3639, inf, -25.3639, -nan, EXCEPTION RAISED  
25.3639, 0, EXCEPTION RAISED  
MU VECTOR: 
-8.95628, 8.4987,  
OMEGA VECTOR: 
-9.96037, 9.61455,  
Parameter names and draw, grad values: 
loc, scale, grad_0, grad_1,  
-8.95627, inf, 8.95627, -nan, EXCEPTION RAISED  
-8.9562, inf, 8.9562, -nan, EXCEPTION RAISED  
-8.95631, inf, 8.95631, -nan, EXCEPTION RAISED  
-8.95621, 0, EXCEPTION RAISED  
-8.95638, inf, 8.95638, -nan, EXCEPTION RAISED  
-8.95622, 0, EXCEPTION RAISED  
-8.95624, inf, 8.95624, -nan, EXCEPTION RAISED  
-8.95634, 0, EXCEPTION RAISED  
-8.95626, inf, 8.95626, -nan, EXCEPTION RAISED  
-8.95626, 0, EXCEPTION RAISED 

In meanfield ADVI, a draw on the unconstrained scale is generated as exp(omega) * std_normal() + mu, where omega is the log standard deviation of the approximation. For a positively constrained parameter such as scale, that unconstrained draw is then exponentiated again to map it back to the constrained space, so the values easily blow up to Inf or underflow to 0:

> exp(exp(9.6) * rnorm(10) + 8.4)
 [1]  0.000000e+00           Inf  0.000000e+00           Inf           Inf  0.000000e+00           Inf 1.173377e-298  0.000000e+00           Inf

So I believe the culprit is that the values of omega for scale at gradient evaluation are too big, causing the exponentiation to produce extreme values: a scale of 0 makes gradient() throw an error, while an inf scale yields NaN gradient values that trigger check_finite().
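For intuition on where the NaN comes from: once scale is Inf, intermediate quantities in the gradient mix 0 and Inf, and IEEE arithmetic turns such mixtures into NaN. A schematic R illustration (not the actual autodiff code):

(1.5 - 0) / Inf   # 0    : likelihood terms vanish
-log(Inf)         # -Inf : the lpdf itself degenerates
0 * Inf           # NaN  : products like this appear in chain-rule terms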

As an additional note: if ADVI fails to collect a total of grad_samples valid gradient samples (samples that don’t trigger exceptions) within 10 * grad_samples attempts, that’s when it terminates and gives you that error.
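In pseudo-R, the bookkeeping is roughly the following (a sketch of the logic as just described, not the actual C++ implementation; draw_and_eval_gradient is a hypothetical stand-in for drawing from the approximation and evaluating the model gradient):

collect_grad_samples <- function(grad_samples) {
  valid <- 0
  dropped <- 0
  while (valid < grad_samples) {
    if (draw_and_eval_gradient()) {  # TRUE = no exception was thrown
      valid <- valid + 1
    } else {
      dropped <- dropped + 1
      if (dropped >= 10 * grad_samples) {
        stop("The number of dropped evaluations has reached its maximum amount (",
             10 * grad_samples, ").")
      }
    }
  }
}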


I had assumed this problem would mostly arise at initialisation or during the early iterations. Do you have an example where there would be dropped evaluations later?

As increasing the number of gradient evaluations increases the computational cost, would it be possible to have separate minimum and maximum numbers of evaluations? (Although I wouldn’t spend too much time on trying to fix this.)

Yes, exactly. For the example model in the original post, it either fails to calculate gradients right from the start or within the first 100 iterations. Do you have any ideas on why this behavior would occur in some models? I’ve personally never had issues with running ADVI itself, just with the results being consistently inaccurate :)

Can you elaborate on what you mean by minimum evaluations? Instead of attempting to get the full number of samples, just keep whichever samples worked?


This would require a change in the code. It could be useful to be able to specify that usually, e.g., 10 draws are used, so that computation stays fast, but that if all 10 draws produce Inf/NaN, more draws are made, e.g., 100. The extra draws would then be needed only in the beginning. I would expect that Pathfinder would also help to get better initial values. I’m also sceptical of ADVI performance overall, as we demonstrate in
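A rough sketch of that minimum/maximum scheme (illustrative R only; eval_draws, n_min, and n_max are hypothetical names):

adaptive_grad_draws <- function(eval_draws, n_min = 10, n_max = 100) {
  draws <- eval_draws(n_min)          # cheap default: n_min draws
  if (all(!is.finite(draws))) {
    draws <- eval_draws(n_max)        # everything failed: retry with n_max draws
  }
  draws[is.finite(draws)]             # keep only the finite evaluations
}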


I saw this as well. For the CmdStanPy wrapper, when this happens, it will advise the user to double the grad_samples argument. The PR is here: Feature/433 notebook advi init sampler by mitzimorris · Pull Request #473 · stan-dev/cmdstanpy · GitHub
