Optimize (and thus Laplace) yields biased estimates

Stan’s optimize seems to yield biased MAP estimates in a weighted model, especially for scale parameters.

Attached is the cmdstanr model (modified from brms) together with the corresponding standata.

stan-opt.Rdata (599.4 KB)

This model has two weights: one on the subject level (weights_ID) and one on the observation level (weights). The former is derived from the first observation of the latter.

// observation-level weights scale each observation's log-likelihood
// (brms-style partial log-likelihood; ptarget is the running partial sum)
for (n in 1:N) {
  int nn = n + start - 1;
  ptarget += weights[nn] * normal_lpdf(Y[nn] | mu[n], sigma);
}

...
// subject-level weights scale the prior on the standardized random effects
target += weights_ID[n] * std_normal_lpdf(z_1[:, n]);

The point estimates for the random-effect dispersion terms are clearly off under optimize and laplace:

  • MCMC:
mcmc <- model$sample(data=sdata, parallel_chains=2, chains=2, iter_sampling = 6000, max_treedepth=11)
mcmc$summary('sd_1')
# A tibble: 2 × 10
  variable  mean median     sd    mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 sd_1[1]  1.17   1.17  0.145  0.144  0.932 1.41   1.00    3133.    5452.
2 sd_1[2]  0.498  0.497 0.0236 0.0234 0.460 0.537  1.00     334.    1173.
  • Optimize
opt <- model$optimize(data=sdata, tol_obj = 1e-9, tol_rel_obj = 1e-3, tol_grad = 1e-9, tol_rel_grad = 1e-3, tol_param = 1e-13, init_alpha = 1e-2, algorithm = 'lbfgs', iter = 3e5)
opt$summary('sd_1')
# A tibble: 2 × 2
  variable estimate
  <chr>       <dbl>
1 sd_1[1]  3.36e+ 1
2 sd_1[2]  1.62e-29
  • Laplace
laplace <- model$laplace(data=sdata, draws=1000, opt_args=list(tol_obj = 1e-9,
              tol_rel_obj = 1e-3,
              tol_grad = 1e-9,
              tol_rel_grad = 1e-3,
              tol_param = 1e-13,
              init_alpha = 1e-3,
              algorithm = 'lbfgs', 
              iter = 3e5))
laplace$summary('sd_1')

# A tibble: 2 × 7
  variable  mean median    sd   mad    q5   q95
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 sd_1[1]  45.5   41.8  19.1  17.1  22.7   79.0
2 sd_1[2]   9.51   8.88  4.06  3.69  4.39  16.9

Is there any tweak I could make to prevent this? Or is this a problem of gradient ascent?

P.S.: VB seems to be a bit better. However, it sometimes fails (which is bad).

vb <- model$variational(sdata, adapt_iter=1000, tol_rel_obj = 1e-3)
vb$summary('sd_1')
# A tibble: 2 × 7
  variable   mean median      sd     mad      q5    q95
  <chr>     <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
1 sd_1[1]  0.0215 0.0139 0.0251  0.0110  0.00316 0.0643
2 sd_1[2]  0.468  0.468  0.00415 0.00422 0.461   0.475 

The category mentions brms, but the comparison suggests this is all in Stan, is that right?

Without more information it’s hard to know exactly, but I can see, and would actually expect, some problems not being amenable to optimization methods while remaining feasible through sampling (especially HMC). I assume you looked at the posterior values, and optimize/laplace seem to be stuck at a local optimum or some erroneous value.
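One quick check along these lines (a sketch, assuming the model, sdata, and mcmc objects from above, and a CmdStan recent enough to support the jacobian argument to optimize): compare the log density at the optimize solution with the lp__ of the posterior draws.

# Re-run optimize with the Jacobian adjustment so its lp__ is on the same
# scale as the sampler's lp__ (by default optimize maximizes the
# unadjusted joint density).
opt_j <- model$optimize(data = sdata, jacobian = TRUE)
opt_j$summary('lp__')

# Distribution of lp__ over the posterior draws; a "MAP" lp__ far below
# these values suggests the optimizer is stuck at a poor local optimum.
summary(as.vector(mcmc$draws('lp__')))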

Do you get any warnings with optimize? The ideal way would be to get an idea of the trajectory taken by each method, though that’s not always practical.

Thanks.

Yes. The problem is not specific to brms but rather lies with Stan’s gradient-based optimization. However, my code is modified from brms, so there is a slim chance it is caused by brms-generated code somewhere (I don’t personally believe it is).

There is no warning whatsoever from either optimize or laplace (I have attached the dataset, so it is fully reproducible). Is Stan’s optimizer perhaps so lenient on its convergence criteria that it easily falls into a local optimum? Should some momentum be added?

I agree, it’s unlikely. If it’s not too much work, it may be worth checking, since it’s the easiest cause to fix.

I’d say the default optimizer, and the available options in general, should be suitable for most problems; you can check the documentation and look into whether that may not be the case for your particular problem. From what I can gather, there isn’t an option to add momentum to the optimizers, but it could be that this wouldn’t solve it either. My instinct for anything beyond the simplest models is “just MCMC-sample”, which is also the reason I never really got into INLA.

One way of testing this hypothesis is reducing the size of the model, or simplifying it until optimization works as expected.
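An easy first simplification might be neutralizing the weights (a sketch, assuming the standata list stores them under the names weights and weights_ID, as in the Stan snippet above):

# Set all weights to 1 to test whether the double weighting is the culprit.
sdata_unw <- sdata
sdata_unw$weights    <- rep(1, length(sdata$weights))
sdata_unw$weights_ID <- rep(1, length(sdata$weights_ID))

opt_unw <- model$optimize(data = sdata_unw, algorithm = 'lbfgs')
opt_unw$summary('sd_1')  # sane estimates here would point at the weighting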

The last option I can think of is the posterior being quite skewed, with the mean/median actually far from the mode, but that seems unlikely to be the case with the values this far off.


Good point! I also don’t put much faith in anything but MCMC for complicated models. Its downside, however, is speed. I am doing a simulation study where only the point estimates are needed; exploring the whole posterior, while exact, is extremely expensive and not useful here.

I will dig into this a bit. I tried several runs of optimize, and the model gives different answers each time, which also indicates that the latent space is too difficult for the current Stan optimizer. The double-weighting technique I use is likely the cause.
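Roughly, the multi-run check looked like this (a sketch; seed-based random inits are just one way to vary the starting points, and error handling for non-converging runs is omitted):

# Multi-start check: if the sd_1 estimates disagree across random inits,
# the optimizer is landing in different local optima.
ests <- sapply(1:10, function(s) {
  fit <- model$optimize(data = sdata, seed = s, init = 2,
                        algorithm = 'lbfgs')
  fit$summary('sd_1')$estimate
})
print(ests)  # one column per run; a large spread indicates multimodality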
