New adaptive warmup proposal (looking for feedback)!

There’s a new, experimental warmup algorithm on the block! It’s available as an r package here: campfire (get it? campfires are for warming up)

The the big features are:

  1. Adaptive length warmup – only run warmup as long as you need
  2. Don’t throw away warmup samples unnecessarily
  3. Use information from multiple chains during warmup
  4. Automatically switch between dense, diagonal, and hessian-based metrics

This is not suitable for deployment (you’ll notice it has zero tests). This is exclusively for testing a new proposal for adaptive warmup.

If it works for people, then we can get it integrated into Stan and make it an actual production thing, until then it’s experimental.

So this is why we need people to test it and validate it! Hooray!

How the adaptation works

Before we go into the new stuff, let’s go through the current stuff.

Current adaptation

The current adaptation is broken down into windows by number of draws:

75 | 25 | 50 | 100 | 200 | 500 | 50

So that’s 75 draws in the first window, 25 in the second, 50 in the third, etc.

Stan’s timestep is adapted at every draw in all of the windows. The timestep adaptation happens faster in the first 75 and last 50 draws.

At the end of each of these windows (where the bars are) this thing called the euclidean metric (or the mass matrix) is recomputed. This was a linear transformation of the variables that basically makes the HMC algorithm work faster. By default Stan uses a diagonal covariance estimate for this.

This works well enough. But we wanna improve it! The downsides of this are:

  1. Our warmup is the same length for any model unless we carefully adjust it. We either waste time adapting too much, or do not adapt enough. Either way it is not ideal.
  2. At the end of each warmup window we toss all the previous adaptation. This seems wasteful
  3. We are currently adapting each chain individually
  4. Picking between dense and diagonal adaptation isn’t always obvious

Proposed adaptation

Instead of using variable width windows we used a fixed window size. As a default we use 100 draws, so we get something like:

100 | 100 | 100 | 100 ...

At the end of each window the metric gets recomputed (just like before). The difference now is instead of just using draws from the last window we want to potentially use all draws. To do this we take advantage of our effective sample size (ESS) estimates.

If we label our windows with letters:

A | B | C | D ...

we expect the draws in A to somehow be worse than those in B or C or D (because the later windows will be closer to the answer or there will be more adaptation). So a simple way to ask if the draws in A are inconsistent with BCD is to check Rhat (or ESS) with all the windows (ABCD) vs. just BCD.

We can go ahead and compare all combinations of ABCD, BCD, CD, and D and figure out the best one by just picking whichever has the largest ESS.

Similarly, we can decide when to stop adapting when we have hit sufficient ESS on all parameters and the Rhats are suitably low. Because we lean on Rhat and ESS, we only do this warmup with multiple chains. Because of this, when it’s time to compute a new metric, we use samples from both chains to do this!

That’s it! I also threw all the stuff from https://arxiv.org/abs/1905.11916 in there to automatically switch between different metrics and whatnot but these are really separable things.

Installation Instructions

This requires the posterior and cmdstanr packages, so gotta install those first:

devtools::install_github("jgabry/posterior")

And then for cmdstan (which requires posterior be installed first):

devtools::install_github("jgabry/cmdstanr")

If you haven’t installed cmdstanr before you can use:

library(cmdstanr)
install_cmdstan()

To download and build an appropriate cmdstan for use in cmdstanr (don’t need anything special for this).

The warmup package can be installed with:

devtools::install_github("bbbales2/campfire")

Example use

Campfire manages the warmup part of adaptation. So you call campfire::warmup with the same arguments as a normal cmdstanr model$sample call:

library(campfire)
library(rstan)

model_path = "accel_splines.stan"

data_env = new.env(parent = baseenv())
source("accel_splines.data.R", local = data_env)
data = as.list.environment(data_env)

stan_fit = stan(model_path, data = data, iter = 1)
out = warmup(model_path, stan_fit, data = data, num_chains = 2, num_cores = 2, print_stdout = FALSE)

And you can take the output of a warmup call and use it to sample the actual model. It’ll use the metric computed in campfire::warmup, but it’ll still do 50 steps of timestep adaptation at the beginning:

model = cmdstanr::cmdstan_model(model_path)
fit = do.call(model$sample, out$args)

You should be able to pass most arguments you would pass to sample to warmup and it’ll just carry them through (like number of chains, amount of sampling, etc). There are a couple extra arguments to warmup:

  1. print_stdout (boolean, default FALSE) whether to print the fits running – they produce lots of output
  2. window_size (integer, default 100) size of warmup windows
  3. max_num_windows (integer, default 10) maximum number of warmup windows to use
  4. target_rhat (double, default 1.05) all parameters must have an rhat less than this to consider warmup finished
  5. target_ess (integer, default 50) all parameters must have a bulk_ESS greater than this to consider warmup finished

The out object is a list with 3 things: args, which are arguments suitable for calling sample in cmdstanr with, usamples, which are the unconstrained warmup samples computed in adaptation, and nleapfrogs, which is the total number of leapfrog steps taken in the calculation.

While adaptation is happening you’ll see two types of things printing. The first are tibbles of condition number:

[1] "Metric calculation info (lower c better, last is always picked, minimum is 1.0):"
# A tibble: 7 x 2
  name       c_hybrid
  <chr>         <dbl>
1 diag           325.
2 dense          207.
3 hess1_wish     207.
4 hess2_wish     207.
5 lw2004         200.
6 hess2          118.
7 hess1          117.

So at each adaptation we’re proposing a few different metrics and picking between them based on this condition number thing. Lower is better, the reasoning behind this is explained here (and some docs on what those different metric options are): https://arxiv.org/abs/1905.11916

The other is a table:

[1] "Adaptive warmup debug info (figure out how many initial windows to drop):"
# A tibble: 4 x 4
  drop_first_n_windows picked max_Rhat min_Bulk_ESS
                 <int> <chr>     <dbl>        <dbl>
1                    0 ""         1.26            8
2                    1 ""         1.15           24
3                    2 picked     1.04           64
4                    3 ""         1.09           23

Which is used for the decision making of how many windows of warmup to use/drop. In this case we have decided to drop the first 2 warmup windows. Dropping these two warmups gives us a maximum Rhat of 1.04 and a minimum Bulk_ESS of 64. We pick adaptations based on maximizing Bulk_ESS. We can look at a plot of the warmup samples of one chain:

plot(out$usamples[,1,2]) # plot warmup draws for chain 1, parameter 2

That gives:
warmup

It sure seems like dropping those first two hundred samples was a good idea!

The model and data are here: accel_splines.data.R (207.6 KB) accel_splines.stan (2.7 KB)

Feedback

All feedback is good. There’s probably lots of stuff wrong with this so lemme know. Some other questions are:

  1. Does this adaptation work for your model?

  2. Does this adaptation result in the sampling stage of your model running more quickly?

  3. Does ESS or Rhat go up or down?

  4. Does this adaptation fail for your model, especially if it works with the old adaptation mechanism?

  5. How many warmup windows does it take for your model to decide it is warmed up sufficiently?

The way the package is coded right now, the warmup stage itself might take longer to run even with fewer draws cause it’s just not efficient. Also there are the 50 extra timestep adaptation steps that are not included in the campfire$warmup call.

If you have a model that behaves weirdly and can share it with data, please do!

Citations

So the ideas for the different bits here came from a bunch of people from this so I’m just gonna leave a pile of names down here, hoooray! Thanks for the ideas (edit: and help on everything else) everyone!:

@yuling, @avehtari, @arya, @aseyboldt, @bgoodri, @Bob_Carpenter, @wds15, @andrewgelman, @lauren, @rok_cesnovar

19 Likes

I was keen to try it out but cmdstanr is not working. I got past a few errors (would be good to note somewhere that coreutils / cygwin or something similar is needed to provide functionality like ‘cut’ in windows), but am stuck with all sorts of compile errors…

You only need to have RTools installed in order to run Cmdstan® on Windows, but you do have to add the C:/Rtools/bin and C:/Rtools/mingw_64/bin to the PATH environment variable. There are all needed tools like cut and others in C:/Rtools/bin.

Can you try adding C:/Rtools/bin; C:/Rtools/mingw_64/bin to the PATH variable and try again?

We could and probably will add that cmdstanr does that for you, but that doesnt help you now.

1 Like

Agree, if you have more issue, open another topic or an issue on the cmdstanr github and we can continue there to not spam this one. I do agree that this is awesome.

I am asuming things got weird because you compiled cmdstan with cygwin and have now switched to Rtools. Maybe try using install_cmdstan(overwrite=TRUE) again to use rtools from the start.

We have continuous integration running on Windows and I also do occasional work on Windows so we know that at least with Rtools it should not be a problem.

1 Like

@Charles_Driver lemme know if you get it running or not. If you have a particular stan model + data file I can run it for yah too and report back if it seems like this is behaving any better/worse than the current adaptation.

@rok_cesnovar thanks for responding to this!

@wds15, @yizhang you are the two people I know were actively developing warmup experiments. If you have any particular models/data in mind or features/warmup variations you wanted to test lemme know. This code is pretty short it might not be too hard to fit other stuff in.

@bbbales2 sorry, can’t test until the zero dimension issues I posted here https://github.com/stan-dev/cmdstanr/issues/96 get fixed!

1 Like

Hi!

This looks really cool… I am definetly interested, just out of time at the moment. What I would really like to see first is benchmarking code. So I would like to write a Batchtools based numerical experiment where we can easily add models and warmup procedures to evaluate them with replication, etc…

If I understand you right then, we should add other warmup procedures to your package to quickly explore them… makes sense to me! Messing with C++ takes a lot longer than going via an interface.

Sebastian

The 0x0 thing should be fixed! Report back how it goes. Edit: please*

very rough initial impressions on a simple hierarchical sde model with overly wide priors – maybe 10x faster to a functional n_ess. Using window_size=10, max_num_windows=100. Seems great! will test it out more thoroughly as I have time.

5 Likes

If I set a target Rhat of 1.01 and target ess of 500, is there something that should stop me using the last x samples as the basis for inference? I somehow feel like this would be wrong but also can’t exactly see why…

The warmup isn’t valid MCMC cause we’re changing the timestep every draw and the metric every window_size draws.

We’re assuming it’s doing well enough to trust the samples to build the metric, but other than that we toss em’. Sampling follows all the MCMC rules so we trust those samples. I think that’s basically it.

3 Likes

This looks really cool! I finally got to install it and throw two models on it. The models are part of my package: https://cran.r-project.org/package=OncoBayes2 I observed that

  • the “combo2” model (you can run it with example_model(“combo2”)) runs better with the vanilla cmdstan - so I am getting higher neff for a standard run (1k warmup, 1k sampling)

  • the combo3 model is runnning somewhat better with campfire warmup in terms of Neff

The runtime in both cases is of the same order… maybe campfire is a bit faster overall. Timing includes here the warmup and sampling time, but not the model setup stuff.

I would need to setup a simulation study to make the above statements more firm. Before doing that I will probably try this thing out on an ODE model which I usually run with short fixed warmup, but any save here is huge… or what else would make sense?

Any other “data” you are looking for?

Can you provide effective sample size per log density evaluation (n_leapfrog)?

Campfire run, combo2:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (0.085, 0.11, 0.12, 0.080) seconds, 0.39 seconds total
Sampling took (2.0, 1.5, 2.1, 1.9) seconds, 7.6 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -5.1e+01  1.2e-01  5.0e+00    -60  -5.1e+01  -4.4e+01  1.8e+03  2.4e+02  1.0e+00
accept_stat__              9.0e-01  4.2e-03  1.7e-01   0.55   9.6e-01   1.0e+00  1.6e+03  2.1e+02  1.0e+00
stepsize__                 1.8e-01  1.2e-02  1.7e-02   0.16   1.9e-01   2.0e-01  2.0e+00  2.6e-01  2.0e+13
treedepth__                4.6e+00  2.2e-01  4.8e-01    4.0   5.0e+00   5.0e+00  4.7e+00  6.2e-01  1.3e+00
n_leapfrog__               2.7e+01  2.8e+00  6.9e+00     15   3.1e+01   3.1e+01  6.0e+00  7.9e-01  1.2e+00
divergent__                5.0e-04      nan  2.2e-02   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   7.7e+01  1.8e-01  7.1e+00     66   7.7e+01   8.9e+01  1.6e+03  2.1e+02  1.0e+00
log_beta_raw[1,1,1]       -5.3e-02  1.3e-02  9.8e-01   -1.6  -5.0e-02   1.6e+00  5.6e+03  7.4e+02  1.0e+00
log_beta_raw[1,1,2]        5.6e-02  1.5e-02  9.8e-01   -1.6   6.1e-02   1.7e+00  4.5e+03  5.9e+02  1.0e+00
log_beta_raw[1,2,1]        1.8e-02  1.3e-02  1.0e+00   -1.6   2.0e-02   1.7e+00  5.6e+03  7.4e+02  1.0e+00
log_beta_raw[1,2,2]        2.9e-04  1.3e-02  9.9e-01   -1.6   7.7e-04   1.6e+00  5.4e+03  7.2e+02  1.0e+00

Combo2, cmdstan run:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (2.2, 2.1, 2.1, 2.2) seconds, 8.6 seconds total
Sampling took (2.1, 2.1, 1.8, 2.1) seconds, 8.1 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -5.2e+01  1.4e-01  5.2e+00    -61  -5.2e+01  -4.4e+01  1.4e+03  1.7e+02  1.0e+00
accept_stat__              9.2e-01  7.1e-03  1.5e-01   0.60   9.7e-01   1.0e+00  4.6e+02  5.7e+01  1.0e+00
stepsize__                 1.5e-01  1.7e-02  2.4e-02   0.12   1.5e-01   1.9e-01  2.0e+00  2.5e-01  3.7e+13
treedepth__                4.9e+00  1.1e-01  2.9e-01    4.0   5.0e+00   5.0e+00  6.7e+00  8.4e-01  1.2e+00
n_leapfrog__               3.0e+01  1.2e+00  3.9e+00     15   3.1e+01   3.1e+01  9.8e+00  1.2e+00  1.1e+00
divergent__                0.0e+00      nan  0.0e+00   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   7.8e+01  2.0e-01  7.3e+00     67   7.7e+01   9.0e+01  1.3e+03  1.6e+02  1.0e+00
log_beta_raw[1,1,1]       -8.1e-02  1.2e-02  9.6e-01   -1.6  -1.0e-01   1.5e+00  6.6e+03  8.2e+02  1.0e+00
log_beta_raw[1,1,2]        9.8e-02  1.2e-02  9.9e-01   -1.5   1.1e-01   1.7e+00  6.6e+03  8.2e+02  1.0e+00

combo3, campfire:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (0.21, 0.22, 0.22, 0.23) seconds, 0.88 seconds total
Sampling took (3.3, 3.3, 3.3, 3.3) seconds, 13 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -7.2e+01  1.8e-01  7.1e+00    -84  -7.1e+01  -6.0e+01  1.6e+03  1.2e+02  1.0e+00
accept_stat__              9.0e-01  2.4e-03  1.3e-01   0.65   9.5e-01   1.0e+00  2.8e+03  2.2e+02  1.0e+00
stepsize__                 4.4e-01  1.6e-02  2.3e-02   0.40   4.5e-01   4.6e-01  2.0e+00  1.5e-01  1.4e+13
treedepth__                4.0e+00      nan  1.6e-02    4.0   4.0e+00   4.0e+00      nan      nan      nan
n_leapfrog__               1.5e+01      nan  1.6e-01     15   1.5e+01   1.5e+01      nan      nan      nan
divergent__                5.0e-04      nan  2.2e-02   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   1.2e+02  2.7e-01  1.0e+01    105   1.2e+02   1.4e+02  1.4e+03  1.1e+02  1.0e+00
log_beta_raw[1,1,1]       -5.1e-02  1.2e-02  1.0e+00   -1.7  -5.8e-02   1.6e+00  6.6e+03  5.0e+02  1.0e+00
log_beta_raw[1,1,2]       -8.5e-03  1.5e-02  9.8e-01   -1.6  -5.8e-03   1.6e+00  4.5e+03  3.4e+02  1.0e+00

combo3, cmdstan

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (3.8, 3.9, 3.8, 3.8) seconds, 15 seconds total
Sampling took (3.3, 3.3, 3.3, 3.3) seconds, 13 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -7.2e+01  1.8e-01  7.1e+00    -84  -7.2e+01  -6.1e+01  1.6e+03  1.2e+02  1.0e+00
accept_stat__              9.0e-01  2.5e-03  1.4e-01   0.61   9.5e-01   1.0e+00  3.3e+03  2.5e+02  1.0e+00
stepsize__                 2.4e-01  1.2e-02  1.7e-02   0.21   2.5e-01   2.6e-01  2.0e+00  1.5e-01  1.8e+13
treedepth__                4.0e+00      nan  5.7e-02    4.0   4.0e+00   4.0e+00      nan      nan      nan
n_leapfrog__               1.5e+01      nan  1.3e+00     15   1.5e+01   1.5e+01      nan      nan      nan
divergent__                1.3e-03      nan  3.5e-02   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   1.2e+02  2.6e-01  1.0e+01    106   1.2e+02   1.4e+02  1.5e+03  1.1e+02  1.0e+00
log_beta_raw[1,1,1]       -2.8e-02  1.3e-02  9.7e-01   -1.6  -4.6e-02   1.5e+00  5.5e+03  4.1e+02  1.0e+00
log_beta_raw[1,1,2]        1.1e-03  1.3e-02  1.0e+00   -1.7   7.4e-04   1.6e+00  6.1e+03  4.6e+02  1.0e+00

Does that help?

1 Like

Yes. Campfire warmup is 20 times faster. The adaptation result is the same.

1 Like

Nonono, that warmup time only includes the final timestep adaptation so that warmup time is misleading. Looks like @wds15’s models are well enough specified that diagonal is good enough.

The only gain to be had would be if the warmup terminates earlier than the default warmup. @wds15, how many windows did it use in the initial warmup periods? Basically how many rows in the tables of numbers that were printing out.

1 Like

Combo3 runtimes:

> camp_run
   user  system elapsed 
 22.212   0.397   9.158 
> cmdstan_run
   user  system elapsed 
 28.619   0.109   7.286 

final campfire warmup message for combo3:

[1] "Adaptive warmup debug info (figure out how many initial windows to drop):"
# A tibble: 2 x 4
  drop_first_n_windows picked max_Rhat min_Bulk_ESS
                 <int> <chr>     <dbl>        <dbl>
1                    0 picked     1.04          381
2                    1 ""         1.07          182
[1] "Metric calculation info (lower c better, last is always picked, minimum is 1.0):"
# A tibble: 7 x 2
  name       c_hybrid
  <chr>         <dbl>
1 dense          7.82
2 hess1_wish     6.85
3 hess2_wish     6.71
4 lw2004         5.65
5 diag           5.47
6 hess1          5.36
7 hess2          5.27

combo2 runtime:

> camp_run
   user  system elapsed 
  9.553   0.322   4.518 
> cmdstan_run
   user  system elapsed 
 28.619   0.109   7.286 

final campfire message for combo2

[1] "Adaptive warmup debug info (figure out how many initial windows to drop):"
# A tibble: 2 x 4
  drop_first_n_windows picked max_Rhat min_Bulk_ESS
                 <int> <chr>     <dbl>        <dbl>
1                    0 ""         1.03          179
2                    1 picked     1.04          234
[1] "Metric calculation info (lower c better, last is always picked, minimum is 1.0):"
# A tibble: 7 x 2
  name       c_hybrid
  <chr>         <dbl>
1 diag          13.7 
2 hess1         13.3 
3 lw2004        11.5 
4 dense         10.9 
5 hess1_wish    10.7 
6 hess2_wish     9.74
7 hess2          9.21
2 Likes

Oh okay so your models are well parameterized enough that this mechanism decides they only need 200 iterations of warmup each (there are only two rows in the Adaptive warmup debug info table and window_size by default is 100).

You might try window_size = 50 and see if you can go even lower, but we’re computing Rhats on these windows so you don’t want them too small (in the Rhat calculations we’ll split them in half again and compute marginal variance estimates).

@seantalts, @wds15, @yizhang anyone got time to go through with me how MPI currently works in Stan?

I was talking to Yi and trying to think how MPI would work with this, but I’m distracted thinking how MPI works with Math at all.

I want to know how MPI works because I would like to write down what it would take to do MPI in the chain() part of autodiff.