There’s a new, experimental warmup algorithm on the block! It’s available as an r package here: campfire (get it? campfires are for warming up)
The the big features are:
- Adaptive length warmup – only run warmup as long as you need
- Don’t throw away warmup samples unnecessarily
- Use information from multiple chains during warmup
- Automatically switch between dense, diagonal, and hessian-based metrics
This is not suitable for deployment (you’ll notice it has zero tests). This is exclusively for testing a new proposal for adaptive warmup.
If it works for people, then we can get it integrated into Stan and make it an actual production thing, until then it’s experimental.
So this is why we need people to test it and validate it! Hooray!
How the adaptation works
Before we go into the new stuff, let’s go through the current stuff.
The current adaptation is broken down into windows by number of draws:
75 | 25 | 50 | 100 | 200 | 500 | 50
So that’s 75 draws in the first window, 25 in the second, 50 in the third, etc.
Stan’s timestep is adapted at every draw in all of the windows. The timestep adaptation happens faster in the first 75 and last 50 draws.
At the end of each of these windows (where the bars are) this thing called the euclidean metric (or the mass matrix) is recomputed. This was a linear transformation of the variables that basically makes the HMC algorithm work faster. By default Stan uses a diagonal covariance estimate for this.
This works well enough. But we wanna improve it! The downsides of this are:
- Our warmup is the same length for any model unless we carefully adjust it. We either waste time adapting too much, or do not adapt enough. Either way it is not ideal.
- At the end of each warmup window we toss all the previous adaptation. This seems wasteful
- We are currently adapting each chain individually
- Picking between dense and diagonal adaptation isn’t always obvious
Instead of using variable width windows we used a fixed window size. As a default we use 100 draws, so we get something like:
100 | 100 | 100 | 100 ...
At the end of each window the metric gets recomputed (just like before). The difference now is instead of just using draws from the last window we want to potentially use all draws. To do this we take advantage of our effective sample size (ESS) estimates.
If we label our windows with letters:
A | B | C | D ...
we expect the draws in A to somehow be worse than those in B or C or D (because the later windows will be closer to the answer or there will be more adaptation). So a simple way to ask if the draws in A are inconsistent with BCD is to check Rhat (or ESS) with all the windows (ABCD) vs. just BCD.
We can go ahead and compare all combinations of ABCD, BCD, CD, and D and figure out the best one by just picking whichever has the largest ESS.
Similarly, we can decide when to stop adapting when we have hit sufficient ESS on all parameters and the Rhats are suitably low. Because we lean on Rhat and ESS, we only do this warmup with multiple chains. Because of this, when it’s time to compute a new metric, we use samples from both chains to do this!
That’s it! I also threw all the stuff from https://arxiv.org/abs/1905.11916 in there to automatically switch between different metrics and whatnot but these are really separable things.
This requires the posterior and cmdstanr packages, so gotta install those first:
And then for cmdstan (which requires posterior be installed first):
If you haven’t installed cmdstanr before you can use:
To download and build an appropriate cmdstan for use in cmdstanr (don’t need anything special for this).
The warmup package can be installed with:
Campfire manages the warmup part of adaptation. So you call campfire::warmup with the same arguments as a normal cmdstanr model$sample call:
library(campfire) library(rstan) model_path = "accel_splines.stan" data_env = new.env(parent = baseenv()) source("accel_splines.data.R", local = data_env) data = as.list.environment(data_env) stan_fit = stan(model_path, data = data, iter = 1) out = warmup(model_path, stan_fit, data = data, num_chains = 2, num_cores = 2, print_stdout = FALSE)
And you can take the output of a warmup call and use it to sample the actual model. It’ll use the metric computed in campfire::warmup, but it’ll still do 50 steps of timestep adaptation at the beginning:
model = cmdstanr::cmdstan_model(model_path) fit = do.call(model$sample, out$args)
You should be able to pass most arguments you would pass to sample to warmup and it’ll just carry them through (like number of chains, amount of sampling, etc). There are a couple extra arguments to warmup:
- print_stdout (boolean, default FALSE) whether to print the fits running – they produce lots of output
- window_size (integer, default 100) size of warmup windows
- max_num_windows (integer, default 10) maximum number of warmup windows to use
- target_rhat (double, default 1.05) all parameters must have an rhat less than this to consider warmup finished
- target_ess (integer, default 50) all parameters must have a bulk_ESS greater than this to consider warmup finished
The out object is a list with 3 things: args, which are arguments suitable for calling sample in cmdstanr with, usamples, which are the unconstrained warmup samples computed in adaptation, and nleapfrogs, which is the total number of leapfrog steps taken in the calculation.
While adaptation is happening you’ll see two types of things printing. The first are tibbles of condition number:
 "Metric calculation info (lower c better, last is always picked, minimum is 1.0):" # A tibble: 7 x 2 name c_hybrid <chr> <dbl> 1 diag 325. 2 dense 207. 3 hess1_wish 207. 4 hess2_wish 207. 5 lw2004 200. 6 hess2 118. 7 hess1 117.
So at each adaptation we’re proposing a few different metrics and picking between them based on this condition number thing. Lower is better, the reasoning behind this is explained here (and some docs on what those different metric options are): https://arxiv.org/abs/1905.11916
The other is a table:
 "Adaptive warmup debug info (figure out how many initial windows to drop):" # A tibble: 4 x 4 drop_first_n_windows picked max_Rhat min_Bulk_ESS <int> <chr> <dbl> <dbl> 1 0 "" 1.26 8 2 1 "" 1.15 24 3 2 picked 1.04 64 4 3 "" 1.09 23
Which is used for the decision making of how many windows of warmup to use/drop. In this case we have decided to drop the first 2 warmup windows. Dropping these two warmups gives us a maximum Rhat of 1.04 and a minimum Bulk_ESS of 64. We pick adaptations based on maximizing Bulk_ESS. We can look at a plot of the warmup samples of one chain:
plot(out$usamples[,1,2]) # plot warmup draws for chain 1, parameter 2
It sure seems like dropping those first two hundred samples was a good idea!
All feedback is good. There’s probably lots of stuff wrong with this so lemme know. Some other questions are:
Does this adaptation work for your model?
Does this adaptation result in the sampling stage of your model running more quickly?
Does ESS or Rhat go up or down?
Does this adaptation fail for your model, especially if it works with the old adaptation mechanism?
How many warmup windows does it take for your model to decide it is warmed up sufficiently?
The way the package is coded right now, the warmup stage itself might take longer to run even with fewer draws cause it’s just not efficient. Also there are the 50 extra timestep adaptation steps that are not included in the campfire$warmup call.
If you have a model that behaves weirdly and can share it with data, please do!
So the ideas for the different bits here came from a bunch of people from this so I’m just gonna leave a pile of names down here, hoooray! Thanks for the ideas (edit: and help on everything else) everyone!: