Chkptstanr v0.2.0-alpha: checkpoint brms and cmdstanr sampling

This is an update to: Chkptstanr: checkpoint MCMC sampling in Stan

tl;dr

This pre-release fixes most major bugs of the original chkptstanr package, which allows you to stop and resume sampling with brms and cmdstanr via regular checkpoints. However, I now believe that adaptation is performed incorrectly, as discussed here. Thus, I do not recommend using it for any production work. The package now works: you can stop and start sampling at will, but you have to be mindful that warmup might not be doing what you think. I wondered whether to post this at all, but given that the original package is on CRAN, I decided that it is better to make this clear and potentially open up a discussion on how to improve it.

The current detailed Development Roadmap is available here.

Release notes chkptstanr v0.2.0-alpha

New maintainer

  • With the permission of the original creator, Donald R. Williams, Ven Popov becomes the new maintainer of the package. The development will continue at venpopov/chkptstanr.

Major bug fixes

  • Resolve error “stan_code_path” not found when resuming sampling, which completely prevented the core functionality of the package from working (original issue #8)
  • Resolve incorrect detection of existing model binaries, which was causing the package to fail to detect changes to arguments and incorrectly continue to sample (#2)
  • Fix the incorrect combination of checkpoint samples into a single stanfit object, which was causing problems with post-processing methods (#8)
  • chkpt_brms() now works with any brm() arguments, including custom families, data2, etc, rather than giving an error (original issue #15)

New features

  • Add argument “stop_after” to predetermine a stopping checkpoint. Sampling stops once the given number of iterations has completed, e.g. stop_after = 1000 will stop the sampling after 1000 iterations. (original issue #4)
  • Add argument reset to restart sampling. This allows you to reset the checkpointing process and start from the beginning without recompiling the model. Setting reset = TRUE will delete the existing checkpoints but keep the stan model code and binary. This is also available via the new function reset_checkpoints(path), which achieves the same.
  • Return a brmsfit object when sampling is interrupted. Instead of having to reconstruct the samples manually, chkpt_brms() now returns a brmsfit object if post-warmup sampling is stopped for any reason, either programmatically via stop_after, because of an error, or due to a manual abort by the user. The brmsfit object will contain samples until the last successful checkpoint. You can resume sampling from the last checkpoint by rerunning the same code. (#4)
  • No longer necessary to manually create a folder for the checkpoints via “create_folder()” before using chkpt_brms() or chkpt_stan(). create_folder() is deprecated. Please provide the folder name or full path to the argument path directly to chkpt_brms() and a folder to store the checkpoints will be created automatically. This significantly simplifies the workflow.
  • You can now reuse checkpoint folders. The path argument to chkpt_brms() and chkpt_stan() no longer gives an error if a folder already exists, allowing a reusable programmatic workflow
  • Checkpoint folders can be specified with a nested path. The path argument to chkpt_brms() and chkpt_stan() works with nested folder names, e.g. "output/checkpoints1", even if output/ does not exist
  • You can now use any formula that brm accepts. Remove an unnecessary check that the formula should be a brmsformula object, allowing for more flexibility in the input to chkpt_brms() such as mvbrmsformula objects or other arguments that brm() accepts (original issue #9)
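
The workflow described by these features could look roughly like the sketch below. It is wrapped in a function so that nothing is compiled or sampled until you call it. It assumes chkptstanr v0.2.0-alpha, brms, and a working CmdStan installation. The formula, the epilepsy data set from brms, the iteration counts, and the function name checkpoint_demo are illustrative assumptions, not part of the release notes.

```r
# Sketch only: assumes chkptstanr v0.2.0-alpha, brms, and CmdStan are installed.
# checkpoint_demo() is a hypothetical wrapper written for this illustration.
checkpoint_demo <- function(path = "output/checkpoints1") {
  # First run: the nested folder is created automatically and sampling stops
  # once 1500 iterations have completed. Extra arguments such as `family`
  # are passed on to brm().
  fit_partial <- chkptstanr::chkpt_brms(
    formula = count ~ zAge + (1 | patient),
    data = brms::epilepsy,
    family = poisson(),
    iter_warmup = 1000,
    iter_sampling = 2000,
    iter_per_chkpt = 250,
    stop_after = 1500,
    path = path
  )

  # Second run: the same call without stop_after resumes from the last
  # successful checkpoint stored in `path`.
  fit_full <- chkptstanr::chkpt_brms(
    formula = count ~ zAge + (1 | patient),
    data = brms::epilepsy,
    family = poisson(),
    iter_warmup = 1000,
    iter_sampling = 2000,
    iter_per_chkpt = 250,
    path = path
  )

  # Delete the checkpoints but keep the Stan code and the compiled binary,
  # so a later run starts from the beginning without recompiling.
  chkptstanr::reset_checkpoints(path)

  list(partial = fit_partial, full = fit_full)
}
```

Calling checkpoint_demo() would run both stages. Passing reset = TRUE to chkpt_brms() should have the same effect as the reset_checkpoints() call, according to the notes above.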

Minor bug fixes

  • Fix an incorrect error message when providing iter_warmup, iter_sampling, or iter_warmup+iter_sampling not divisible by iter_per_chkpt. The error message now correctly states that the number of iterations per checkpoint must be a divisor of all three quantities.
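
To make the divisibility rule concrete, here is a minimal base R sketch of such a check. The helper check_chkpt_divisibility() is hypothetical and written only for illustration. It is not the package's internal code, and its error message is not the package's message.

```r
# Hypothetical helper, not part of chkptstanr: checks that iter_per_chkpt
# divides the warmup count, the sampling count, and their total.
check_chkpt_divisibility <- function(iter_warmup, iter_sampling, iter_per_chkpt) {
  stopifnot(iter_per_chkpt > 0)
  quantities <- c(
    iter_warmup = iter_warmup,
    iter_sampling = iter_sampling,
    total = iter_warmup + iter_sampling
  )
  # Remainders keep the names of `quantities`
  remainders <- quantities %% iter_per_chkpt
  bad <- names(remainders)[remainders != 0]
  if (length(bad) > 0) {
    stop(
      "iter_per_chkpt (", iter_per_chkpt, ") must be a divisor of: ",
      paste(bad, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# 500 divides 1000, 4000, and 5000, so this passes silently
check_chkpt_divisibility(1000, 4000, 500)

# 300 divides none of them, so this raises an error; capture its message
msg <- tryCatch(
  check_chkpt_divisibility(1000, 4000, 300),
  error = function(e) conditionMessage(e)
)
```

Note that if iter_per_chkpt divides both iter_warmup and iter_sampling, it necessarily divides their sum as well, so the third condition only matters for the wording of the message.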

Other changes

  • Automated testing for package stability. Set up initial automated testing and continuous integration with GitHub Actions to ensure the package is always working as expected
  • Change default number of chains from 2 to 4 to be consistent with brms defaults
  • Rename argument “iter_typical” to “iter_adaptation” to better reflect what this stage is doing. iter_typical is deprecated. In future releases, the adaptation procedure will be rewritten and this argument will be completely removed (see #10)

This looks great!

@Ven_Popov if this works with cmdstanr (when just using cmdstanr, not only as a backend to brms) then would you be interested in adding something to one of the cmdstanr vignettes about this with a link to the package? For example, @wlandau added a section in one of the vignettes about adding pre-compiled Stan models to R packages that links to his instantiate package, so we could do something similar with your package to help users find it. (Or maybe we wait until after you’re past the alpha version if you plan on making a bunch of changes?) What do you think?

It does work with cmdstanr, but I think we should wait. First, here’s an example, and afterwards why I think we should wait:

Example

Currently you can run a cmdstanr model with checkpointing like this:

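library(chkptstanr)

# A simple normal model with unknown mean and standard deviation
stan_code <- "
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;  // a standard deviation must be positive
}
model {
  y ~ normal(mu, sigma);
}
"

stan_data <- list(N = 10, y = rnorm(10))

# 5000 iterations in total, split into 10 checkpoints of 500 iterations each
fit1 <- chkpt_stan(
  model_code = stan_code,
  data = stan_data,
  iter_warmup = 1000,
  iter_sampling = 4000,
  iter_per_chkpt = 500,
  stop_after = 2000,  # stop once 2000 iterations have completed
  path = "local/chkpt_folder_fit1"
)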

This will stop the sampling after 2000 iterations.

You get the following output:

Sampling will stop after checkpoint 4
Compiling Stan program...
Initial Warmup (Typical Set)
Chkpt: 1 / 10; Iteration: 500 / 5000 (warmup)
Chkpt: 2 / 10; Iteration: 1000 / 5000 (warmup)
Chkpt: 3 / 10; Iteration: 1500 / 5000 (sample)
Chkpt: 4 / 10; Iteration: 2000 / 5000 (sample)
Stopping after 4 checkpoints
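
The checkpoint numbers in this output follow from simple arithmetic, which the base R sketch below reproduces for 500 iterations per checkpoint, as in the printed progress lines. The helper chkpt_schedule() is hypothetical and only mirrors the printed output. It is not how the package computes its schedule internally, and it assumes that stop_after counts warmup iterations too, as the output suggests.

```r
# Hypothetical helper, not part of chkptstanr: reconstructs which checkpoints
# run and how each one is labelled, given the iteration settings.
chkpt_schedule <- function(iter_warmup, iter_sampling, iter_per_chkpt,
                           stop_after = NULL) {
  total_iter <- iter_warmup + iter_sampling
  n_chkpt <- total_iter / iter_per_chkpt
  # Cumulative iteration count at the end of each checkpoint
  end_iter <- seq_len(n_chkpt) * iter_per_chkpt
  schedule <- data.frame(
    chkpt = seq_len(n_chkpt),
    iteration = end_iter,
    stage = ifelse(end_iter <= iter_warmup, "warmup", "sample")
  )
  # Keep only the checkpoints that finish at or before stop_after
  if (!is.null(stop_after)) {
    schedule <- schedule[schedule$iteration <= stop_after, ]
  }
  schedule
}

schedule <- chkpt_schedule(
  iter_warmup = 1000,
  iter_sampling = 4000,
  iter_per_chkpt = 500,
  stop_after = 2000
)
print(schedule)
```

With these settings there are 5000 / 500 = 10 checkpoints in total, the first two fall within warmup, and stop_after = 2000 corresponds to checkpoint 2000 / 500 = 4.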

At the moment, this does not return a stanfit object, but it is on my agenda to do that (just like I have done for chkpt_brms). One could extract the draws:

draws <- combine_chkpt_draws(object = fit1)

or continue sampling:

fit1 <- chkpt_stan(
  model_code = stan_code,
  data = stan_data,
  iter_warmup = 1000,
  iter_sampling = 4000,
  iter_per_chkpt = 500,
  path = "local/chkpt_folder_fit1"
)

with output:

Model executable is up to date!
Chkpt: 5 / 10; Iteration: 2500 / 5000 (sample)
Chkpt: 6 / 10; Iteration: 3000 / 5000 (sample)
Chkpt: 7 / 10; Iteration: 3500 / 5000 (sample)
Chkpt: 8 / 10; Iteration: 4000 / 5000 (sample)
Chkpt: 9 / 10; Iteration: 4500 / 5000 (sample)
Chkpt: 10 / 10; Iteration: 5000 / 5000 (sample)
Checkpointing complete

I don’t think it’s ready for general use because of this issue. In summary, I have discovered that:

  • There are three sampling stages:

    • initial warmup (controlled by iter_adaptation, with default 150)
    • secondary warmup (controlled by iter_warmup)
    • sampling (controlled by iter_sampling)
  • chkptstanr does adaptation only during the initial warmup, and it does not do any checkpointing during this period. The inv_metric, step_size and final draws (used as inits) are passed to the next stage, and only then does checkpointing begin. Even though there is an argument “iter_warmup”, adapt_engaged is FALSE during those checkpoints, so the sampler only uses the adaptation from the initial period. The only thing that distinguishes these warmup checkpoints from the sampling checkpoints that follow is that they are not saved.
  • Thus, to replicate a cmdstanr model that uses iter_warmup=1000, iter_sampling=1000, you would have to set iter_adaptation=1000, iter_warmup=0, iter_sampling=1000. This will correctly do the adaptation, but you will only get checkpointing during the sampling process
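
The mapping in the last bullet can be written down as a small base R helper. The name as_chkpt_iters() is hypothetical and introduced only for this illustration. It simply encodes the translation described above and does not call chkptstanr.

```r
# Hypothetical helper, not part of chkptstanr: translates the iteration
# settings of a plain cmdstanr run into the settings that, per the findings
# above, give the same amount of adaptation under chkptstanr v0.2.0-alpha.
as_chkpt_iters <- function(iter_warmup, iter_sampling) {
  list(
    iter_adaptation = iter_warmup,  # all adaptation happens in the initial warmup
    iter_warmup = 0,                # skip the non-adapting secondary warmup
    iter_sampling = iter_sampling   # checkpointing happens only here
  )
}

iters <- as_chkpt_iters(iter_warmup = 1000, iter_sampling = 1000)
str(iters)

# Sketch of how the result could be passed on (not run here; this would need
# chkptstanr and CmdStan, and the path below is an illustrative assumption):
# fit2 <- do.call(chkpt_stan, c(
#   list(model_code = stan_code, data = stan_data,
#        iter_per_chkpt = 250, path = "local/chkpt_folder_fit2"),
#   iters
# ))
```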

Thus, checkpointing works, but the original package was misleading in how it treated warmup. Before this is ready for general use, I want to try to get checkpointing to work during the adaptation period, get rid of the extra warmup stage, and add some niceties to the checkpointing with cmdstanr just like I have for the brms version (at the very least, return a CmdStanFit object).

Once I get that to work properly, it would be awesome to do as you suggested!

Now thinking of it, I could also add an option to just pass it a ‘CmdStanModel’ so that it could work like this:

m1 <- cmdstanr::cmdstan_model(stan_file = some_file)
fit <- sample_checkpoint(model = m1,
                         data = stan_data, 
                         chains = 4, 
                         parallel_chains = 4, 
                         iter_warmup = 1000, 
                         iter_sampling = 4000,
                         iter_per_chkpt = 500)

That could be nice to align it with how you would use m1$sample


But maybe if I get this to work, there would be no need for an extra package and we could work to integrate it into cmdstanr as options to the sample method? Is that something you would be interested in (provided it actually works and is extensively tested), or would you prefer to keep this functionality external? As much as I like having my own package, I don’t mind if it makes it into the base code and simplifies the usage.

I tentatively suggest keeping it separate, which retains the flavor of cmdstanr as a lightweight wrapper to cmdstan. If we make this a cmdstanr method, then updates to the default adaptation schedule in cmdstan will forever require updates in cmdstanr.

Ok great, sounds good.

Yeah I would lean towards keeping it separate, but we’ll promote it to make sure users are aware that it’s available.
