Chkptstanr: checkpoint MCMC sampling in Stan

We recently had our package, chkptstanr, accepted on CRAN!

The basic idea is to start and stop the sampler, as needed.

The package was actually a request from AWS: they asked us to build functionality for running Stan on their so-called “spot instances”, which can reduce the cost considerably.

We followed a suggestion on this forum, in particular, from @Bob_Carpenter:

"You’ll need step size, the mass matrix or metric (making sure to get the inversion right), and the last draw to use as an initialization. Then you need to configure NUTS to run with no warmup and just keep using the step size and mass matrix you provide " (Current state of checkpointing in Stan)

This is now done “under the hood”, so the overall user experience is much like using Stan or brms. In fact, the package is compatible with brms (and posterior, bayesplot, etc…), in that, internally, the Stan code is generated, then fitted with cmdstanr, and then the returned object is of class brmsfit. This was important for us, because now all the other brms functions can be used seamlessly (e.g., pp_check()).
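
For anyone curious what the quoted recipe looks like in practice, here is a rough sketch using cmdstanr for a single chain. It is not the package's actual internals. The function name resume_one_chain and its arguments are made up for illustration, and last_draw is assumed to be a named list of parameter values already shaped the way Stan expects for init.

```r
# Sketch only: resume one chain from a saved sampler state with cmdstanr.
# `resume_one_chain` and its arguments are hypothetical names.
resume_one_chain <- function(mod, data, prev_fit, last_draw, n_iter) {
  # Adapted step size of chain 1 from the previous run
  # (as I understand cmdstanr's metadata; worth double-checking)
  step_size <- prev_fit$metadata()$step_size_adaptation[1]

  # Inverse metric of chain 1; for a diagonal metric, matrix = FALSE
  # returns just the diagonal as a vector
  inv_metric <- prev_fit$inv_metric(matrix = FALSE)[[1]]

  mod$sample(
    data = data,
    chains = 1,
    init = list(last_draw),  # one init list per chain
    iter_warmup = 0,         # no further warmup
    adapt_engaged = FALSE,   # keep step size and metric fixed
    step_size = step_size,
    inv_metric = inv_metric,
    iter_sampling = n_iter
  )
}
```

Here mod would be a compiled cmdstanr model and prev_fit the fit from the previous chunk. Extending this to several chains means supplying one init list (and one step size) per chain.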

There are some caveats we came across when developing the package:

(1) There is quite a bit of overhead for extracting the information, saving, etc. This can make model fitting take much longer, so you need to consider just how many checkpoints are really needed.

(2) we found that there must be an initial period that cannot be interrupted. Once past this, at least in our tests, it is very similar to just fitting without stopping.
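
To get a feel for caveat (1), a back-of-the-envelope helper like the one below may be useful. It assumes that warmup and sampling are each simply split into chunks of iter_per_chkpt iterations and that every checkpoint costs a fixed number of seconds. The function name and the 5 second figure are made up for illustration, so measure the real cost on your own model.

```r
# Hypothetical helper: count checkpoints and estimate the added wall time.
checkpoint_overhead <- function(iter_warmup, iter_sampling, iter_per_chkpt,
                                sec_per_chkpt) {
  stopifnot(iter_per_chkpt > 0)
  # Number of chunks needed to cover warmup and sampling
  n_chkpt <- ceiling(iter_warmup / iter_per_chkpt) +
    ceiling(iter_sampling / iter_per_chkpt)
  list(n_chkpt = n_chkpt, overhead_sec = n_chkpt * sec_per_chkpt)
}

# 1000 warmup + 1000 sampling iterations in chunks of 250, with an
# assumed (made-up) 5 seconds of bookkeeping per checkpoint
res <- checkpoint_overhead(1000, 1000, 250, sec_per_chkpt = 5)
res$n_chkpt       # 8
res$overhead_sec  # 40
```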


Hi @donny,

This is absolutely great to see, being the original author of that question and a big proponent of opportunistic computing.

I do have a few questions, if you don’t mind.

  • Do you have any figures of merit for the number of samples to save before checkpointing? I realize this is contingent on data size and model complexity.
  • How long is that initial period mentioned in (2)? Presumably this is number of samples…
  • Do you get the same (exact) results from checkpointed sample as from a chain that is just allowed to run to the end?
1 Like
  • Do you have any figures of merit for the number of samples to save before checkpointing? I realize this is contingent on data size and model complexity.

We don’t. We often have millions of rows and MLMs with many “random” (or varying) effects. In our tests, we found that 150 to 200 seemed to work nicely, as mentioned in a different Stan forum post about finding the “typical set”.

That said, I plan to make a vignette about just this issue to show what can happen…

  • How long is that initial period mentioned in (2)? Presumably this is number of samples…

Over 100, and I bet it does depend on model complexity, etc.

  • Do you get the same (exact) results from checkpointed sample as from a chain that is just allowed to run to the end?

I cannot say if it is “exact”. But we found that the checkpointed samples (and summaries thereof) were very (very) similar to those from a model that was allowed to run to the end. Pretty sure there is an example in the brms vignette that also includes a model that was allowed to run to the end.
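
To illustrate why “exact” is a tricky word here, below is a toy example in base R. It is a plain random-walk Metropolis sampler on a standard normal target, not NUTS and not what chkptstanr does. The point is only that a resumed chain reproduces the uninterrupted one when the random number generator state is saved and restored along with the last draw and the tuning. Whether a Stan-based workflow restores the RNG state between checkpoints is a separate question that I am not answering here.

```r
# Toy random-walk Metropolis with checkpointing (illustration only).
run_chain <- function(n_iter, state) {
  # Restore the RNG state stored in the checkpoint
  assign(".Random.seed", state$seed, envir = globalenv())
  x <- state$x
  draws <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    prop <- x + rnorm(1, 0, state$step)
    # Accept or reject against a standard normal target
    if (log(runif(1)) < dnorm(prop, log = TRUE) - dnorm(x, log = TRUE)) {
      x <- prop
    }
    draws[i] <- x
  }
  # The new checkpoint: last draw, tuning, and current RNG state
  list(draws = draws,
       state = list(x = x, step = state$step,
                    seed = get(".Random.seed", envir = globalenv())))
}

set.seed(123)
start <- list(x = 0, step = 1,
              seed = get(".Random.seed", envir = globalenv()))

# Uninterrupted run of 500 iterations
full <- run_chain(500, start)

# Same run, interrupted after 250 iterations and resumed from disk
part1 <- run_chain(250, start)
chk_file <- tempfile(fileext = ".rds")
saveRDS(part1$state, chk_file)
part2 <- run_chain(250, readRDS(chk_file))

same <- identical(full$draws, c(part1$draws, part2$draws))
```

If the seed element were dropped from the checkpoint, the two halves would still be valid draws from the same target, just not the same numbers as the uninterrupted run.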


Hello, please let me know if this is the wrong place to post this!

I’m attempting to use chkptstanr for fitting some rather lengthy brms models on my University HPC. However, I haven’t managed to get beyond the following error when resuming the fitting of a model that was interrupted:

Error in cmdstanr::cmdstan_model(stan_file = stan_code_path, cpp_options = list(stan_threads = TRUE)) : 
  object 'stan_code_path' not found

I’m also getting this error if the model had completed fitting, and I attempt to resume it after, for which I understand I should be getting the following message:

#> Sampling next checkpoint
#> Checkpointing complete

I’m getting the same error when running the source code for the “checkpointing: brms” vignette locally on my own computer so I can’t figure out what the cause could be!

Here is a reproducible example taken from the above mentioned vignette:

library(chkptstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
library(brms)
library(cmdstanr)

path <- create_folder(folder_name  = "chkpt_folder_m1")

bf_m1 <- bf(formula = count ~ zAge + zBase  + (1 | patient),
            family = poisson())

fit_m1 <- chkpt_brms(
  formula = bf_m1,
  data = epilepsy,
  path  = path,
  iter_warmup = 1000,
  iter_sampling = 1000,
  iter_per_chkpt = 250)

# resuming
fit_m1 <- chkpt_brms(
  formula = bf_m1,
  data = epilepsy,
  path  = path,
  iter_warmup = 1000,
  iter_sampling = 1000,
  iter_per_chkpt = 250)

## Error in cmdstanr::cmdstan_model(stan_file = stan_code_path, cpp_options = list(stan_threads = TRUE)) : 
##  object 'stan_code_path' not found

Session info:

R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cmdstanr_0.5.3   brms_2.17.0      Rcpp_1.0.9       ggplot2_3.3.6    bayesplot_1.9.0  posterior_1.2.2 
[7] chkptstanr_0.1.1

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3        ellipsis_0.3.2          ggridges_0.5.3          markdown_1.1           
  [5] base64enc_0.1-3         rstudioapi_0.13         farver_2.1.0            rstan_2.21.5           
  [9] DT_0.23                 fansi_1.0.3             mvtnorm_1.1-3           diffobj_0.3.5          
 [13] bridgesampling_1.1-2    codetools_0.2-18        brmstools_0.5.3         mnormt_2.1.0           
 [17] doParallel_1.0.17       knitr_1.39              shinythemes_1.2.0       jsonlite_1.8.0         
 [21] shiny_1.7.2             compiler_4.1.1          backports_1.2.1         assertthat_0.2.1       
 [25] Matrix_1.3-4            fastmap_1.1.0           cli_3.3.0               later_1.3.0            
 [29] htmltools_0.5.2         prettyunits_1.1.1       tools_4.1.1             igraph_1.3.2           
 [33] coda_0.19-4             gtable_0.3.0            glue_1.6.2              reshape2_1.4.4         
 [37] clusterGeneration_1.3.7 dplyr_1.0.7             maps_3.4.0              fastmatch_1.1-3        
 [41] raster_3.5-21           vctrs_0.4.1             ape_5.6-2               nlme_3.1-152           
 [45] iterators_1.0.14        crosstalk_1.1.1         tensorA_0.36.2          xfun_0.31              
 [49] stringr_1.4.0           ps_1.6.0                mime_0.12               miniUI_0.1.1.1         
 [53] lifecycle_1.0.1         phangorn_2.9.0          gtools_3.9.2            VoCC_1.0.0             
 [57] terra_1.5-34            MASS_7.3-54             zoo_1.8-10              scales_1.2.0           
 [61] colourpicker_1.1.1      promises_1.2.0.1        Brobdingnag_1.2-7       parallel_4.1.1         
 [65] inline_0.3.19           expm_0.999-6            shinystan_2.6.0         RColorBrewer_1.1-3     
 [69] yaml_2.3.5              geosphere_1.5-14        gridExtra_2.3           loo_2.5.1              
 [73] StanHeaders_2.21.0-7    stringi_1.7.6           dygraphs_1.1.1.6        foreach_1.5.2          
 [77] plotrix_3.8-2           checkmate_2.0.0         phytools_1.0-3          boot_1.3-28            
 [81] pkgbuild_1.3.1          rlang_1.0.3             pkgconfig_2.0.3         matrixStats_0.62.0     
 [85] distributional_0.3.0    evaluate_0.15           lattice_0.20-44         purrr_0.3.4            
 [89] rstantools_2.2.0        htmlwidgets_1.5.4       cowplot_1.1.1           processx_3.5.2         
 [93] tidyselect_1.1.1        plyr_1.8.7              magrittr_2.0.3          R6_2.5.1               
 [97] generics_0.1.0          combinat_0.0-8          DBI_1.1.1               withr_2.5.0            
[101] pillar_1.7.0            xts_0.12.1              scatterplot3d_0.3-41    abind_1.4-5            
[105] sp_1.5-0                tibble_3.1.7            crayon_1.5.1            utf8_1.2.2             
[109] rmarkdown_2.14          grid_4.1.1              data.table_1.14.2       callr_3.7.0            
[113] threejs_0.3.3           CircStats_0.2-6         digest_0.6.29           gdistance_1.3-6        
[117] xtable_1.8-4            httpuv_1.6.5            numDeriv_2016.8-1.1     RcppParallel_5.1.5     
[121] stats4_4.1.1            munsell_0.5.0           quadprog_1.5-8          shinyjs_2.1.0          

I would be immensely grateful if anyone would be able to help me with this!
Thank you in advance, Jakob

I think I’ve come up with a workaround:

Looking through the source code for chkpt_brms (link) it looks like stan_code_path only gets defined if the file /stan_model/model.stan doesn’t already exist as seen in the following section of code:

If you are resuming fitting then this file will already exist and stan_code_path won’t get defined. Adding the following line before attempting to resume fitting seems to solve the problem:
stan_code_path <- paste0(path, "/stan_model/model.stan")

This seems like a bug in the code to me, maybe something that needs looking at?
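
A slightly more defensive version of the workaround, for anyone who wants to put it in a script, might look like this. The helper name is made up, and it assumes the same folder layout as above (a stan_model/model.stan file inside the checkpoint folder).

```r
# Hypothetical helper around the workaround above: only return a path
# when a previous run has already written the model file.
existing_stan_code_path <- function(path) {
  candidate <- file.path(path, "stan_model", "model.stan")
  if (file.exists(candidate)) candidate else NULL
}

# Usage before resuming (path as returned by create_folder()):
# stan_code_path <- existing_stan_code_path(path)
```

The assignment has to happen in the global environment, since the workaround presumably only works because R falls back to looking there when the variable is missing inside the function.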


Hi @donny,

Is this project alive? I love the idea, but the above error appeared with the vignette example on both Windows and Ubuntu. I see it mentioned several times in several places, but the repo has had no activity in the last 2 years.

I implemented the fix and opened a pull request: Fix stan_code_path not found by venpopov · Pull Request #14 · donaldRwilliams/chkptstanr · GitHub

In the meantime, if anyone else wants to use it, the fixed version can be installed from my fork: remotes::install_github("venpopov/chkptstanr")


Thank you very much for working on this package @Ven_Popov. Coincidentally I began trying to fix a couple of bugs earlier this week, so I’m really glad I came across your excellent repository today!


@frank_hezemans glad you found it useful. I just released v0.2.0-alpha, summarized here: Chkptstanr v0.2.0-alpha: checkpoint brms and cmdstanr sampling

However, as I note there, I recommend using it with great caution. The package has a big issue with how it deals with adaptation and warmup, as you can see in the discussion here: What is the point of doing extra "typical" initial warmups not done when not using chkptstanr? · Issue #10 · venpopov/chkptstanr · GitHub

If you use it, I recommend setting iter_adaptation (previously iter_typical) to a much higher value. In essence, you cannot checkpoint during warmup, because the real warmup is the initial adaptation, not what the original package calls iter_warmup.


Hi, very excited to use your package to help with some long-running models. The problem I currently have is working out how to pass chkpt_stan the path to a .stan file containing the model.

The examples I’ve found seem to be based on using make_stancode() from brms to produce a character string object, while if I save the code as a .stan file it doesn’t work.

Reproducible example: First build the relevant folder

path <- create_folder(folder_name = "chkpt_folder_fit1")

#Outputs model as text
make_stancode(bf(formula = count ~ zAge + zBase * Trt + (1|patient),
                              family = poisson()),
                           data = epilepsy)

Save the text model output in a .stan file named “model_code.stan” in the chkpt_folder_fit1 folder. Then:

stan_code <- file.path(path, "model_code.stan")
stan_data <- make_standata(bf(formula = count ~ zAge + zBase * Trt + (1|patient),
                              family = poisson()),
                           data = epilepsy)

# fails with error "object 'stan_code_path' not found"
fit1 <- chkpt_stan(model_code = stan_code, 
                   data = stan_data,
                   iter_warmup = 1000,
                   iter_sampling = 1000,
                   iter_per_chkpt = 250,
                   path = path)

My model files are pretty complex including custom functions so a way to pass the .stan path would be preferred.

Thanks for sharing. That sounds super cool and isn’t something we ever figured out how to do easily for Stan itself. I’m blown away that this was a request from Amazon!

I should’ve said in that original message you need to keep track of where you are in warmup and then reconfigure all the warmup tuning parameters to do the right thing. This is really the thing that’s put us off trying this ourselves the most.

An effective sample size of 100 is usually sufficient for inference. The rate of checkpointing will be determined by failure/time-out rates and the cost of checkpointing relative to redoing the work.
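
As a rough way to put numbers on that trade-off, one textbook rule of thumb is Young's approximation, which puts the checkpoint interval at roughly the square root of twice the checkpoint cost times the mean time between failures. This is a generic rule from the checkpointing literature, not something Stan or chkptstanr computes, and the numbers below are made up.

```r
# Young's approximation for the checkpoint interval (a generic rule of
# thumb, not something chkptstanr or Stan computes).
young_interval <- function(chkpt_cost_sec, mean_time_between_failures_sec) {
  sqrt(2 * chkpt_cost_sec * mean_time_between_failures_sec)
}

# Made-up numbers: 2 s to write a checkpoint, one interruption per hour
interval_sec <- young_interval(2, 3600)
interval_sec  # 120 seconds of sampling between checkpoints
```

Dividing that interval by the time per iteration of your model gives a starting point for the number of iterations per checkpoint.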

Another thing that might be useful for long-running chains is MCMCMonitor:


hi

does this work for you?

replace your line

stan_code <- file.path(path, "model_code.stan")

by

setwd(path)
stan_code<-readLines("model_code.stan")
setwd("../")

that way, both stan_data and stan_code are R objects in the session (the model code itself rather than a file path) and fit1 runs ok (on my machine)
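
if you would rather not change the working directory, something like this should give the same code as a single string. read_stan_code is just a made-up helper name, and I am assuming chkpt_stan is happy with one string containing the whole model, as in the vignette examples.

```r
# Read a .stan file into one character string without changing directory.
# `read_stan_code` is a hypothetical helper, not part of chkptstanr.
read_stan_code <- function(file) {
  paste(readLines(file, warn = FALSE), collapse = "\n")
}

# Usage (path as returned by create_folder()):
# stan_code <- read_stan_code(file.path(path, "model_code.stan"))
```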

also, I found yesterday that cmdstan 2.35 is not working with chkptstanr but cmdstan 2.34.1 is ok, so make sure cmdstanr uses the 2.34.1 version.
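
a small check along these lines could catch the version problem before a long run. it is based only on what I saw (2.34.1 fine, 2.35 not), the helper name is made up, and the cmdstanr calls are left commented out because they need a cmdstan installation.

```r
# Hypothetical check based only on the report above
# (2.34.1 worked, 2.35 did not).
cmdstan_before_2_35 <- function(version) {
  package_version(version) < package_version("2.35.0")
}

ok_old <- cmdstan_before_2_35("2.34.1")  # TRUE
ok_new <- cmdstan_before_2_35("2.35.0")  # FALSE

# With cmdstanr installed, one could then do something like:
# if (!cmdstan_before_2_35(cmdstanr::cmdstan_version())) {
#   cmdstanr::install_cmdstan(version = "2.34.1")
# }
```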

thanks

Greg