Chkptstanr: checkpoint MCMC sampling in Stan

We recently had our package, chkptstanr, accepted on CRAN!

The basic idea is to start and stop the sampler as needed.

The package grew out of a request from AWS: they asked us to build functionality for running Stan on their “spot instances”, which can reduce the cost considerably.

We followed a suggestion made on this forum by @Bob_Carpenter:

"You’ll need step size, the mass matrix or metric (making sure to get the inversion right), and the last draw to use as an initialization. Then you need to configure NUTS to run with no warmup and just keep using the step size and mass matrix you provide " (Current state of checkpointing in Stan)

This is now done “under the hood”, so the overall user experience is much like using Stan or brms. In fact, the package is compatible with brms (and posterior, bayesplot, etc.): internally, the Stan code is generated and fitted with cmdstanr, and the returned object is of class brmsfit. This was important for us, because all the other brms functions (e.g., pp_check()) can then be used seamlessly.
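
As a rough illustration of the recipe quoted above (not the package internals), here is a toy sketch in base R. It swaps NUTS for a random-walk Metropolis sampler on a standard normal target, so there is a fixed step size and a last draw but no mass matrix. The names run_chunk(), state and chkpt_file are made up for this example.

```r
# Toy random-walk Metropolis sampler; a stand-in for NUTS, for illustration only.
run_chunk <- function(state, n_iter) {
  # Restore the saved RNG state so a resumed chunk continues the same random stream.
  assign(".Random.seed", state$rng, envir = globalenv())
  draws <- numeric(n_iter)
  x <- state$last_draw
  for (i in seq_len(n_iter)) {
    proposal <- x + rnorm(1, sd = state$step_size)  # fixed step size, no adaptation
    # Accept with probability min(1, p(proposal) / p(x)), computed on the log scale.
    if (log(runif(1)) < dnorm(proposal, log = TRUE) - dnorm(x, log = TRUE)) {
      x <- proposal
    }
    draws[i] <- x
  }
  state$last_draw <- x
  state$rng <- get(".Random.seed", envir = globalenv())
  list(draws = draws, state = state)
}

set.seed(1)
init <- list(last_draw = 0, step_size = 1.5, rng = .Random.seed)

# Uninterrupted run of 400 iterations.
full <- run_chunk(init, 400)$draws

# The same run as 4 checkpoints of 100 iterations, with the state saved to disk.
chkpt_file <- tempfile(fileext = ".rds")
state <- init
chunked <- numeric(0)
for (k in 1:4) {
  out <- run_chunk(state, 100)
  saveRDS(out$state, chkpt_file)  # checkpoint: step size, last draw, RNG state
  state <- readRDS(chkpt_file)    # resume: reload the saved state
  chunked <- c(chunked, out$draws)
}

identical(full, chunked)
```

In this toy the chunked chain matches the uninterrupted one only because the RNG state is saved along with the last draw. That says nothing about what Stan does when it is restarted; it only shows which pieces of state a checkpoint has to carry.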

There are some caveats we came across when developing the package:

(1) There is quite a bit of overhead from extracting the information, saving it, etc. This can make model fitting take much longer, so you need to consider just how many checkpoints are needed.

(2) We found that there must be an initial period that cannot be interrupted. Once past it, at least in our tests, the results are very similar to fitting without stopping.
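
For caveat (1), a back-of-the-envelope estimate can help when choosing the number of checkpoints. The sketch below assumes the overhead is a fixed cost per checkpoint, which is a simplification. The helper chkpt_overhead() and the 20-second figure are hypothetical, not measurements.

```r
# Rough overhead estimate, assuming a fixed cost per checkpoint (a simplification).
chkpt_overhead <- function(total_iter, iter_per_chkpt, secs_per_chkpt) {
  # Require the iterations to split evenly into checkpoints.
  stopifnot(total_iter %% iter_per_chkpt == 0)
  n_chkpt <- total_iter / iter_per_chkpt
  n_chkpt * secs_per_chkpt
}

chkpt_overhead(2000, 250, 20)  # 8 checkpoints x 20 s = 160 s
chkpt_overhead(2000, 100, 20)  # 20 checkpoints x 20 s = 400 s
```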

Hi @donny,

As the original author of that question and a big proponent of opportunistic computing, I think this is absolutely great to see.

I do have a few questions, if you don’t mind.

  • Do you have any figures of merit for the number of samples to save before checkpointing? I realize this is contingent on data size and model complexity.
  • How long is that initial period mentioned in (2)? Presumably this is a number of samples…
  • Do you get the same (exact) results from a checkpointed sample as from a chain that is just allowed to run to the end?

  • Do you have any figures of merit for the number of samples to save before checkpointing? I realize this is contingent on data size and model complexity.

We don’t. Our data often have millions of rows, and our models are multilevel models (MLMs) with many “random” (or varying) effects. In our tests, we found that 150 to 200 seemed to work nicely, as mentioned in a different Stan forum post about finding the “typical set”.

That said, I plan to make a vignette about just this issue to show what can happen…

  • How long is that initial period mentioned in (2)? Presumably this is a number of samples…

Over 100, and I bet it does depend on model complexity, etc.

  • Do you get the same (exact) results from a checkpointed sample as from a chain that is just allowed to run to the end?

I cannot say if it is “exact”. But we found that the checkpointed samples (and summaries thereof) were very (very) similar to those from a model that was allowed to run to the end. I am pretty sure there is an example in the brms vignette that also includes a model that was allowed to run to the end.

Hello, please let me know if this is the wrong place to post this!

I’m attempting to use chkptstanr to fit some rather lengthy brms models on my university’s HPC cluster. However, I haven’t managed to get beyond the following error when resuming the fitting of a model that was interrupted:

Error in cmdstanr::cmdstan_model(stan_file = stan_code_path, cpp_options = list(stan_threads = TRUE)) : 
  object 'stan_code_path' not found

I also get this error when the model has already finished fitting and I then try to resume it. As I understand it, I should instead get the following messages in that case:

#> Sampling next checkpoint
#> Checkpointing complete

I’m getting the same error when running the source code for the “checkpointing: brms” vignette locally on my own computer, so I can’t figure out what the cause could be!

Here is a reproducible example taken from the above-mentioned vignette:

library(chkptstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
library(brms)
library(cmdstanr)

path <- create_folder(folder_name  = "chkpt_folder_m1")

bf_m1 <- bf(formula = count ~ zAge + zBase  + (1 | patient),
            family = poisson())

fit_m1 <- chkpt_brms(
  formula = bf_m1,
  data = epilepsy,
  path  = path,
  iter_warmup = 1000,
  iter_sampling = 1000,
  iter_per_chkpt = 250)

# resuming
fit_m1 <- chkpt_brms(
  formula = bf_m1,
  data = epilepsy,
  path  = path,
  iter_warmup = 1000,
  iter_sampling = 1000,
  iter_per_chkpt = 250)

## Error in cmdstanr::cmdstan_model(stan_file = stan_code_path, cpp_options = list(stan_threads = TRUE)) : 
##  object 'stan_code_path' not found

Session info:

R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] cmdstanr_0.5.3   brms_2.17.0      Rcpp_1.0.9       ggplot2_3.3.6    bayesplot_1.9.0  posterior_1.2.2 
[7] chkptstanr_0.1.1

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3        ellipsis_0.3.2          ggridges_0.5.3          markdown_1.1           
  [5] base64enc_0.1-3         rstudioapi_0.13         farver_2.1.0            rstan_2.21.5           
  [9] DT_0.23                 fansi_1.0.3             mvtnorm_1.1-3           diffobj_0.3.5          
 [13] bridgesampling_1.1-2    codetools_0.2-18        brmstools_0.5.3         mnormt_2.1.0           
 [17] doParallel_1.0.17       knitr_1.39              shinythemes_1.2.0       jsonlite_1.8.0         
 [21] shiny_1.7.2             compiler_4.1.1          backports_1.2.1         assertthat_0.2.1       
 [25] Matrix_1.3-4            fastmap_1.1.0           cli_3.3.0               later_1.3.0            
 [29] htmltools_0.5.2         prettyunits_1.1.1       tools_4.1.1             igraph_1.3.2           
 [33] coda_0.19-4             gtable_0.3.0            glue_1.6.2              reshape2_1.4.4         
 [37] clusterGeneration_1.3.7 dplyr_1.0.7             maps_3.4.0              fastmatch_1.1-3        
 [41] raster_3.5-21           vctrs_0.4.1             ape_5.6-2               nlme_3.1-152           
 [45] iterators_1.0.14        crosstalk_1.1.1         tensorA_0.36.2          xfun_0.31              
 [49] stringr_1.4.0           ps_1.6.0                mime_0.12               miniUI_0.1.1.1         
 [53] lifecycle_1.0.1         phangorn_2.9.0          gtools_3.9.2            VoCC_1.0.0             
 [57] terra_1.5-34            MASS_7.3-54             zoo_1.8-10              scales_1.2.0           
 [61] colourpicker_1.1.1      promises_1.2.0.1        Brobdingnag_1.2-7       parallel_4.1.1         
 [65] inline_0.3.19           expm_0.999-6            shinystan_2.6.0         RColorBrewer_1.1-3     
 [69] yaml_2.3.5              geosphere_1.5-14        gridExtra_2.3           loo_2.5.1              
 [73] StanHeaders_2.21.0-7    stringi_1.7.6           dygraphs_1.1.1.6        foreach_1.5.2          
 [77] plotrix_3.8-2           checkmate_2.0.0         phytools_1.0-3          boot_1.3-28            
 [81] pkgbuild_1.3.1          rlang_1.0.3             pkgconfig_2.0.3         matrixStats_0.62.0     
 [85] distributional_0.3.0    evaluate_0.15           lattice_0.20-44         purrr_0.3.4            
 [89] rstantools_2.2.0        htmlwidgets_1.5.4       cowplot_1.1.1           processx_3.5.2         
 [93] tidyselect_1.1.1        plyr_1.8.7              magrittr_2.0.3          R6_2.5.1               
 [97] generics_0.1.0          combinat_0.0-8          DBI_1.1.1               withr_2.5.0            
[101] pillar_1.7.0            xts_0.12.1              scatterplot3d_0.3-41    abind_1.4-5            
[105] sp_1.5-0                tibble_3.1.7            crayon_1.5.1            utf8_1.2.2             
[109] rmarkdown_2.14          grid_4.1.1              data.table_1.14.2       callr_3.7.0            
[113] threejs_0.3.3           CircStats_0.2-6         digest_0.6.29           gdistance_1.3-6        
[117] xtable_1.8-4            httpuv_1.6.5            numDeriv_2016.8-1.1     RcppParallel_5.1.5     
[121] stats4_4.1.1            munsell_0.5.0           quadprog_1.5-8          shinyjs_2.1.0          

I would be immensely grateful if anyone would be able to help me with this!
Thank you in advance, Jakob

I think I’ve come up with a workaround:

Looking through the source code for chkpt_brms, it looks like stan_code_path only gets defined if the file /stan_model/model.stan doesn’t already exist.

If you are resuming fitting, this file will already exist, so stan_code_path won’t get defined. Adding the following line before attempting to resume fitting seems to solve the problem:
stan_code_path <- paste0(path, "/stan_model/model.stan")

This seems like a bug in the code to me. Maybe it is something that needs looking at?
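
To illustrate the pattern I think is at play, here is a minimal base R sketch. It is not the actual chkptstanr source: setup_model() and its body are hypothetical and only mimic a variable that is created inside an if branch but used afterwards. It also shows why the workaround helps: when R cannot find a variable inside a function, the lookup eventually falls back to the global environment.

```r
# Hypothetical stand-in for the setup step; NOT the chkptstanr source.
setup_model <- function(path) {
  model_file <- file.path(path, "stan_model", "model.stan")
  if (!file.exists(model_file)) {
    # First run: the variable is created only inside this branch.
    stan_code_path <- model_file
    dir.create(dirname(stan_code_path), recursive = TRUE, showWarnings = FALSE)
    writeLines("// placeholder model code", stan_code_path)
  }
  # On a resumed run the branch is skipped, so this lookup fails unless a
  # variable of the same name exists further up, e.g. in the global environment.
  stan_code_path
}

path <- tempfile("chkpt_demo")

# First call: the file does not exist yet, so the branch runs and the call works.
first <- setup_model(path)

# Second call ("resuming"): the file exists, the branch is skipped, lookup fails.
second <- tryCatch(setup_model(path), error = conditionMessage)

# Workaround from this post: define the variable before resuming.
stan_code_path <- paste0(path, "/stan_model/model.stan")
third <- setup_model(path)
```

If this is indeed what happens, a fix inside the function would presumably be to define stan_code_path before the if, so that it exists on both the first run and a resumed run.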