Chkptstanr: checkpoint MCMC sampling in Stan

We recently had our package, chkptstanr, accepted on CRAN!

The basic idea is to start and stop the sampler, as needed.

The package actually began as a request from AWS: they asked us to build some functionality for using Stan with their so-called “spot instances” (which can reduce compute costs considerably).

We followed a suggestion on this forum, in particular, from @Bob_Carpenter:

"You’ll need step size, the mass matrix or metric (making sure to get the inversion right), and the last draw to use as an initialization. Then you need to configure NUTS to run with no warmup and just keep using the step size and mass matrix you provide" (Current state of checkpointing in Stan)
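That recipe can be sketched directly with cmdstanr. This is a hand-rolled sketch of the idea, not chkptstanr’s actual internals; `model.stan`, `stan_data`, and the init-list construction are placeholders you would fill in for your own model:

```r
library(cmdstanr)
library(posterior)

mod <- cmdstan_model("model.stan")  # placeholder model

# Segment 1: run warmup (plus a short sampling run) to obtain adaptation info.
fit1 <- mod$sample(data = stan_data, chains = 1, seed = 1,
                   iter_warmup = 1000, iter_sampling = 100)

# Save the three ingredients from the quote above:
step_size  <- fit1$metadata()$step_size_adaptation  # adapted step size per chain
inv_metric <- fit1$inv_metric(matrix = FALSE)       # inverse metric (diagonal), per chain
draws      <- as_draws_df(fit1$draws())
last_draw  <- draws[nrow(draws), ]  # reshape into a named init list (model-specific)

# Segment 2: resume with no warmup and adaptation switched off,
# reusing the saved step size and inverse metric.
fit2 <- mod$sample(data = stan_data, chains = 1, seed = 2,
                   iter_warmup = 0, iter_sampling = 100,
                   adapt_engaged = FALSE,
                   step_size = step_size[1],
                   inv_metric = inv_metric[[1]],
                   init = list(init_from_last_draw))  # built from last_draw
```

The draws from the two segments are then concatenated; repeating segment 2 gives as many checkpoints as needed.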

This is now done “under the hood”, so the overall user experience is much like using Stan or brms directly. In fact, the package is compatible with brms (and posterior, bayesplot, etc.): internally, the Stan code is generated by brms, fitted with cmdstanr, and the returned object is of class brmsfit. This was important for us, because all the other brms functions (e.g., pp_check()) can then be used seamlessly.
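On the user-facing side, a minimal usage sketch looks roughly like the following (argument names such as `iter_per_chkpt` are from my reading of the package; check `?chkpt_brms` for the exact signature):

```r
library(chkptstanr)
library(brms)

# Folder where checkpoint files are stored between runs
path <- create_folder(folder_name = "chkpt_demo")

fit <- chkpt_brms(
  formula = bf(count ~ zAge + (1 | patient)),  # any brms formula
  data = brms::epilepsy,
  path = path,
  iter_warmup = 1000,
  iter_sampling = 1000,
  iter_per_chkpt = 250  # checkpoint every 250 iterations
)

# The returned object is a brmsfit, so the usual brms tooling works:
pp_check(fit)
summary(fit)
```

If the run is interrupted (e.g., a spot instance is reclaimed), calling `chkpt_brms()` again with the same `path` resumes from the last completed checkpoint.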

There are some caveats we came across when developing the package:

(1) There is quite a bit of overhead for extracting the information, saving it, etc., which can make model fitting take much longer, so it is worth considering just how many checkpoints are really needed.

(2) We found that there must be an initial period that cannot be interrupted. Once past this, at least in our tests, the results are very similar to fitting without stopping.


Hi @donny,

This is absolutely great to see, being the original author of that question and a big proponent of opportunistic computing.

I do have a few questions, if you don't mind.

  • Do you have any figures of merit for the number of samples to save before checkpointing? I realize this is contingent on data size and model complexity.
  • How long is that initial period mentioned in (2)? Presumably this is number of samples…
  • Do you get the same (exact) results from checkpointed sample as from a chain that is just allowed to run to the end?
  • Do you have any figures of merit for the number of samples to save before checkpointing? I realize this is contingent on data size and model complexity.

We don’t. We often have millions of rows and multilevel models with many “random” (or varying) effects. In our tests, we found that 150 to 200 iterations per checkpoint seemed to work nicely, as mentioned on a different Stan forum post about finding the “typical set”.

That said, I plan to make a vignette about just this issue to show what can happen…

  • How long is that initial period mentioned in (2)? Presumably this is number of samples…

Over 100, and I bet it does depend on model complexity, etc.

  • Do you get the same (exact) results from checkpointed sample as from a chain that is just allowed to run to the end?

I cannot say whether it is “exact”. But we found that the checkpointed samples (and the summaries thereof) were very (very) similar to those from a model that was allowed to run to the end. Pretty sure there is an example in the brms vignette that also includes a model that was allowed to run to the end.
