Current state of checkpointing in Stan

bbbales2 · March 9, 2020, 5:47pm

Hey, thanks for looking at this.

I think the goal should be to have something like:

./model sampling output checkpoint_file=check.csv

And if someone hits Ctrl+C on the process, we can run the same command again and the model realizes it has a checkpoint file and just keeps running from where it left off.

I’d like to make the assumptions:

Behavior should be undefined if the model/data changes. Definitely we don’t want the model/data to change, but we don’t have any way to check.
A fixed seed does not guarantee the same output if checkpointing is used.

I think something similar can be accomplished right now with the output diagnostic_file=file.csv option in cmdstan and a bit of wrapping. Given you have a need for this, I’d recommend you build this all in your external scripts and see if it works for you before we try to do any sort of formal checkpointing in Stan itself.

The easy case is if warmup finishes.

We need to provide the sampler three things to get it going:

An inverse metric
A timestep
A place to start

If warmup has finished, we can get an estimate of the posterior covariance of the unconstrained parameters from the diagnostic file. We use this matrix as the inverse metric (no additional inverting or anything necessary; the unconstrained posterior covariance is what we want).

We can get the adapted timestep from the output file.

We can use as the place to start the last output sample that printed fully.

One thing I left off is which samples from the diagnostic file do we use to compute the covariance of the unconstrained parameters? That’s where things get tricky even in the easy case!

By default warmup looks like:

init_buffer | series of adaptation windows | term_buffer

init_buffer is 75 draws by default and term_buffer is 50 draws by default.

By default the series of adaptation windows are (in number of draws):

25 | 50 | 100 | 200 | 500

Or in terms of draw numbers:

76-100 | 101-150 | 151-250 | 251-450 | 451-950
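As a rough sketch of how that schedule could be computed from the inputs: the function below doubles the window size each time and stretches a window to the end of the adaptation phase when the following window would not fit. The function name and argument names are made up for illustration, and it ignores edge cases such as very short warmups, so treat it as an approximation of the doubling rule rather than Stan's actual code.

```python
def adaptation_windows(num_warmup=1000, init_buffer=75, term_buffer=50, base_window=25):
    """Return the slow adaptation windows as (first_draw, last_draw), 1-based inclusive.

    Simplified sketch: assumes num_warmup is large enough to hold both
    buffers and at least one window.
    """
    end_of_windows = num_warmup - term_buffer  # last draw before term_buffer
    windows = []
    start = init_buffer + 1
    size = base_window
    while start <= end_of_windows:
        end = start + size - 1
        # The following window would be twice as large. If it would run
        # past the end of the windowed phase, stretch this window instead.
        if end + 2 * size > end_of_windows:
            end = end_of_windows
        windows.append((start, end))
        start = end + 1
        size *= 2
    return windows


if __name__ == "__main__":
    print(adaptation_windows())
```

With the default arguments this reproduces the draw ranges listed above.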

At the end of each of the adaptation windows the inverse metric is set to a regularized posterior covariance estimate from that window.
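Here is a minimal sketch of that estimate for one window, in plain Python. It assumes the unconstrained draws have already been read out of the diagnostic file into a list of rows (the parsing is not shown), and it assumes the regularization shrinks the sample covariance by n / (n + 5) and adds 1e-3 * 5 / (n + 5) to the diagonal, which is my understanding of what Stan does for a dense metric. The function names are hypothetical.

```python
def window_draws(draws, first, last):
    """Select draws first..last (1-based, inclusive) from a list of rows."""
    return draws[first - 1:last]


def regularized_covariance(draws):
    """Regularized sample covariance of unconstrained draws.

    draws: list of rows, each row a list of unconstrained parameter values.
    Assumed regularization: shrink by n / (n + 5), then add
    1e-3 * 5 / (n + 5) to the diagonal.
    """
    n = len(draws)
    if n < 2:
        raise ValueError("need at least two draws")
    dim = len(draws[0])
    means = [sum(row[j] for row in draws) / n for j in range(dim)]
    cov = [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in draws) / (n - 1)
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    shrink = n / (n + 5.0)
    for i in range(dim):
        for j in range(dim):
            cov[i][j] *= shrink
        cov[i][i] += 1e-3 * (5.0 / (n + 5.0))
    return cov
```

For the last default window this would be called as `regularized_covariance(window_draws(all_draws, 451, 950))`, where `all_draws` is whatever list of unconstrained draws you pulled from the diagnostic file.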

So to restart a run that finished warmup, use the metric computed from unconstrained draws 451-950, the timestep from draw 1000, and initialize with the last successfully printed draw.
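A sketch of what the external wrapper might run for this easy case, assuming CmdStan's existing `metric_file`, `stepsize`, `init`, and `adapt engaged=0` arguments and a dense metric. The helper name and all file names are placeholders. It also assumes the metric has been written to a JSON file with an `inv_metric` entry and that the init file holds the last fully printed draw; neither file is produced here.

```python
def restart_command(model, metric_file, stepsize, init_file,
                    num_samples, output_file, data_file=None):
    """Build a CmdStan argument list that skips warmup and adaptation.

    Hypothetical helper: only assembles the arguments, does not run anything.
    """
    args = [
        model, "sample",
        f"num_samples={num_samples}", "num_warmup=0",
        "adapt", "engaged=0",
        "algorithm=hmc", "metric=dense_e",
        f"metric_file={metric_file}", f"stepsize={stepsize}",
    ]
    if data_file is not None:
        args += ["data", f"file={data_file}"]
    args += [f"init={init_file}", "output", f"file={output_file}"]
    return args


if __name__ == "__main__":
    # Placeholder paths and values, for illustration only.
    print(" ".join(restart_command("./model", "metric.json", 0.1,
                                   "init.json", 500, "restart.csv")))
```

The returned list can be handed to `subprocess.run` from the wrapper script. The number of sampling draws still needed has to be worked out by the wrapper from the partial output file.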

If the process is stopped before warmup finishes, what I suggest we do is restart at the end of the last completed warmup window (the schedule I showed above isn’t fixed, but it can be computed as a function of the input parameters to cmdstan).

So the basic steps:

Discard the unconstrained samples since the last window
Compute the posterior covariance approximation from the last window and set the inverse metric
Initialize a timestep from the last window
Set init_buffer = 0 if past the initial buffer
Adjust window_size to be twice as large as the last window (this is the standard way the window size grows)
Set the initial sample to the last draw of the last window

Some other thoughts:

Instead of throwing away the samples since the last window we could just use them – this’d mess with repeatability
Don’t worry about timestep adaptation. The dual averaging thing settles pretty quickly. No need to save or recover its state.

So if we canceled a process at warmup draw 301 we could do:

Set inv_metric to unconstrained covariance of samples 151 to 250
Set timestep to the timestep from step 250
Set init_buffer = 0
Set window_size = 200 (last window was size 100)
Set init to the value of the parameters from step 250
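The same bookkeeping as a small sketch, assuming the window schedule is already known as a list of (first draw, last draw) pairs. The function name and the dictionary keys are made up for illustration; they are not cmdstan options except where they match the names used above.

```python
def restart_plan(last_draw, windows):
    """Work out restart settings from the last completed adaptation window.

    last_draw: last warmup draw that was fully written before the interrupt.
    windows: list of (first_draw, last_draw) pairs, 1-based inclusive.
    Returns None if no window had completed yet.
    """
    completed = [w for w in windows if w[1] <= last_draw]
    if not completed:
        return None
    first, last = completed[-1]
    return {
        "metric_draws": (first, last),          # draws for the covariance estimate
        "stepsize_draw": last,                  # take the timestep from this draw
        "init_draw": last,                      # start from this draw's parameters
        "init_buffer": 0,                       # already past the initial buffer
        "window_size": 2 * (last - first + 1),  # windows double in size
    }


if __name__ == "__main__":
    default_windows = [(76, 100), (101, 150), (151, 250), (251, 450), (451, 950)]
    print(restart_plan(301, default_windows))
```

For an interrupt at draw 301 this picks window 151-250, draw 250 for the timestep and the init, and a next window size of 200, matching the list above.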

I guess the most frustrating thing about this method is that if we cancel anywhere in the range 451-950 then we lose those samples. I think to do better than this though we’d have to start modifying Stan and changing warmup (which isn’t off the table, I’m just trying to point out what is easy here).
