Hey, thanks for looking at this.
I think the goal should be to have something like:
./model sample output checkpoint_file=check.csv
And if someone hits Ctrl+C to kill the process, we can rerun the same command; the model sees that it has a checkpoint file and just keeps running from where it left off.
I’d like to make the assumptions:
- Behavior should be undefined if the model/data changes. Definitely we don’t want the model/data to change, but we don’t have any way to check.
- A fixed seed does not guarantee the same output if checkpointing is used.
I think something similar can be accomplished right now with the output diagnostic_file=file.csv option in cmdstan and a bit of wrapping. Given you have a need for this, I’d recommend you build this all in your external scripts and see if it works for you before we try to do any sort of formal checkpointing in Stan itself.
The easy case is if warmup finishes.
We need to provide the sampler three things to get it going:
- An inverse metric
- A timestep
- A place to start
Once warmup has finished, we can get an estimate of the posterior covariance of the unconstrained parameters from the diagnostic file. We use this matrix directly as the inverse metric (no additional inverting or anything necessary; the unconstrained posterior covariance is what we want).
We can get the adapted timestep from the output file.
We can use as the place to start the last output sample that printed fully.
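For the "last output sample that printed fully" part, here's a rough Python sketch of how a wrapper script might pull that draw out of a Stan CSV file after the process was killed. It assumes comment lines start with `#`, that the step size shows up in a `stepsize__` column, and that a line only counts as complete if it ends with a newline. The function names `read_complete_rows` and `last_complete_draw` are made up for illustration.

```python
import csv


def read_complete_rows(text):
    """Return (header, rows) from Stan CSV text, keeping only complete rows."""
    # Everything after the final "\n" is either empty or a partially written
    # line (the process may have died mid-write), so drop it unconditionally.
    lines = text.split("\n")[:-1]
    # Skip blank lines and "#" comment lines.
    data = [ln.rstrip("\r") for ln in lines
            if ln.strip() and not ln.startswith("#")]
    if not data:
        return [], []
    reader = csv.reader(data)
    header = next(reader)
    rows = []
    for fields in reader:
        if len(fields) != len(header):
            continue  # wrong number of columns: not a usable draw
        try:
            rows.append([float(x) for x in fields])
        except ValueError:
            continue  # non-numeric field: not a usable draw
    return header, rows


def last_complete_draw(text):
    """Return the last fully written draw as a {column: value} dict, or None."""
    header, rows = read_complete_rows(text)
    if not rows:
        return None
    return dict(zip(header, rows[-1]))
```

With the output file's contents read into a string, `last_complete_draw(text)["stepsize__"]` would give the step size to restart with, and the parameter columns of the same dict would give the place to start.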
One thing I left off is which samples from the diagnostic file do we use to compute the covariance of the unconstrained parameters? That’s where things get tricky even in the easy case!
By default warmup looks like:
init_buffer | series of adaptation windows | term_buffer
init_buffer is 75 draws by default and term_buffer is 50 draws by default.
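By default the series of adaptation windows are (in number of draws):
25 | 50 | 100 | 200 | 500
(each window is twice the previous one, except the last, which gets stretched from 400 to 500 draws so it runs right up to the term_buffer)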
Or in terms of draw numbers:
76-100 | 101-150 | 151-250 | 251-450 | 451-950
At the end of each of the adaptation windows the inverse metric is set to a regularized posterior covariance estimate from that window.
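Here's a small Python sketch of how an external script might compute that schedule from the cmdstan adaptation settings. It only reproduces the rule described here (each window doubles, and a window is stretched to the start of term_buffer when the next doubled window wouldn't fit); it is not a copy of Stan's own code, and it ignores edge cases such as buffers that don't fit inside the warmup length. The name `warmup_windows` and its argument names are mine.

```python
def warmup_windows(num_warmup=1000, init_buffer=75, term_buffer=50, window=25):
    """Return the adaptation windows as 1-based inclusive (first, last) draws."""
    last_slow_draw = num_warmup - term_buffer  # final draw before term_buffer
    windows = []
    start = init_buffer + 1
    size = window
    while start <= last_slow_draw:
        end = start + size - 1
        # If the next window (twice this size) would not fit before
        # term_buffer, stretch the current window to use up the rest.
        if end + 2 * size > last_slow_draw:
            end = last_slow_draw
        windows.append((start, end))
        start = end + 1
        size *= 2
    return windows


if __name__ == "__main__":
    for first, last in warmup_windows():
        print(f"{first}-{last}")
```

With the defaults this gives back the draw ranges listed above, ending with 451-950.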
So to restart something that has finished warmup, use the metric computed from unconstrained draws 451-950, the timestep from draw 1000, and initialize with the last successfully printed draw.
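As a rough idea of the covariance step, here's a dependency-free Python sketch. The shrinkage toward a small multiple of the identity, with weights n/(n+5) and 1e-3 * 5/(n+5), is what I remember Stan's dense metric adaptation doing; treat those constants as an assumption and check them against the Stan source before relying on them. The names `sample_covariance` and `regularized_inv_metric` are made up.

```python
def sample_covariance(draws):
    """Unbiased sample covariance of a list of equal-length draws."""
    n = len(draws)
    d = len(draws[0])
    means = [sum(row[j] for row in draws) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in draws:
        dev = [row[j] - means[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += dev[i] * dev[j]
    return [[c / (n - 1) for c in r] for r in cov]


def regularized_inv_metric(draws):
    """Shrink the sample covariance slightly toward a scaled identity.

    ASSUMPTION: the weights below are recalled from Stan's source, not
    taken from this thread; verify them before use.
    """
    n = len(draws)
    cov = sample_covariance(draws)
    d = len(cov)
    weight = n / (n + 5.0)
    ridge = 1e-3 * (5.0 / (n + 5.0))
    return [[weight * cov[i][j] + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
```

If `rows` holds the unconstrained parameter columns of every warmup draw from the diagnostic file (which assumes warmup draws were written there), then `regularized_inv_metric(rows[450:950])` would cover draws 451-950.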
If the run is killed before warmup finishes, what I suggest we do is restart from the end of the last completed adaptation window (the schedule I showed above isn’t fixed, but it can be computed as a function of the input parameters to cmdstan).
So the basic steps:
- Discard the unconstrained samples since the last window
- Compute the posterior covariance approximation from the last window and set the inverse metric
- Initialize a timestep from the last window
- Set init_buffer = 0 if past the initial buffer
- Adjust window_size to be twice as large as the last window (this is the standard way the window sizes grow)
- Set the initial sample to the last draw of the last window
Some other thoughts:
- Instead of throwing away the samples since the last window we could just use them – this’d mess with repeatability
- Don’t worry about timestep adaptation. The dual averaging thing settles pretty quickly. No need to save or recover its state.
So if we canceled a process at warmup draw 301 we could do:
- Set inv_metric to unconstrained covariance of samples 151 to 250
- Set timestep to the timestep from step 250
- Set init_buffer = 0
- Set window_size = 200 (last window was size 100)
- Set init to the value of the parameters from step 250
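The same bookkeeping could be done by a wrapper script along these lines. This is only a sketch: `restart_plan` and the dictionary keys are hypothetical names, the window list is passed in as 1-based inclusive draw ranges, and it returns None when no window had finished (in which case you'd just start over).

```python
def restart_plan(windows, killed_at):
    """Pick restart settings from the last window completed by draw killed_at.

    windows   -- list of (first, last) 1-based inclusive warmup draw ranges
    killed_at -- number of the last warmup draw reached before the kill
    """
    completed = [w for w in windows if w[1] <= killed_at]
    if not completed:
        return None  # killed before the first window ended
    first, last = completed[-1]
    return {
        "metric_draws": (first, last),      # draws used for the covariance
        "stepsize_draw": last,              # take the timestep from this draw
        "init_draw": last,                  # take the initial values from here
        "init_buffer": 0,                   # already past the initial buffer
        "window_size": 2 * (last - first + 1),  # next window doubles
    }


if __name__ == "__main__":
    default_windows = [(76, 100), (101, 150), (151, 250), (251, 450), (451, 950)]
    print(restart_plan(default_windows, killed_at=301))
```

For a kill at draw 301 with the default schedule this picks draws 151-250 for the metric, draw 250 for the timestep and init, and a window size of 200, matching the list above.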
I guess the most frustrating thing about this method is that if we cancel anywhere in the range 451-950 then we lose all the draws from that window. I think to do better than this though we’d have to start modifying Stan and changing warmup (which isn’t off the table, I’m just trying to point out what is easy here).