Current state of checkpointing in Stan

I would be interested in working on this once I get settled in at Aalto, though the time available will depend on which tasks from the FCAI staff have the highest priority. Would it make sense to show/implement this in CmdStan first, or in one of the others?

Cheers,
Matt

Just noticed this thread on output serialization future options, which seems to have grown out of some
thoughts from @rok_cesnovar partly about restarting failed sampling runs.

This discussion has been ongoing ever since we had more than one developer on the project :-)

1 Like

I brought up the topic of implementing checkpointing in CmdStan on last week's StanDev call and folks seemed pretty positive about the idea. @jonah asked me to resurrect this thread to get some more community feedback on the idea. As I am new to Stan-Dev, it's not clear to me how straightforward this implementation is, but I believe it will be a feature many users will be happy to have, particularly those who are running time-limited jobs on local clusters.

Proposal bullet points

  • Implement in CmdStan so that it doesn't need additional packages to work
  • Write the mass matrix to the streamed CSV output, if that isn't done already
  • When the checkpointing flag is enabled and a recognized CSV file exists in the output directory, have CmdStan resume from it rather than restart from scratch

Questions in short term

  • Does the state of the random SEED at the time of the last sample matter?
  • Is it worth storing the warm-up results in a separate file?
  • How does streaming results work when there are multiple chains? Separate files?

I need to read up on how SLURM and HTCondor DAG managers work and make sure this proposal doesn’t get tripped up in how ejected and restarted jobs are handled.

2 Likes

Hey, thanks for looking at this.

I think the goal should be to have something like:

./model sample output checkpoint_file=check.csv

And if someone Ctrl+Cs the process, then we can run the same command and the model realizes it has a checkpoint file and just keeps running from where it left off.

I’d like to make the assumptions:

  1. Behavior should be undefined if the model/data changes. We definitely don't want the model/data to change, but we don't have any way to check.

  2. A fixed seed does not guarantee the same output if checkpointing is used.

I think something similar can be accomplished right now with the output diagnostic_file=file.csv option in CmdStan and a bit of wrapping. Given that you have a need for this, I'd recommend you build this all in your external scripts and see if it works for you before we try to do any sort of formal checkpointing in Stan itself.
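For example, a minimal sketch of the first run (file names and the seed are placeholders); saving both output streams means everything needed for a manual restart is already on disk:

# initial (possibly interrupted) run: the output file holds the constrained
# draws and the adapted step size, the diagnostic file holds the
# unconstrained draws needed to re-estimate the metric later
./model sample \
  data file=data.json \
  random seed=1234 \
  output file=output.csv diagnostic_file=diagnostic.csv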


The easy case is if warmup finishes.

We need to provide the sampler three things to get it going:

  1. An inverse metric
  2. A timestep
  3. A place to start

If warmup has finished, we can get an estimate of the posterior covariance of the unconstrained parameters from the diagnostic file. We use this matrix as the inverse metric (no additional inverting or anything is necessary; the unconstrained posterior covariance is what we want).

We can get the adapted timestep from the output file.

We can use the last output sample that printed fully as the place to start.

One thing I left out is which samples from the diagnostic file we use to compute the covariance of the unconstrained parameters. That's where things get tricky, even in the easy case!


By default warmup looks like:

init_buffer | series of adaptation windows | term_buffer

init_buffer is 75 draws by default and term_buffer is 50 draws by default.

By default the series of adaptation windows are (in number of draws):

25 | 50 | 100 | 200 | 400

Or in terms of draw numbers:

76-100 | 101-150 | 151-250 | 251-450 | 451-950

(The last window is extended from its nominal 400 draws to 500 so that adaptation ends right where term_buffer begins.)

At the end of each of the adaptation windows the inverse metric is set to a regularized posterior covariance estimate from that window.

So to restart something that finished warmup: use the metric estimated from unconstrained draws 451-950, the timestep at draw 1000, and initialize with the last successfully printed draw.
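In CmdStan-argument terms, restarting the sampling phase could look roughly like this (a sketch only: inv_metric.json, last_draw.json, and the step size value are placeholders assumed to have been extracted from the CSVs by an external script; for a default run you'd use metric=diag_e with just the variances rather than dense_e):

# placeholders: STEPSIZE is the adapted step size read from output.csv,
# inv_metric.json holds the covariance of unconstrained draws 451-950,
# last_draw.json holds the last fully printed draw from output.csv
STEPSIZE=0.05

./model sample num_warmup=0 num_samples=1000 \
    adapt engaged=0 \
    algorithm=hmc metric=dense_e metric_file=inv_metric.json stepsize=$STEPSIZE \
  data file=data.json \
  init=last_draw.json \
  output file=output_continued.csv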


If the run is interrupted before warmup finishes, what I suggest we do is restart at the end of the last completed warmup window (the schedule I showed above isn't fixed, but it can be computed as a function of the input arguments to cmdstan).
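For reference, the default window boundaries can be recomputed with a few lines of shell (this reproduces the 76-100 | 101-150 | 151-250 | 251-450 | 451-950 layout above; the exact edge-case handling inside Stan's windowed adaptation may differ slightly):

# default warmup arguments, as in cmdstan
num_warmup=1000
init_buffer=75
term_buffer=50
window=25

start=$((init_buffer + 1))
size=$window
last_adapt_draw=$((num_warmup - term_buffer))

while [ "$start" -le "$last_adapt_draw" ]; do
  end=$((start + size - 1))
  # if the next (doubled) window would not fit before term_buffer,
  # extend the current window to the end of adaptation instead
  if [ $((end + 2 * size)) -gt "$last_adapt_draw" ]; then
    end=$last_adapt_draw
  fi
  echo "adaptation window: draws $start-$end"
  start=$((end + 1))
  size=$((size * 2))
done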

So the basic steps:

  1. Discard the unconstrained samples since the last window
  2. Compute the posterior covariance approximation from the last window and set the inverse metric
  3. Initialize a timestep from the last window
  4. Set init_buffer = 0 if past the initial buffer
  5. Adjust window_size to be twice as large as the last window (this is the standard way the window sizes grow)
  6. Set the initial sample to the last draw of the last window

Some other thoughts:

  1. Instead of throwing away the samples since the last window, we could just use them, though this would mess with repeatability
  2. Don’t worry about timestep adaptation. The dual averaging thing settles pretty quickly. No need to save or recover its state.

So if we canceled a process at warmup draw 301 we could do the following (a rough CmdStan invocation is sketched after this list):

  1. Set inv_metric to unconstrained covariance of samples 151 to 250
  2. Set timestep to the timestep from step 250
  3. Set init_buffer = 0
  4. Set window_size = 200 (last window was size 100)
  5. Set init to the value of the parameters from step 250
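Concretely, that restart might look roughly like this (a sketch only: the metric, init, and step size values are placeholders assumed to come from an external script, and it assumes a supplied metric_file is used as the starting point for continued adaptation):

# placeholders: STEPSIZE is the step size at draw 250 (from output.csv),
# inv_metric.json holds the covariance of unconstrained draws 151-250,
# init_250.json holds the parameter values at draw 250
STEPSIZE=0.05

# 750 of the original 1000 warmup draws remain; adaptation stays on,
# with init_buffer=0 and the window size doubled to 200
./model sample num_warmup=750 num_samples=1000 \
    adapt engaged=1 init_buffer=0 window=200 term_buffer=50 \
    algorithm=hmc metric=dense_e metric_file=inv_metric.json stepsize=$STEPSIZE \
  data file=data.json \
  init=init_250.json \
  output file=output_restart.csv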

I guess the most frustrating thing about this method is that if we cancel anywhere in the range 451-950 then we lose those samples. I think to do better than this, though, we'd have to start modifying Stan and changing warmup (which isn't off the table, I'm just trying to point out what is easy here).

6 Likes

I'd also add that there was a lengthy discussion on how to best handle the various forms of output from Stan, including the ones necessary for checkpointing, at Proposal for consolidated output (unfortunately it never got anywhere in the end, but it may be worth checking out just to avoid rehashing the same arguments).

Hope you succeed in moving this forward!

1 Like

The state of our sampler is larger than just the seed, so it's not easy to get an exact restart. But it shouldn't matter.

That would be really great if we could achieve this, especially if we could analyze the intermediate output.

I've quickly come to believe it's governed by strange magic. I'll try to digest Francis Bach's tutorials, which seem relevant.

Why throw them away? We’d have the metric and step size to keep going, right?

I guess I was thinking it might be annoying to load them back up into memory. But that shouldn’t be too hard, I suppose.

There was a thread recently discussing problems with timestep adaptation: Issue with dual averaging - #35 by monnahc

1 Like

An HPC site mentioned the following when I asked about CRIU support (with Stan in mind):

DMTCP Checkpoint/Restart allows one to transparently checkpoint to disk a distributed computation. It works under Linux, with no modifications to the Linux kernel nor to the application binaries. It can be used by unprivileged users (no root privilege needed). One can later restart from a checkpoint, or even migrate the processes by moving the checkpoint files to another host prior to restarting.

From the getting-started guide, it looks like a checkpoint and restart would only require:

# start the model
dmtcp_launch ./model sample $args &
sleep 60  # or whatever walltime limit

# checkpoint and stop
dmtcp_command --checkpoint
dmtcp_command --kill

# later restart
dmtcp_restart ckpt_a.out_*.dmtcp &

Assuming no performance issues (!) this could be a good way to run models on HPC sites with walltime limits.
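For what it's worth, on a Slurm cluster this could be wrapped in a batch script along these lines (a rough sketch only; the time limits, file names, and the assumption that dmtcp is installed on the cluster are placeholders):

#!/bin/bash
#SBATCH --time=06:00:00
#SBATCH --job-name=stan_checkpointed

# resume from an existing checkpoint if there is one, otherwise start fresh
if ls ckpt_*.dmtcp >/dev/null 2>&1; then
  dmtcp_restart ckpt_*.dmtcp &
else
  dmtcp_launch ./model sample data file=data.json output file=output.csv &
fi

# checkpoint and stop shortly before the walltime limit (5h30m here)
sleep $((5 * 3600 + 30 * 60))
dmtcp_command --checkpoint
sleep 60   # give the checkpoint time to finish writing
dmtcp_command --kill

# then resubmit the same script; the next job picks up the checkpoint files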

4 Likes

dmtcp seems like an incredible “just works” solution for checkpointing.

Unfortunately, reality seems less magical. About a year ago I tried very hard to use dmtcp for check-pointing long-running RStan and rstanarm models and ran into some serious stability issues in dmtcp (i.e., segmentation faults) that rendered it unusable for that purpose. Please let me know if you have a different experience. I think it might work better if it has plenty of memory headroom.

4 Likes

Thanks for the reality check. It’s unsurprising something like this would have sharp edges. I will try to test it again and also CRIU (did you try that as well?) and see what I come up with.

1 Like

I did not try CRIU

I tried both now with CmdStan 2.17 (need to update…) and they checkpoint and restore successfully, except CRIU needs root to restore (though maybe a more recent version could restore as non-root).

Did you have segfaults immediately or only when you tried longer runs?

2 Likes

Here’s the bug report I sent to dmtcp: https://sourceforge.net/p/dmtcp/mailman/message/36959396/

I would get crashes in a sequence of checkpoint -> restore -> checkpoint -> restore. The second (or sometimes third) checkpoint would fail.
My use case was fitting models on a Slurm backfill queue, so I needed to be able to checkpoint and restore repeatedly without issue.

3 Likes

Does anyone want to collaborate on writing a version of the algorithm that enables checkpointing?

I believe that in order to do this right, we actually have to rewrite the algorithm so that it has a clearly defined state that can be serialized and deserialized. There aren't too many pieces of state that we need to capture, but the way it's designed and implemented now, it's going to be really hard to deserialize that information and plug it back into the algorithm where it matters.

If anyone would like to help, we can start a thread on just that and work towards an implementation that works.

4 Likes

Not sure how much help I can be, as my C++ is very rusty. But if successful, this effort could greatly benefit me, so I'll follow along and try to support it if I can.

1 Like

It would be great to get it working, but there seem to be a handful of design decisions to make prior to that: should the checkpoint format be stable? In a known format, or just fwrite a struct from memory? Should checkpointing be threaded through the various APIs or just a signal handler in CmdStan? Should a checkpoint correspond to an accepted proposal or a leapfrog step? How to handle continuing a checkpoint if the output CSV disappears? Etc.

Things like CRIU provide a solution at a lower level of abstraction where these questions go away or are answered already. If the platform supports CRIU then it's hard to get motivated to write up a rationale for all the questions above, much less make it harder to maintain the algorithms themselves.

2 Likes

Are we talking here about having access to the PRNG state (or just advancing it), so that

Sample 10 -> Sample 10 -> Sample 10 == Sample 30

And then the rest of the work is for the user: to be sure that the metric + stepsizes / inits are included.

Also assuming that warmup is not considered here?

I think check-pointing during warmup is pretty important!

2 Likes

That’s a really neat project. I was thinking about it a little differently. I’ll post thoughts up soon.

@groceryheist, I’m also thinking about check-pointing during warmup. And any help, even if it’s evaluating the goals, is useful.

1 Like