Current state of checkpointing in Stan

As checkpointing seems to have multiple meanings in computing these days: I mean periodically saving job progress so that a job on a batch system can be restarted after it stops because of a priority issue or a transient failure.

The other post on this topic had its last few replies in February of this year, and I was curious whether anything further had been done to bring this functionality to CmdStan or any of the other interfaces. The replies from various CoreTeam developers do make it clear that doing this at any sampling stage is a non-trivial task, so I can understand that this may be deep in the queue of feature requests.

I ask because I have mostly worked with large parallelizable workflows on batch (HTCondor/LSF) systems, and having checkpointing in Stan would be a good selling point to some former colleagues over in astronomy.

We work a lot with Slurm systems where job walltime is limited to 24h typically, and some of our models take weeks to run. We saw two approaches available, as the core Stan code doesn’t seem amenable to explicit checkpointing:

(1) Run the model for a number of iterations that fits within the walltime limit, use the output as an initialization for another job, and iterate until completion. This is challenging for warmup, since you have to implement the schedule outside of Stan, but it seems to work.

(2) use something like CRIU which is a Linux-kernel dependent technique to dump the process to files and restart it in a second job.

These are different from typical checkpointing in HPC since usually you would checkpoint and continue, but still may be applicable for your case.

Have you considered something similar to https://github.com/ahartikainen/fit_check_fit_loop?

It’s implemented with PyStan+ArviZ, but the same thing can be done with CmdStan (and for diagnostics you can use ArviZ or manually combine csv files + RStan monitor)

Yes, I forgot to mention that PyStan now seems to implement very helpful APIs for doing this, but that wasn’t the case back when (~2017) I had to implement it. ArviZ looks really nice btw.

This hasn’t been a big priority for us, so I don’t know that anything’s been done.

After warmup, it’s easy to restart given the adapted mass matrix/metric and step size and last recorded position. Before warmup has finished, there’s much more state to maintain around the adaptation.

I’ve been searching for a good writeup of exactly how to do this and coming up empty handed. I believe you that it’s easy, but at the moment how to go about this is, to me, opaque. Is this something that can be done via the rstan interface?

Does anyone have internal or external documentation about how to do this? I’m using rstan on a Slurm HPC system.

By the by, some sort of checkpointing would also be super helpful for using AWS spot instances.

I think this can be done with PyStan and CmdStan (+CmdStanPy +CmdStanR).

I think it would be a good idea to have some kind of tutorial showing all of these steps.

Perhaps I shouldn’t have said “easy”—it’s an involved process that we don’t have well documented. I’ve never tried it.

If you’re working in R, here’s the doc on how to extract info from the output:

https://cran.r-project.org/web/packages/rstan/vignettes/stanfit-objects.html

You’ll need step size, the mass matrix or metric (making sure to get the inversion right), and the last draw to use as an initialization. Then you need to configure NUTS to run with no warmup and just keep using the step size and mass matrix you provide.
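As a rough illustration of that last step, here is a minimal sketch that writes a diagonal inverse metric and an init file as JSON and assembles CmdStan arguments for a restart with adaptation switched off. The function name, the file names, and the `./model` path are made up for the example. The argument names (`num_warmup`, `adapt engaged`, `metric_file`, `stepsize`, `init`) are CmdStan's as far as I know, but check them and their ordering against the CmdStan guide for your version. The sketch only builds the argument list; it does not run anything.

```python
import json
import os


def restart_args(model_exe, inv_metric_diag, step_size, last_draw,
                 num_samples, workdir):
    """Build a CmdStan command line that resumes sampling without warmup.

    inv_metric_diag: adapted diagonal inverse metric (list of floats).
    last_draw: dict of parameter name -> value on the constrained scale.
    """
    metric_path = os.path.join(workdir, "metric.json")
    init_path = os.path.join(workdir, "init.json")
    # CmdStan reads the inverse metric from a JSON entry named "inv_metric".
    with open(metric_path, "w") as f:
        json.dump({"inv_metric": list(inv_metric_diag)}, f)
    # The last fully written draw becomes the initial value.
    with open(init_path, "w") as f:
        json.dump(last_draw, f)
    return [
        model_exe, "sample",
        f"num_samples={num_samples}", "num_warmup=0",
        "adapt", "engaged=0",            # keep step size and metric fixed
        "algorithm=hmc", "metric=diag_e",
        f"metric_file={metric_path}", f"stepsize={step_size}",
        f"init={init_path}",
        "output", f"file={os.path.join(workdir, 'resumed.csv')}",
    ]
```

A real call would also need `data file=...` for the model's data, and the resulting list could be handed to `subprocess.run`. From R the same three values would be passed through whichever interface you use.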

It would be nice if we automated all this with checkpointing. It’d be a great contribution if someone wants to do it.

I would be interested in working on this once I get settled in at Aalto, though the time available is conditional upon which tasks from FCAI staff have highest priority. Would it make sense to show/implement this in CmdStan first or one of the others?

Cheers,
Matt

Just noticed this thread on output serialization future options, which seems to have grown out of some thoughts from @rok_cesnovar partly about restarting failed sampling runs.

This discussion has been ongoing ever since we had more than one developer on the project :-)

I brought up the topic of implementing checkpointing in cmdStan on last week’s StanDev call and folks seemed pretty positive about the idea. @jonah asked me to resurrect this thread to get some more community feedback on the idea. As I am new to Stan-Dev, it’s not clear to me how straightforward this implementation is, but I believe it will be a feature many users will be happy to have, particularly those who are running time-limited jobs on local clusters.

Proposal bullet points

  • Implement in cmdStan such that it doesn’t need additional packages to work
  • Write Mass Matrix to streamed CSV if not done already
  • If the checkpointing flag is enabled and a recognized CSV file exists in the output directory, require cmdStan to resume from it rather than restart from scratch.

Questions in short term

  • Does the random SEED at the time of the last sample matter?
  • Is it worth storing the warm-up results in a separate file?
  • How does streaming results work when there are multiple chains? Separate files?

I need to read up on how SLURM and HTCondor DAG managers work and make sure this proposal doesn’t get tripped up in how ejected and restarted jobs are handled.

Hey, thanks for looking at this.

I think the goal should be to have something like:

./model sampling output checkpoint_file=check.csv

And if someone Ctrl+Cs the process, then we can run the same command and the model realizes it has a checkpoint file and just keeps running from where it left off.

I’d like to make the assumptions:

  1. Behavior should be undefined if the model/data changes. Definitely we don’t want the model/data to change, but we don’t have any way to check.

  2. A fixed seed does not guarantee the same output if checkpointing is used.

I think something similar can be accomplished right now with the output diagnostic_file=file.csv option in cmdstan and a bit of wrapping. Given you have a need for this, I’d recommend you build this all in your external scripts and see if it works for you before we try to do any sort of formal checkpointing in Stan itself.


The easy case is if warmup finishes.

We need to provide the sampler three things to get it going:

  1. An inverse metric
  2. A timestep
  3. A place to start

If warmup has finished, we can get an estimate of the posterior covariance of the unconstrained parameters from the diagnostic file. We use this matrix as the inverse metric (no additional inverting or anything necessary – the unconstrained posterior covariance is what we want).

We can get the adapted timestep from the output file.

For the place to start, we can use the last output sample that printed fully.
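To make the last two points concrete, here is a small sketch that pulls the adapted step size and the last fully written draw out of the text of a Stan CSV output file. It assumes the usual layout: comment lines starting with `#`, a `# Step size = ...` comment after adaptation, and one header row. The function name is made up. A final line with no trailing newline is treated as a draw that was cut off mid-write.

```python
import csv


def read_restart_info(csv_text):
    """Return (step_size, header, last_complete_row) from Stan CSV text."""
    step_size = None
    header = None
    last_row = None
    # Everything after the final newline is either empty or a partial line,
    # so it is never treated as a complete draw.
    complete_lines = csv_text.split("\n")[:-1]
    for line in complete_lines:
        if line.startswith("#"):
            if "Step size" in line:
                step_size = float(line.split("=")[1])
            continue
        if not line.strip():
            continue
        fields = next(csv.reader([line]))
        if header is None:
            header = fields
            continue
        if len(fields) != len(header):
            continue  # malformed row, skip it
        try:
            last_row = [float(x) for x in fields]
        except ValueError:
            continue  # non-numeric garbage, keep the previous good row
    return step_size, header, last_row
```

Zipping `header` with `last_row` gives the parameter values to write into an init file. Note that these are on the constrained scale, unlike the draws in the diagnostic file.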

One thing I left off is which samples from the diagnostic file do we use to compute the covariance of the unconstrained parameters? That’s where things get tricky even in the easy case!


By default warmup looks like:

init_buffer | series of adaptation windows | term_buffer

By default, init_buffer is 75 draws and term_buffer is 50 draws.

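By default the adaptation windows are (in number of draws; each window is twice the size of the previous one, and the last one is stretched to end where term_buffer begins):

25 | 50 | 100 | 200 | 500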

Or in terms of draw numbers:

76-100 | 101-150 | 151-250 | 251-450 | 451-950
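The draw ranges above can be reproduced with a short sketch. It assumes 1000 warmup draws, the defaults quoted in this post, and the rule that a window is stretched to the start of term_buffer when the next doubled window would no longer fit. The function name is made up, and the fallbacks Stan applies for very short warmups are ignored.

```python
def adaptation_windows(num_warmup=1000, init_buffer=75, term_buffer=50,
                       base_window=25):
    """Return the adaptation windows as 1-based (first_draw, last_draw) pairs."""
    windows = []
    windows_end = num_warmup - term_buffer  # last draw before term_buffer
    start = init_buffer                     # draws consumed so far
    size = base_window
    while start < windows_end:
        end = start + size
        # If the following (doubled) window would not fit, stretch this one
        # so that it runs right up to term_buffer.
        if end + 2 * size > windows_end:
            end = windows_end
        windows.append((start + 1, end))
        start = end
        size *= 2
    return windows


print(adaptation_windows())
```

Changing `num_warmup` or the buffer sizes shows how the schedule shifts, which is what an external restart script would need to recompute.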

At the end of each of the adaptation windows the inverse metric is set to a regularized posterior covariance estimate from that window.
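For the diagonal case, that estimate can be sketched as follows. The shrinkage constants (weight n/(n+5) on the sample variance, plus a small 1e-3 term) are what I believe Stan uses for diag_e, so treat them as an assumption and check the Stan source before relying on them. The function name is made up, and the input is assumed to be the unconstrained draws from a single window.

```python
from statistics import variance


def regularized_diag_inv_metric(draws):
    """Regularized per-parameter variances from one adaptation window.

    draws: list of rows, each row a list of unconstrained parameter values.
    """
    n = len(draws)
    shrink = n / (n + 5.0)
    inv_metric = []
    for column in zip(*draws):
        var = variance(column)  # sample variance, n - 1 denominator
        # Shrink toward a small constant so the estimate stays positive
        # even when a parameter barely moved during the window.
        inv_metric.append(shrink * var + 1e-3 * (5.0 / (n + 5.0)))
    return inv_metric
```

With a dense metric the same shrinkage would apply to the full covariance matrix, with the small term added on the diagonal only.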

So to restart something that has finished warmup, use the metric from unconstrained draws 451-950, the timestep from draw 1000, and initialize with the last successfully printed draw.


If the process stops before warmup has finished, what I suggest we do is restart at the end of the last completed warmup window (the schedule I showed above isn’t fixed, but it can be computed as a function of the input parameters to cmdstan).

So the basic steps:

  1. Discard the unconstrained samples since the last window
  2. Compute the posterior covariance approximation from the last window and set the inverse metric
  3. Initialize a timestep from the last window
  4. Set init_buffer = 0 if past the initial buffer
  5. Adjust window_size to be twice as large as the last window (this is the standard way the window sizes grow)
  6. Set the initial sample to the last draw of the last window

Some other thoughts:

  1. Instead of throwing away the samples since the last window we could just use them – this’d mess with repeatability
  2. Don’t worry about timestep adaptation. The dual averaging thing settles pretty quickly. No need to save or recover its state.

So if we canceled a process at warmup draw 301 we could do:

  1. Set inv_metric to unconstrained covariance of samples 151 to 250
  2. Set timestep to the timestep from step 250
  3. Set init_buffer = 0
  4. Set window_size = 200 (last window was size 100)
  5. Set init to the value of the parameters from step 250
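The same bookkeeping can be written as a small sketch. It takes the number of the last warmup draw that was written and the window schedule (the default draw ranges from this post) and returns the settings listed above. The function and key names are made up for illustration, and nothing here touches Stan itself.

```python
DEFAULT_WINDOWS = [(76, 100), (101, 150), (151, 250), (251, 450), (451, 950)]


def warmup_restart_plan(last_written_draw, windows=DEFAULT_WINDOWS):
    """Pick restart settings after a job stops partway through warmup.

    windows: 1-based (first_draw, last_draw) adaptation windows.
    Returns None when no window has completed, i.e. start from scratch.
    """
    completed = [w for w in windows if w[1] <= last_written_draw]
    if not completed:
        return None
    first, last = completed[-1]
    return {
        "metric_from_draws": (first, last),   # covariance of these draws
        "step_size_from_draw": last,
        "init_from_draw": last,
        "init_buffer": 0,                     # already past the initial buffer
        "window": 2 * (last - first + 1),     # windows double in size
        "discarded_draws": last_written_draw - last,
    }


print(warmup_restart_plan(301))
```

For a process cancelled at draw 301 this picks draws 151 to 250 for the metric, a window of 200, and discards the 51 draws written since the window closed.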

I guess the most frustrating thing about this method is that if we cancel anywhere in the range 451-950 then we lose those samples. I think to do better than this, though, we’d have to start modifying Stan and changing warmup (which isn’t off the table, I’m just trying to point out what is easy here).

I’d also add that there was a lengthy discussion on how best to handle various forms of output from Stan, including the ones necessary for checkpointing, at Proposal for consolidated output (unfortunately it never got anywhere in the end, but it may be worth checking out just to avoid rehashing the same arguments).

Hope you succeed in moving this forward!

The state of our sampler’s larger than the seed, so it’s not easy to get an exact restart. But it shouldn’t matter.

That would be really great if we could achieve this, especially if we could analyze the intermediate output.

So quickly I’ve come to believe it’s governed by strange magic. I’ll try to digest Francis Bach’s tutorials, which seem relevant.

Why throw them away? We’d have the metric and step size to keep going, right?

I guess I was thinking it might be annoying to load them back up into memory. But that shouldn’t be too hard, I suppose.

There was a thread recently discussing problems with timestep adaptation: Issue with dual averaging