Current state of checkpointing in Stan

As checkpointing seems to have multiple meanings in computing these days, I should clarify: I mean periodically saving job progress so that a job can be restarted on a batch system after it is stopped by a priority issue or a transient failure.

The other thread on this topic had its last few replies in February of this year, and I was curious whether anything further had been done to bring this functionality to CmdStan or any of the other interfaces. The replies from various core team developers do make it clear that checkpointing at any stage of sampling is a non-trivial task, so I understand that this may be deep in the feature-request queue.

I ask because I have mostly worked with large, parallelizable workflows on batch (HTCondor/LSF) systems, and having checkpointing in Stan would be a good selling point to some former colleagues over in astronomy.

We work a lot with Slurm systems where job walltime is typically limited to 24 hours, and some of our models take weeks to run. Since the core Stan code doesn’t seem amenable to explicit checkpointing, we saw two approaches available:

(1) run the model for a number of iterations that fits within the walltime limit, take the output as the initialization for the next job, and iterate until completion (see the sketch after this list). This is challenging for warmup, since you have to implement the adaptation schedule outside of Stan, but it seems to work.

(2) use something like CRIU, a Linux-kernel-dependent technique for dumping the process state to files and restarting it in a second job.

These differ from typical checkpointing in HPC, where you would usually checkpoint and continue within the same job, but they may still be applicable in your case.
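
For what it’s worth, here is a rough sketch of what approach (1) can look like today with cmdstanpy (which we didn’t have when we set this up). The file names, chunk sizes, and single chain are placeholders, and in a real workflow each chunk would be a separate batch job rather than one Python script:

```python
import json
import numpy as np
from cmdstanpy import CmdStanModel

def save_state(fit, path):
    # Persist everything a later job needs to continue sampling:
    # the adapted step size, the adapted (diagonal) metric, and the
    # last recorded position of every parameter.
    state = {
        "step_size": float(fit.step_size[0]),
        "inv_metric": fit.metric[0].tolist(),
        "inits": {name: np.asarray(draws)[-1].tolist()
                  for name, draws in fit.stan_variables().items()},
    }
    with open(path, "w") as f:
        json.dump(state, f)

model = CmdStanModel(stan_file="model.stan")
data = "data.json"

# First job: do the full warmup (the part that is hard to split up),
# sample for as long as the walltime allows, then save the state.
fit = model.sample(data=data, chains=1,
                   iter_warmup=1000, iter_sampling=500)
save_state(fit, "state.json")

# Every subsequent job: skip adaptation, reuse the adapted step size and
# metric, and start from the last saved position.
with open("state.json") as f:
    state = json.load(f)
fit = model.sample(data=data, chains=1,
                   iter_warmup=0, iter_sampling=500,
                   adapt_engaged=False,
                   step_size=state["step_size"],
                   metric={"inv_metric": state["inv_metric"]},
                   inits=state["inits"])
save_state(fit, "state.json")
```

The main caveat is the one above: warmup has to fit inside the first job, or its schedule has to be re-implemented outside of Stan.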

Have you considered something similar to https://github.com/ahartikainen/fit_check_fit_loop?

It’s implemented with PyStan + ArviZ, but the same thing can be done with CmdStan (and for diagnostics you can use ArviZ, or manually combine the CSV files and run RStan’s monitor).
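
Roughly, the idea is: fit, compute diagnostics, and refit with longer chains if they don’t pass. A minimal sketch with CmdStan via cmdstanpy plus ArviZ (the R-hat/ESS cutoffs and the doubling schedule here are illustrative, not exactly what the repo does):

```python
import arviz as az
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # placeholder model/data files
data = "data.json"

num_samples = 500
for attempt in range(5):               # cap the number of refits
    fit = model.sample(data=data, chains=4,
                       iter_warmup=1000, iter_sampling=num_samples)
    summary = az.summary(az.from_cmdstanpy(fit))
    if summary["r_hat"].max() < 1.01 and summary["ess_bulk"].min() > 400:
        break                          # diagnostics look fine, keep this fit
    num_samples *= 2                   # otherwise refit with longer chains
```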

Yes, I forgot to mention that PyStan now seems to provide very helpful APIs for doing this, but that wasn’t the case back when (~2017) I had to implement it. ArviZ looks really nice, by the way.

This hasn’t been a big priority for us, so I don’t know that anything’s been done.

After warmup, it’s easy to restart given the adapted mass matrix (metric), the adapted step size, and the last recorded position. Before warmup has finished, there’s much more state to maintain around the adaptation.
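
For the post-warmup case, the restart amounts to feeding those three pieces of state back in with adaptation switched off. A minimal sketch with cmdstanpy, assuming the earlier run saved them to a file ("adaptation.json" and the other file names are placeholders):

```python
import json
from cmdstanpy import CmdStanModel

# Read back the state saved by the interrupted run: the adapted step size,
# the adapted inverse metric, and the last recorded draw of each parameter.
with open("adaptation.json") as f:
    saved = json.load(f)

model = CmdStanModel(stan_file="model.stan")
fit = model.sample(
    data="data.json",
    chains=1,
    iter_warmup=0,
    iter_sampling=1000,
    adapt_engaged=False,                         # no further adaptation
    step_size=saved["step_size"],
    metric={"inv_metric": saved["inv_metric"]},  # adapted (diagonal) metric
    inits=saved["last_draw"],                    # last recorded position
)
```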