Current state of checkpointing in Stan

As checkpointing seems to have multiple meanings in computing these days, I should clarify: I mean periodically saving job progress so that a job can be restarted on a batch system after it is stopped by a priority issue or a transient failure.

The other thread on this topic had its last few replies in February of this year, and I was curious whether anything further had been done to bring this functionality to CmdStan or any of the other interfaces. The replies from various core team developers do make it clear that checkpointing at any stage of sampling is a non-trivial task, so I understand that this may be deep in the feature-request queue.

I ask because I have mostly worked with large, parallelizable workflows on batch (HTCondor/LSF) systems, and having checkpointing in Stan would be a good selling point to some former colleagues over in astronomy.

We work a lot with Slurm systems where job walltime is typically limited to 24 hours, and some of our models take weeks to run. Since the core Stan code doesn’t seem amenable to explicit checkpointing, we saw two approaches available:

(1) run the model for a number of iterations that fits within the walltime limit, take the output as the initialization for the next job, and iterate until completion (see the sketch after this list). This is challenging for warmup, since you have to implement the adaptation schedule outside of Stan, but it seems to work.

(2) use something like CRIU, a Linux-kernel-dependent technique for dumping the process state to files and restarting it in a second job.

These differ from typical checkpointing in HPC, where you would usually checkpoint and continue within the same job, but they may still be applicable in your case.
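
For what it’s worth, here is a rough sketch of what approach (1) can look like today with cmdstanpy (which we didn’t have when we set this up). The file names, chunk sizes, and single chain are placeholders, and in a real workflow each chunk would be a separate batch job rather than one Python script:

```python
import json
import numpy as np
from cmdstanpy import CmdStanModel

def save_state(fit, path):
    # Persist everything a later job needs to continue sampling:
    # the adapted step size, the adapted (diagonal) metric, and the
    # last recorded position of every parameter.
    state = {
        "step_size": float(fit.step_size[0]),
        "inv_metric": fit.metric[0].tolist(),
        "inits": {name: np.asarray(draws)[-1].tolist()
                  for name, draws in fit.stan_variables().items()},
    }
    with open(path, "w") as f:
        json.dump(state, f)

model = CmdStanModel(stan_file="model.stan")
data = "data.json"

# First job: do the full warmup (the part that is hard to split up),
# sample for as long as the walltime allows, then save the state.
fit = model.sample(data=data, chains=1,
                   iter_warmup=1000, iter_sampling=500)
save_state(fit, "state.json")

# Every subsequent job: skip adaptation, reuse the adapted step size and
# metric, and start from the last saved position.
with open("state.json") as f:
    state = json.load(f)
fit = model.sample(data=data, chains=1,
                   iter_warmup=0, iter_sampling=500,
                   adapt_engaged=False,
                   step_size=state["step_size"],
                   metric={"inv_metric": state["inv_metric"]},
                   inits=state["inits"])
save_state(fit, "state.json")
```

The main caveat is the one above: warmup has to fit inside the first job, or its schedule has to be re-implemented outside of Stan.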

Have you considered something similar to https://github.com/ahartikainen/fit_check_fit_loop?

It’s implemented with PyStan + ArviZ, but the same thing can be done with CmdStan (and for diagnostics you can use ArviZ, or manually combine the CSV files and run RStan’s monitor).
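
Roughly, the idea is: fit, compute diagnostics, and refit with longer chains if they don’t pass. A minimal sketch with CmdStan via cmdstanpy plus ArviZ (the R-hat/ESS cutoffs and the doubling schedule here are illustrative, not exactly what the repo does):

```python
import arviz as az
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # placeholder model/data files
data = "data.json"

num_samples = 500
for attempt in range(5):               # cap the number of refits
    fit = model.sample(data=data, chains=4,
                       iter_warmup=1000, iter_sampling=num_samples)
    summary = az.summary(az.from_cmdstanpy(fit))
    if summary["r_hat"].max() < 1.01 and summary["ess_bulk"].min() > 400:
        break                          # diagnostics look fine, keep this fit
    num_samples *= 2                   # otherwise refit with longer chains
```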

Yes, I forgot to mention that PyStan now seems to provide very helpful APIs for doing this, but that wasn’t the case back when (~2017) I had to implement it. ArviZ looks really nice, by the way.

This hasn’t been a big priority for us, so I don’t know that anything’s been done.

After warmup, it’s easy to restart given the adapted mass matrix (metric), the adapted step size, and the last recorded position. Before warmup has finished, there’s much more state to maintain around the adaptation.
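
For the post-warmup case, the restart amounts to feeding those three pieces of state back in with adaptation switched off. A minimal sketch with cmdstanpy, assuming the earlier run saved them to a file ("adaptation.json" and the other file names are placeholders):

```python
import json
from cmdstanpy import CmdStanModel

# Read back the state saved by the interrupted run: the adapted step size,
# the adapted inverse metric, and the last recorded draw of each parameter.
with open("adaptation.json") as f:
    saved = json.load(f)

model = CmdStanModel(stan_file="model.stan")
fit = model.sample(
    data="data.json",
    chains=1,
    iter_warmup=0,
    iter_sampling=1000,
    adapt_engaged=False,                         # no further adaptation
    step_size=saved["step_size"],
    metric={"inv_metric": saved["inv_metric"]},  # adapted (diagonal) metric
    inits=saved["last_draw"],                    # last recorded position
)
```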