Current state of checkpointing in Stan

I've now tried both with CmdStan 2.17 (need to update…), and they checkpoint and restore successfully, except that CRIU needs root to restore (though maybe a more recent version could restore as non-root).

Did you have segfaults immediately or only when you tried longer runs?


Here’s the bug report I sent to dmtcp: https://sourceforge.net/p/dmtcp/mailman/message/36959396/

I would get crashes in a sequence of checkpoint -> restore -> checkpoint -> restore.
The second (or sometimes third) checkpoint would fail.
My use case was fitting models on a Slurm backfill queue, so I needed to be able to checkpoint and restore repeatedly without issue.


Does anyone want to collaborate on writing a version of the algorithm that enables checkpointing?

I believe that in order to do this right, we actually have to rewrite the algorithm so that it has a clearly defined state that can be serialized and deserialized. There aren’t too many pieces of state that we need to capture, but the way it’s designed and implemented now, it’s going to be really hard to deserialize that information and plug it back into the algorithm where it matters.
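To make "a clearly defined state that can be serialized and deserialized" a bit more concrete, here is a minimal sketch in standard C++. Everything in it is an assumption for illustration: `SamplerState`, its fields, `serialize`, and `deserialize` are hypothetical names and not part of Stan, and `std::mt19937_64` is only a stand-in for whatever generator the sampler actually uses. The real list of fields would have to come out of the redesign.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <limits>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical snapshot of what a sampler would need in order to resume.
// None of these names come from Stan's code base.
struct SamplerState {
  std::uint64_t iteration = 0;     // number of completed iterations
  double step_size = 0.0;          // current integrator step size
  std::vector<double> position;    // current point in parameter space
  std::vector<double> inv_metric;  // diagonal inverse metric
  std::mt19937_64 rng;             // stand-in for the sampler's PRNG
};

// Write a vector as "<size> <x0> <x1> ...".
void write_vec(std::ostream& out, const std::vector<double>& v) {
  out << v.size();
  for (double x : v) out << ' ' << x;
  out << '\n';
}

// Read a vector written by write_vec; returns false on malformed input.
bool read_vec(std::istream& in, std::vector<double>& v) {
  std::size_t n = 0;
  if (!(in >> n)) return false;
  v.resize(n);
  for (double& x : v) {
    if (!(in >> x)) return false;
  }
  return true;
}

// Serialize to text. max_digits10 keeps enough digits for finite doubles
// to round-trip exactly (inf and nan are not handled in this sketch).
std::string serialize(const SamplerState& s) {
  std::ostringstream out;
  out << std::setprecision(std::numeric_limits<double>::max_digits10);
  out << s.iteration << '\n' << s.step_size << '\n';
  write_vec(out, s.position);
  write_vec(out, s.inv_metric);
  out << s.rng << '\n';  // standard engines can stream their full state
  return out.str();
}

// Restore a state written by serialize; returns false on malformed input.
bool deserialize(const std::string& text, SamplerState& s) {
  std::istringstream in(text);
  if (!(in >> s.iteration >> s.step_size)) return false;
  if (!read_vec(in, s.position) || !read_vec(in, s.inv_metric)) return false;
  return static_cast<bool>(in >> s.rng);
}
```

The point of the sketch is only that once the state lives in one struct, checkpointing reduces to writing and reading that struct. The hard part described above is getting the algorithm to accept such a struct back.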

If anyone would like to help, we can start a thread on just that and work towards an implementation that works.


Not sure how much help I can be, as my C++ is very rusty. But if successful, this effort could greatly benefit me, so I’ll follow along and try to support it if I can.


It would be great to get it working, but there seem to be a handful of design decisions to make prior to that: should the checkpoint format be stable? Should it be a known format, or just fwrite a struct from memory? Should checkpointing be threaded through the various APIs, or just be a signal handler in CmdStan? Should a checkpoint correspond to an accepted proposal or a leapfrog step? How should we handle continuing from a checkpoint if the output CSV disappears? Etc.
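As an illustration of the "just a signal handler" option, here is a rough sketch in standard C++. It assumes the scheduler delivers `SIGTERM` before killing the job, and it takes one side of the granularity question by only reacting between whole iterations rather than between leapfrog steps. `on_signal`, `run_iterations`, and `simulate_signal_at` are hypothetical names; nothing here is CmdStan code.

```cpp
#include <cassert>
#include <csignal>

// Set by the handler, read by the sampling loop. sig_atomic_t is the
// type the C++ standard allows a signal handler to write safely.
volatile std::sig_atomic_t checkpoint_requested = 0;

// The handler only records the request; all real work happens in the loop.
extern "C" void on_signal(int) { checkpoint_requested = 1; }

// Hypothetical sampling loop. Runs iterations start..end-1 and returns the
// number of completed iterations, stopping early at an iteration boundary
// if a checkpoint was requested. simulate_signal_at exists only so the
// sketch can be exercised without an external process sending the signal.
int run_iterations(int start, int end, int simulate_signal_at = -1) {
  std::signal(SIGTERM, on_signal);
  int iter = start;
  while (iter < end) {
    // ... one full transition (all of its leapfrog steps) would run here ...
    ++iter;
    if (iter == simulate_signal_at) std::raise(SIGTERM);  // demo only
    if (checkpoint_requested) {
      // A real implementation would write its state to disk here.
      checkpoint_requested = 0;
      break;
    }
  }
  return iter;
}
```

For example, `run_iterations(0, 100, 37)` stops after 37 iterations, and a later `run_iterations(37, 100)` finishes the remaining ones. The sketch says nothing about the other questions (format stability, a missing output CSV), which would still need answers.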

Things like CRIU provide a solution at a lower level of abstraction, where these questions go away or are already answered. If the platform supports CRIU, then it’s hard to get motivated to write up a rationale for all the questions above, much less to make the algorithms themselves harder to maintain.


Are we talking here about having access to the PRNG state (or just advancing it)?

Sample 10 -> Sample 10 -> Sample 10 == Sample 30

And then the rest of the work is for the user to be sure that the metric, step sizes, and inits are included.

Also assuming that warmup is not considered here?
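The "Sample 10 -> Sample 10 -> Sample 10 == Sample 30" property can be illustrated with a small standard C++ sketch. It assumes full access to the PRNG state, uses `std::mt19937_64` purely as a stand-in generator, and draws raw engine output rather than going through a distribution object (some distributions keep internal state of their own, which would also need saving). `draw`, `save_rng`, `load_rng`, and `draw_in_chunks` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <sstream>
#include <string>
#include <vector>

// Draw n raw values from the engine, advancing its state.
std::vector<std::uint64_t> draw(std::mt19937_64& rng, int n) {
  std::vector<std::uint64_t> out;
  out.reserve(n);
  for (int i = 0; i < n; ++i) out.push_back(rng());
  return out;
}

// Save the complete engine state as text.
std::string save_rng(const std::mt19937_64& rng) {
  std::ostringstream out;
  out << rng;
  return out.str();
}

// Rebuild an engine from text produced by save_rng.
std::mt19937_64 load_rng(const std::string& text) {
  std::istringstream in(text);
  std::mt19937_64 rng;
  in >> rng;
  return rng;
}

// Draw chunks * per_chunk values, round-tripping the engine through its
// text form after every chunk, as a checkpoint and restore would.
std::vector<std::uint64_t> draw_in_chunks(unsigned seed, int chunks,
                                          int per_chunk) {
  std::mt19937_64 rng(seed);
  std::vector<std::uint64_t> all;
  for (int c = 0; c < chunks; ++c) {
    std::vector<std::uint64_t> part = draw(rng, per_chunk);
    all.insert(all.end(), part.begin(), part.end());
    rng = load_rng(save_rng(rng));  // simulated checkpoint + restore
  }
  return all;
}
```

With the same seed, `draw_in_chunks(seed, 3, 10)` yields the same 30 values as a single uninterrupted `draw` of 30. That only covers the random stream; the metric, step sizes, and current position still have to be carried across by other means.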

I think checkpointing during warmup is pretty important!


That’s a really neat project. I was thinking about it a little differently. I’ll post thoughts up soon.

@groceryheist, I’m also thinking about checkpointing during warmup. And any help, even if it’s just evaluating the goals, is useful.
