Current state of checkpointing in Stan

maedoc · November 12, 2020, 9:59pm

I have now tried both with CmdStan 2.17 (I need to update…), and both checkpoint and restore successfully, except that CRIU needs root to restore (though a more recent version might be able to restore as non-root).

Did you have segfaults immediately or only when you tried longer runs?

groceryheist · November 12, 2020, 10:06pm

Here’s the bug report I sent to dmtcp: https://sourceforge.net/p/dmtcp/mailman/message/36959396/

I would get crashes in a sequence of checkpoint -> restore -> checkpoint -> restore.
The second (or sometimes third) checkpoint would fail.
My use case was fitting models on a Slurm backfill queue, so I needed to be able to checkpoint and restore repeatedly without issue.

syclik · November 17, 2020, 2:26pm

Does anyone want to collaborate on writing a version of the algorithm that enables checkpointing?

I believe that to do this right, we actually have to rewrite the algorithm so that it has a clearly defined state that can be serialized and deserialized. There aren’t too many pieces of state that we need to capture, but the way it’s designed and implemented now, it’s going to be really hard to deserialize that information and plug it back into the algorithm where it matters.

If anyone would like to help, we can start a thread on just that and work towards an implementation that works.
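To make "a clearly defined state that can be serialized and deserialized" concrete, here is a minimal sketch in Python. It is not Stan's algorithm: it swaps in a toy random-walk Metropolis sampler on a standard normal target, and `SamplerState`, `step`, `serialize`, and `deserialize` are hypothetical names chosen for illustration. The point is only the shape: each transition is a function from state to state, so a checkpoint is just the state written out.

```python
import json
import math
import random
from dataclasses import asdict, dataclass


@dataclass
class SamplerState:
    """Everything the toy sampler needs to resume (hypothetical layout)."""
    position: float    # current draw
    step_size: float   # proposal scale; stands in for adapted tuning parameters
    iteration: int     # number of completed transitions
    rng_state: tuple   # full PRNG state from random.Random.getstate()


def log_density(x: float) -> float:
    """Standard normal target, up to an additive constant."""
    return -0.5 * x * x


def init_state(seed: int) -> SamplerState:
    rng = random.Random(seed)
    return SamplerState(0.0, 1.0, 0, rng.getstate())


def step(state: SamplerState) -> SamplerState:
    """One Metropolis transition: old state in, new state out, no hidden state."""
    rng = random.Random()
    rng.setstate(state.rng_state)
    proposal = state.position + rng.gauss(0.0, state.step_size)
    log_ratio = log_density(proposal) - log_density(state.position)
    # 1 - random() lies in (0, 1], so the log is always defined.
    accept = math.log(1.0 - rng.random()) < log_ratio
    position = proposal if accept else state.position
    return SamplerState(position, state.step_size, state.iteration + 1, rng.getstate())


def serialize(state: SamplerState) -> str:
    """Write the state as JSON (tuples become lists)."""
    return json.dumps(asdict(state))


def deserialize(text: str) -> SamplerState:
    """Rebuild the state; setstate() needs the PRNG vector back as a tuple."""
    raw = json.loads(text)
    version, internal, gauss_next = raw["rng_state"]
    raw["rng_state"] = (version, tuple(internal), gauss_next)
    return SamplerState(**raw)
```

With this layout, a state that goes through `deserialize(serialize(state))` continues exactly as the original would have. The hard part described above is that the real sampler's state is not gathered in one place like this.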

groceryheist · November 17, 2020, 8:51pm

Not sure how much help I can be, as my C++ is very rusty. But if successful, this effort could greatly benefit me, so I’ll follow along and try to support it if I can.

maedoc · November 17, 2020, 9:47pm

It would be great to get it working, but there seem to be a handful of design decisions to make first. Should the checkpoint format be stable? Should it be a known format, or should we just fwrite a struct from memory? Should checkpointing be threaded through the various APIs, or just be a signal handler in CmdStan? Should a checkpoint correspond to an accepted proposal or to a leapfrog step? How should resuming from a checkpoint be handled if the output CSV disappears? Etc.

Things like CRIU provide a solution at a lower level of abstraction, where these questions go away or are already answered. If the platform supports CRIU, then it’s hard to get motivated to write up a rationale for all the questions above, much less to make the algorithms themselves harder to maintain.
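For a sense of what the "signal handler" option could look like at the application level, here is a rough Python sketch under several assumptions: it is POSIX-only (it uses `SIGUSR1`), the loop body is a placeholder rather than a real sampler transition, and `run`, `request_checkpoint`, `install_handler`, and `write_atomically` are hypothetical names, not CmdStan APIs. The handler only sets a flag; the loop writes the checkpoint at the next iteration boundary, which corresponds to the "accepted proposal" granularity rather than the leapfrog step.

```python
import json
import os
import signal
import tempfile

checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only record the request; the loop does the real work."""
    global checkpoint_requested
    checkpoint_requested = True


def install_handler():
    """Ask for checkpoints with `kill -USR1 <pid>` (POSIX only)."""
    signal.signal(signal.SIGUSR1, request_checkpoint)


def write_atomically(path, payload):
    """Write to a temporary file, then rename, so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def run(n_iterations, checkpoint_path, on_iteration=None):
    """Toy loop; returns the number of checkpoints written."""
    global checkpoint_requested
    written = 0
    total = 0
    for i in range(n_iterations):
        total += i  # placeholder for one sampler transition
        if on_iteration is not None:
            on_iteration(i)  # hook used here only to simulate an external signal
        if checkpoint_requested:
            # Iteration boundary: the state is consistent, so it is safe to save.
            checkpoint_requested = False
            write_atomically(checkpoint_path, {"iteration": i + 1, "total": total})
            written += 1
    return written
```

The write-then-rename step is one possible answer to partially written checkpoints. It does nothing for the case where the output CSV disappears, which would need its own policy.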

ahartikainen · November 17, 2020, 10:11pm

Are we talking here about having access to the PRNG state (or just advancing it)?

Sample 10 -> Sample 10 -> Sample 10 == Sample 30

And then the rest of the work is for the user to be sure that the metric + step sizes / inits are included.

Also, I assume that warmup is not considered here?
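The "Sample 10 -> Sample 10 -> Sample 10 == Sample 30" property can be illustrated with a small Python sketch. It assumes a toy random-walk chain instead of a Stan sampler, and `sample` and `run_segments` are hypothetical names. Each segment returns the full PRNG state along with the last position, and the next segment starts from both.

```python
import random


def sample(n, position, rng_state):
    """Draw n random-walk steps; return the draws plus what the next segment needs."""
    rng = random.Random()
    rng.setstate(rng_state)
    draws = []
    for _ in range(n):
        position += rng.gauss(0.0, 1.0)
        draws.append(position)
    return draws, position, rng.getstate()


def run_segments(seed, segment_lengths):
    """Run the chain in segments, handing position and PRNG state across each boundary."""
    position = 0.0
    rng_state = random.Random(seed).getstate()
    draws = []
    for n in segment_lengths:
        segment, position, rng_state = sample(n, position, rng_state)
        draws.extend(segment)
    return draws


# Three segments of 10 reproduce a single run of 30 exactly.
assert run_segments(42, [10, 10, 10]) == run_segments(42, [30])
```

The equality holds only because the whole PRNG state is carried across the boundary. In a real sampler, the metric and step size would have to travel with it in the same way.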

groceryheist · November 17, 2020, 11:07pm

I think checkpointing during warmup is pretty important!

syclik · November 18, 2020, 9:54pm

That’s a really neat project. I was thinking about it a little differently. I’ll post thoughts up soon.

@groceryheist, I’m also thinking about checkpointing during warmup. And any help, even if it’s evaluating the goals, is useful.
