Checkpointing CmdStan sampling

maedoc · October 6, 2017, 7:41am

An HPC system I’m using has a walltime limit of 12 hours on batch jobs, which is barely enough for a model I’m working on to get out of warmup. I’d like to checkpoint the CmdStan process and restart it in a second job, but I’m not sure what the best approach is. Some alternatives I’ve thought of:

Use a generic checkpoint & restore utility, but this requires a more recent Linux kernel than is available on the systems in question
Dump Stan’s internal state to file and read back (not implemented AFAIK)
Reformat last sample in CSV to rdump format and pass as init=

The last option is easiest, but does not include the valuable information obtained by NUTS during warmup such as the mass matrix, and CmdStan doesn’t expose options to dump or load those. Could this be compensated by a short warmup step?
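A minimal sketch of the third option, assuming the usual Stan CSV layout: lines starting with `#` are comments, sampler columns end in `__`, and container elements are named like `theta.1` or `m.2.1`. The function names `last_draw` and `to_rdump` are hypothetical helpers, not part of CmdStan. The output also includes any transformed parameters and generated quantities present in the CSV, so it may need trimming before being passed as init=.

```python
import csv
import io
from collections import defaultdict


def last_draw(csv_text):
    """Return {column_name: float} for the last draw in a Stan CSV string."""
    # Drop '#' comment lines and blank lines; what remains is header + draws.
    rows = [ln for ln in csv_text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    reader = csv.reader(io.StringIO("\n".join(rows)))
    header = next(reader)
    last = None
    for last in reader:
        pass
    if last is None:
        raise ValueError("no draws found")
    return {name: float(val) for name, val in zip(header, last)}


def to_rdump(draw):
    """Format a draw as rdump text, skipping sampler columns such as lp__."""
    groups = defaultdict(dict)
    for name, value in draw.items():
        if name.endswith("__"):
            continue
        base, *idx = name.split(".")
        groups[base][tuple(int(i) for i in idx)] = value
    lines = []
    for base, cells in groups.items():
        if () in cells:  # scalar: no index suffix
            lines.append(f"{base} <- {cells[()]!r}")
            continue
        ndim = len(next(iter(cells)))
        dims = [max(ix[d] for ix in cells) for d in range(ndim)]
        # rdump arrays are column-major: the first index varies fastest.
        ordered = sorted(cells, key=lambda ix: ix[::-1])
        values = ", ".join(repr(cells[ix]) for ix in ordered)
        if ndim == 1:
            lines.append(f"{base} <- c({values})")
        else:
            dim_text = ", ".join(str(d) for d in dims)
            lines.append(
                f"{base} <- structure(c({values}), .Dim = c({dim_text}))")
    return "\n".join(lines) + "\n"
```

Writing `to_rdump(last_draw(open("first.csv").read()))` to a file would give a candidate init file; the file name here is only an example.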

sakrejda · October 6, 2017, 12:24pm

This is close but not implemented yet. I think the mass matrix and step size may be the last things that need to be plugged through the interfaces. If they were read from a file, implementing it for CmdStan would be pretty easy for someone with a day or two to spare…

bbbales2 · October 8, 2017, 7:41pm

If they were read from a file implementing it for cmdstan would be pretty easy for someone with a day or two to spare…

I’m down to try (if nobody else is doing yet). I want this feature.

Had a look at the CmdStan source. Looks like the main entry point is https://github.com/stan-dev/cmdstan/blob/develop/src/cmdstan/command.hpp?

If I did:

Add an argument to specify a mass matrix/vector file
Read in the mass matrix as a context using get_var_context (which looks like it reads the cmdstan Rdump file formats)
Pass that context to the hmc_nuts_dense_e/whatever constructors
Some sorta tests?

You (or @mitzimorris) see anything missing from that?

And what would tests for this look like?

bbbales2 · October 8, 2017, 7:50pm

And were the mass matrix and step size gonna be either both adapted or both set? Or are these separate options (I’d certainly like the second thing)?

maedoc · October 8, 2017, 7:53pm

It’d probably be useful to reify the init arg to a section, e.g.

./model init data=init.R \
        output data=first.csv state=hmc.R \
        sample

./model init data=init.R state=hmc.R \
        output data=second.csv \
        sample

mitzimorris · October 8, 2017, 8:12pm

And what would tests for this look like?

take a look at the services/sample tests:

mitzimorris · October 8, 2017, 8:15pm

this is correct.
let me know if you want help on this,
cheers,
Mitzi

sakrejda · October 8, 2017, 9:47pm

That sounds about right to me. If you do have time, I’m happy to help
figure out any problems and do code review.

sakrejda · October 8, 2017, 9:48pm

We definitely want tests to make sure that the matrix is read in correctly,
otherwise we’d be setting ourselves up for subtle bugs later.

bbbales2 · October 9, 2017, 3:51pm

@mitzimorris @sakrejda cool beans, I’ll have a go at this later in the week.

maedoc · June 26, 2018, 12:45pm

Did this go anywhere? If there’s an old branch lying around, I might be able to finish up / test.

Edit: oops, I see that one can provide a step size and metric for a NUTS run without adaptation. I guess that’s all that’s required, based on the previous posts in this topic.

bbbales2 · June 26, 2018, 1:16pm

Yup, it should be there and working in 2.17.1. Lemme know if you have any trouble getting it to work.

philophthalmus · February 7, 2019, 12:39am

I understand that this is an old conversation, but I could not find any new information on this.

Is there any documentation on checkpointing in CmdStan?

Thank you,

maedoc · February 7, 2019, 7:13am

It’s now possible to parse the mass matrix from a CSV file and use it to initialize sampling in a new chain, but there doesn’t seem to be any tool to automate that (we wrote some custom scripts). Also, this only works after warmup is done; restarting warmup would require doing the warmup schedule in the script outside of CmdStan, if it’s even possible.

wds15 · February 7, 2019, 9:09am

That’s right… I don’t think that you can checkpoint right now in the sense of exactly restarting sampling. However, you can run one decent warmup and then fire off a few chains which sample using the already computed mass matrix diagonal. This is somewhat dangerous, since the convergence diagnostics assume full independence of the chains, which is lost to some degree.

maedoc · February 7, 2019, 9:11am

Hm, this is exactly what I thought we could do: run warmup completely, save a CSV file, parse the mass matrix and last accepted sample, and then start a new chain without warmup, and initialize it with the mass matrix and last sample. Isn’t this sufficient?

wds15 · February 7, 2019, 9:17am

Last time I looked at this some bits were missing. I think the issue is the state of the random number generator, which will be different depending on whether you stop or not. However, if you define that all of your Stan runs are stopped at the end of warmup and then you restart as you say, then you can say that things are correctly restarted, yes - it just won’t be the same result as without the stop in the middle.

(at least this was my last update on this)

philophthalmus · February 7, 2019, 3:54pm

Thank you, maedoc and wds15 for a quick update on this issue.

Am I correct in thinking that the custom scripts that maedoc refers to are the source cited in a previous reply by bbbales2?

I’m just not quite sure how to:

parse the mass matrix from a CSV file and use it to initialize sampling in a new chain

Thank you,

mitzimorris · February 7, 2019, 3:58pm

are we talking about the discussion that went down for this PR? https://github.com/stan-dev/cmdstan/pull/604

you can run CmdStan with “adapt_engaged=0” (false) and “num_warmup=0” plus step size, mass matrix, param inits - but you won’t be at the same place in the RNG. you could use the same seed and use the ‘id’ as a way to advance it many skips down the road… hack hack hack
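A rough sketch of what such a restart invocation might look like, built as an argument list. The argument names (`adapt engaged=0`, `num_warmup=0`, `metric_file=`, `stepsize=`, `id=`) are written as I understand CmdStan's `sample` method and should be checked against the help output of the compiled model. The function `restart_command`, the file names, and the default of `diag_e` are assumptions for illustration.

```python
def restart_command(model, metric_file, step_size, init_file, output_file,
                    seed, chain_id, num_samples=1000, metric="diag_e",
                    data_file=None):
    """Build a CmdStan argument list that samples without warmup."""
    cmd = [
        model, "sample",
        f"num_samples={num_samples}",
        "num_warmup=0",               # skip warmup entirely
        "adapt", "engaged=0",         # keep step size and metric fixed
        "algorithm=hmc",
        f"metric={metric}",
        f"metric_file={metric_file}", # inverse metric saved from the first run
        f"stepsize={step_size}",      # step size saved from the first run
        f"id={chain_id}",             # same seed, different id: different RNG stream
    ]
    if data_file is not None:
        cmd += ["data", f"file={data_file}"]
    cmd += [
        f"init={init_file}",          # last draw of the first run
        "random", f"seed={seed}",
        "output", f"file={output_file}",
    ]
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`. As noted above, this does not reproduce the RNG state of an uninterrupted run.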

wds15 · February 8, 2019, 8:18am

To get the mass matrix you have to extract a special comment which is written into the CSV output files. Another way to get this is to use RStan, which can output the mass matrix diagonal.
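A minimal sketch of extracting that comment, assuming the adaptation block looks like `# Step size = ...` followed by a line containing `inverse mass matrix` and then one commented row of comma-separated values (diagonal metric) or several rows (dense metric). The helper names are hypothetical, and the variable name `inv_metric` is what I understand CmdStan to expect in a metric file.

```python
import re


def parse_adaptation(csv_text):
    """Return (step_size, inv_metric_rows) from Stan CSV adaptation comments."""
    lines = csv_text.splitlines()
    step_size = None
    rows = []
    for i, line in enumerate(lines):
        m = re.match(r"#\s*Step size\s*=\s*(\S+)", line)
        if m:
            step_size = float(m.group(1))
        elif line.startswith("#") and "inverse mass matrix" in line:
            # The metric rows are the commented lines that follow directly.
            for follow in lines[i + 1:]:
                body = follow.lstrip("#").strip()
                if not follow.startswith("#") or not body:
                    break
                rows.append([float(x) for x in body.split(",")])
            break
    if step_size is None or not rows:
        raise ValueError("adaptation comments not found")
    return step_size, rows


def metric_rdump(rows):
    """Format the inverse metric as rdump text (vector or square matrix)."""
    if len(rows) == 1:  # treated as a diagonal metric
        values = ", ".join(repr(v) for v in rows[0])
        return f"inv_metric <- c({values})\n"
    n = len(rows)
    # rdump matrices are column-major, so walk columns first.
    flat = [rows[r][c] for c in range(n) for r in range(n)]
    values = ", ".join(repr(v) for v in flat)
    return f"inv_metric <- structure(c({values}), .Dim = c({n}, {n}))\n"
```

The returned step size and the rdump text are the two pieces a restarted run without adaptation needs, alongside the initial values.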
