Checkpointing CmdStan sampling

An HPC system I’m using has a walltime limit of 12 hours on batch jobs, which is barely enough for a model I’m working on to get out of warmup. I’d like to checkpoint the CmdStan process and restart it in a second job, but I’m not sure what the best approach is. Some alternatives I’ve thought of:

  • Use a generic checkpoint & restore utility, but this requires a more recent Linux kernel than is available on the systems in question

  • Dump Stan’s internal state to file and read back (not implemented AFAIK)

  • Reformat last sample in CSV to rdump format and pass as init=

The last option is easiest, but does not include the valuable information obtained by NUTS during warmup such as the mass matrix, and CmdStan doesn’t expose options to dump or load those. Could this be compensated by a short warmup step?
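For the third option, a minimal Python sketch of turning the last draw of a Stan CSV file into R dump text for init= might look like this. The function names are made up for illustration. It assumes the Stan CSV conventions as I understand them: comment lines start with #, sampler columns end in __, and container elements are named like theta.1 or S.2.1 in column-major order. The CSV also holds transformed parameters and generated quantities, which this sketch does not filter out.

```python
import csv
import io
from collections import defaultdict


def last_draw(csv_text):
    """Return the last draw of a Stan CSV as {column_name: float}."""
    # Stan CSV interleaves '#' comment lines with the data; drop them first.
    rows = [ln for ln in csv_text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    draw = None
    for draw in csv.DictReader(io.StringIO("\n".join(rows))):
        pass  # keep only the final row
    if draw is None:
        raise ValueError("no draws found")
    # Columns ending in '__' (lp__, stepsize__, ...) are sampler diagnostics.
    return {k: float(v) for k, v in draw.items() if not k.endswith("__")}


def to_rdump(draw):
    """Format a draw as R dump text, regrouping 'name.i.j' columns."""
    groups = defaultdict(list)
    for col, val in draw.items():
        name, *idx = col.split(".")
        groups[name].append((tuple(int(i) for i in idx), val))
    lines = []
    for name, items in groups.items():
        ndim = len(items[0][0])
        # Assumption: CSV and R dump are both column-major, so file order is kept.
        values = ", ".join(repr(v) for _, v in items)
        if ndim == 0:
            lines.append(f"{name} <- {values}")
        elif ndim == 1:
            lines.append(f"{name} <- c({values})")
        else:
            dims = [max(ix[d] for ix, _ in items) for d in range(ndim)]
            dim_txt = ", ".join(str(d) for d in dims)
            lines.append(
                f"{name} <- structure(c({values}), .Dim = c({dim_txt}))")
    return "\n".join(lines) + "\n"
```

Writing the result of to_rdump(last_draw(text)) to a file gives something that can be passed as init= in the second job. It says nothing about the step size or mass matrix, which is the gap described above.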

This is close but not implemented yet. I think the mass matrix and step size may be the last things that need to be plugged through the interfaces. If they were read from a file, implementing it for CmdStan would be pretty easy for someone with a day or two to spare…

If they were read from a file implementing it for cmdstan would be pretty easy for someone with a day or two to spare…

I’m down to try (if nobody else is doing yet). I want this feature.

Had a look at the CmdStan source. Looks like the main entry point is https://github.com/stan-dev/cmdstan/blob/develop/src/cmdstan/command.hpp?

If I did:

  1. Add an argument to specify a mass matrix/vector file
  2. Read in the mass matrix as a context using get_var_context (which looks like it reads the cmdstan Rdump file formats)
  3. Pass that context to the hmc_nuts_dense_e/whatever constructors
  4. Some sorta tests?

You (or @mitzimorris) see anything missing from that?

And what would tests for this look like?

And were the mass matrix/stepsize either both gonna be adapted or both set? Or are these separate options (I’d certainly like the second thing)?

It’d probably be useful to reify the init arg to a section, e.g.

./model init data=init.R \
        output data=first.csv state=hmc.R \
        sample

./model init data=init.R state=hmc.R \
        output data=second.csv \
        sample

And what would tests for this look like?

take a look at the services/sample tests:

this is correct.
let me know if you want help on this,
cheers,
Mitzi

That sounds about right to me. If you do have time, I’m happy to help
figure out any problems and do code review.

We definitely want tests to make sure that the matrix is read in correctly,
otherwise we’d be setting ourselves up for subtle bugs later.

@mitzimorris @sakrejda cool beans, I’ll have a go at this later in the week.

Did this go anywhere? If there’s an old branch lying around, I might be able to finish up / test.

Edit: oops, I see that one can provide a step size and metric for a NUTS run without adaptation. I guess that’s all that’s required, based on the previous posts in this topic.

Yup, it should be there and working in 2.17.1. Lemme know if you have any trouble getting it to work.

I understand that this is an old conversation, but I could not find any newer information on this.

Is there any documentation on checkpointing in CmdStan?

Thank you,

It’s now possible to parse the mass matrix from a CSV file and use it to initialize sampling in a new chain, but there doesn’t seem to be any tool to automate that (we wrote some custom scripts). Also, this only works after warmup is done; restarting warmup would require doing the warmup schedule in the script outside of CmdStan, if it’s even possible.

That’s right… I don’t think that you can checkpoint right now in the sense of exactly restarting sampling. However, you can run one decent warmup and then fire off a few chains which sample using the already computed mass matrix diagonal. This is somewhat dangerous, since the convergence diagnostics assume full independence of the chains, which is lost to some degree when they share a warmup.

Hm, this is exactly what I thought we could do: run warmup completely, save a CSV file, parse the mass matrix and last accepted sample, and then start a new chain without warmup, and initialize it with the mass matrix and last sample. Isn’t this sufficient?

Last time I looked at this some bits were missing. I think the issue is the state of the random number generator, which will be different depending on whether you stop or not. However, if you define that all of your Stan runs are stopped at the end of warmup and then you restart as you say, then you can say that things are correctly restarted, yes - it just won’t be the same result as a run without the stop in the middle.

(at least this was my last update on this)

Thank you, maedoc and wds15 for a quick update on this issue.

Am I correct in thinking that the custom scripts that maedoc refers to are the source cited in a previous reply by bbbales2?

I’m just not quite sure how to:

parse the mass matrix from a CSV file and use it to initialize sampling in a new chain

Thank you,

are we talking about the discussion that went down for this PR? https://github.com/stan-dev/cmdstan/pull/604

you can run CmdStan with “adapt_engaged=0” (false) and “num_warmup=0” plus step size, mass matrix, param inits - but you won’t be at the same place in the RNG. you could use the same seed and use the ‘id’ as a way to advance it many skips down the road… hack hack hack
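As a rough sketch of what such a restart invocation could look like, here is a small Python helper that assembles the argument list. The helper name and all file names are hypothetical. The argument spellings (adapt engaged=0, metric_file=, stepsize=, random seed=, id=) are how I understand the CmdStan command line, so check them against the help output of your own build before relying on this.

```python
def restart_command(model, data, init, metric, stepsize, output,
                    seed, chain_id, num_samples=1000):
    """Build a CmdStan argument list for sampling without warmup."""
    return [
        model,
        "sample",
        f"num_samples={num_samples}",
        "num_warmup=0",               # skip warmup entirely
        "adapt", "engaged=0",         # keep step size and metric fixed
        "algorithm=hmc", "engine=nuts",
        "metric=diag_e",              # assumes a diagonal metric was adapted
        f"metric_file={metric}",
        f"stepsize={stepsize}",
        f"id={chain_id}",             # same seed, different id: different RNG stream
        "data", f"file={data}",
        f"init={init}",
        "random", f"seed={seed}",
        "output", f"file={output}",
    ]


if __name__ == "__main__":
    cmd = restart_command("./model", "data.R", "init.R", "metric.R",
                          0.5, "second.csv", seed=1234, chain_id=2)
    print(" ".join(cmd))
```

The list can be handed to subprocess.run or printed into a batch script. As noted above, this does not reproduce the RNG state of an uninterrupted run.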

To get the mass matrix you have to extract a special comment which is written into the CSV output files. Another way to get this is to use RStan, which can output the mass matrix diagonal.
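A minimal Python sketch of pulling that comment out is below. It assumes the adaptation block looks the way I remember it from CmdStan output: a "# Step size = ..." line, then a header containing "inverse mass matrix" (prefixed with "Diagonal elements of" for diag_e), then one or more comment lines of comma-separated numbers. The function names are made up, and the variable name inv_metric in the output is what I believe metric_file expects. Treat both as assumptions to verify.

```python
def parse_adaptation(csv_text):
    """Return (stepsize, kind, rows) from the comments of a Stan CSV."""
    stepsize, kind, rows, in_metric = None, None, [], False
    for line in csv_text.splitlines():
        if not line.startswith("#"):
            in_metric = False          # data rows end the metric block
            continue
        body = line.lstrip("#").strip()
        if body.startswith("Step size"):
            stepsize = float(body.split("=")[1])
        elif "inverse mass matrix" in body:
            kind = "diag" if "Diagonal" in body else "dense"
            in_metric = True
        elif in_metric and body:
            rows.append([float(x) for x in body.split(",")])
        else:
            in_metric = False
    return stepsize, kind, rows


def metric_rdump(kind, rows):
    """Format the inverse metric as R dump text (name inv_metric assumed)."""
    if kind == "diag":
        return "inv_metric <- c(" + ", ".join(repr(x) for x in rows[0]) + ")\n"
    n = len(rows)
    # R dump stores matrices column-major, so walk columns first.
    flat = [rows[i][j] for j in range(n) for i in range(n)]
    values = ", ".join(repr(x) for x in flat)
    return f"inv_metric <- structure(c({values}), .Dim = c({n}, {n}))\n"
```

The step size returned here is the value to pass as stepsize=, and the R dump text is what would go into the metric file for the restarted chain.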