I am currently running computationally demanding Stan models using CmdStan on a Linux server. Since the models take longer to run than the maximum time limit of a batch job, the server host recommended that I use checkpointing. I found the tool DMTCP, which supports checkpointing for R. Would this also work with RStan or CmdStanR?
In general, I would prefer to continue working with CmdStan, but there is no option for checkpointing, correct?
I would appreciate any suggestions and/or references to tutorials.
Checkpointing can mean slightly different things:
1. Resuming sampling such that the entire state of the random number generator is reinstated, yielding numerically identical samples as if you hadn’t stopped.
2. Resuming sampling such that you don’t “lose work” in the sampler’s efforts to adapt, but without reinstating the full RNG state, thereby yielding functionally equivalent but not numerically identical samples as if you hadn’t stopped.
I don’t believe that any of the interfaces achieve 1, and while you can do 2, it’s a bit of a manual process at present.
what @mike-lawrence said is correct - none of the interfaces support checkpointing.
first off - there’s always the “folk theorem” question - maybe there’s a problem with your model - see https://arxiv.org/pdf/2011.01808.pdf, section 5.1
is your model taking a long time during warmup? do you have confidence that the model has converged during warmup?
in theory, you can continue sampling by initializing the parameters and setting the step size and inverse metric - the CmdStanPy interface lets you access the model parameters as properly structured Python variables via its stan_variables method, which means that you could, in theory, take the last draw from your sample, dump it to a JSON dict, and use that to initialize your parameters, then continue running post-warmup with the specified step_size, metric, and init params.
as Mike said, not automatic.
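The manual process described above might be sketched roughly as follows. The helper that extracts the last draw is plain Python; the CmdStanPy calls are left commented as an untested outline, and the file names (`model.stan`, `data.json`, `inits.json`) are placeholders:

```python
# Sketch of a manual "checkpoint": pull the last draw, step size, and
# inverse metric out of a finished CmdStanPy fit and reuse them to
# continue sampling without re-running warmup.
import json


def last_draw_inits(stan_vars):
    """Take fit.stan_variables() (a dict of draws-first arrays) and
    return a JSON-serializable dict holding only the final draw."""
    out = {}
    for name, draws in stan_vars.items():
        last = draws[-1]
        # numpy arrays need .tolist() for JSON; scalars pass through
        out[name] = last.tolist() if hasattr(last, "tolist") else last
    return out


# --- usage with CmdStanPy (untested outline) ---
# from cmdstanpy import CmdStanModel
# model = CmdStanModel(stan_file="model.stan")
# fit1 = model.sample(data="data.json", chains=1)       # first batch
# inits = last_draw_inits(fit1.stan_variables())
# with open("inits.json", "w") as f:
#     json.dump(inits, f)
# fit2 = model.sample(                                  # continue sampling
#     data="data.json",
#     chains=1,
#     inits="inits.json",
#     step_size=fit1.step_size[0],              # reuse adapted step size
#     metric={"inv_metric": fit1.metric[0]},    # reuse adapted inv. metric
#     adapt_engaged=False,                      # skip warmup entirely
#     iter_warmup=0,
# )
```

Note that `stan_variables()` returns transformed parameters and generated quantities alongside the parameters proper; as discussed below, Stan should ignore the extras when the dict is used as an inits file.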
It’s slightly awkward but possible using (extending) CmdStanPy.
To make this semi-automatic, the wrapper needs to know what the parameters are called, although this might not actually be necessary.
not sure what you mean? if you supply a JSON file containing a dict over all Stan program variables as the initial parameter variables file, then the Stan I/O should do the right thing - supply the parameters, ignore the other variables.
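For illustration, such an inits file might look like this (variable names here are hypothetical; `mu` and `sigma` stand in for parameters, `y_rep` for a generated quantity that Stan would ignore when initializing):

```python
# Write a JSON inits file containing one full draw over all Stan
# program variables; Stan's reader picks out the parameters it needs
# and ignores the rest.
import json

draw = {"mu": 0.5, "sigma": 1.2, "y_rep": [0.4, 0.7, 0.3]}
with open("inits.json", "w") as f:
    json.dump(draw, f)
```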
Yes, I wasn’t sure whether Stan would complain, hence
That being said, if there are few parameters but many many transformed parameters and generated quantities things might get awkward.
agreed - currently very clunky.