An HPC system I’m using has a walltime limit of 12 hours on batch jobs, which is barely enough for a model I’m working on to get out of warmup. I’d like to checkpoint the CmdStan process and restart it in a second job, but I’m not sure what the best approach is. Some alternatives I’ve thought of:
Use a generic checkpoint & restore utility, but this requires a more recent Linux kernel than is available on the systems in question
Dump Stan’s internal state to file and read back (not implemented AFAIK)
Reformat last sample in CSV to rdump format and pass as init=
The last option is easiest, but does not include the valuable information obtained by NUTS during warmup such as the mass matrix, and CmdStan doesn’t expose options to dump or load those. Could this be compensated by a short warmup step?
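For the third option, here is a minimal sketch of the CSV-to-rdump conversion, assuming the usual Stan CSV layout (comment lines starting with `#`, one header row, array elements named like `beta.1` or `m.2.1` in column-major order). The function name `last_draw_to_rdump` is made up for illustration. It does not distinguish parameters from transformed parameters or generated quantities, so the result may contain more variables than the init actually needs.

```python
import csv
import io


def last_draw_to_rdump(csv_text):
    """Return the last draw of a Stan CSV as R dump text (hypothetical helper)."""
    # Drop blank lines and the '#' comment lines that Stan interleaves with data.
    rows = [ln for ln in csv_text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    table = list(csv.reader(io.StringIO("\n".join(rows))))
    header, last = table[0], table[-1]

    # Group columns by base name: "beta.1", "beta.2" -> "beta" with indices.
    params = {}
    for name, value in zip(header, last):
        if name.endswith("__"):  # sampler columns such as lp__, stepsize__
            continue
        base, *idx = name.split(".")
        params.setdefault(base, []).append((tuple(int(i) for i in idx), value))

    lines = []
    for base, entries in params.items():
        n_dims = len(entries[0][0])
        values = ", ".join(v for _, v in entries)  # keep original text precision
        if n_dims == 0:
            lines.append(f"{base} <- {values}")
        elif n_dims == 1:
            lines.append(f"{base} <- c({values})")
        else:
            # The largest index seen along each axis gives the array shape.
            dims = [max(ix[d] for ix, _ in entries) for d in range(n_dims)]
            dim_txt = ", ".join(str(d) for d in dims)
            lines.append(f"{base} <- structure(c({values}), .Dim = c({dim_txt}))")
    return "\n".join(lines) + "\n"
```

Writing the returned text to a file and passing that file as `init=` would start the second job from the last saved draw, but without any of the adaptation information.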
This is close but not implemented yet. I think the mass matrix and step size may be the last pieces that need to be plugged through the interfaces. If they were read from a file, implementing it for CmdStan would be pretty easy for someone with a day or two to spare…
It’s now possible to parse the mass matrix from a CSV file and use it to initialize sampling in a new chain, but there doesn’t seem to be any tool to automate that (we wrote some custom scripts). Also, this only works after warmup is done; restarting warmup would require doing the warmup schedule in the script outside of CmdStan, if it’s even possible.
That’s right… I don’t think that you can checkpoint right now in the sense of exactly restarting sampling. However, you can run one decent warmup and then fire off a few chains which sample using the already computed diagonal mass matrix. This is somewhat dangerous, since the convergence diagnostics assume fully independent chains, and that independence is partly lost when the chains share one warmup.
Hm, this is exactly what I thought we could do: run warmup completely, save a CSV file, parse the mass matrix and last accepted sample, and then start a new chain without warmup, and initialize it with the mass matrix and last sample. Isn’t this sufficient?
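A rough sketch of the parsing step, assuming a diagonal metric and that the CSV contains the usual adaptation comments (`# Step size = ...` followed by `# Diagonal elements of inverse mass matrix:` and one line of values). The helper names are invented here, and the `inv_metric` variable name in the rdump output is what I believe CmdStan expects in a metric file, so check it against your version.

```python
def read_adaptation(csv_text):
    """Extract (step_size, inverse metric diagonal) from Stan CSV comments."""
    step_size = None
    inv_metric = None
    lines = csv_text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("# Step size"):
            step_size = float(line.split("=")[1])
        elif line.startswith("# Diagonal elements of inverse mass matrix"):
            # The values sit on the next comment line, comma separated.
            values = lines[i + 1].lstrip("# ")
            inv_metric = [float(x) for x in values.split(",")]
    return step_size, inv_metric


def metric_rdump(inv_metric):
    """Format the diagonal as R dump text (variable name is an assumption)."""
    values = ", ".join(repr(x) for x in inv_metric)
    return f"inv_metric <- structure(c({values}), .Dim = c({len(inv_metric)}))\n"
```

Both values come back as `None` if the file has no adaptation block, for example when the job was killed before warmup finished.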
Last time I looked at this some bits were missing. I think the issue is the state of the random number generator, which will be different depending on whether you stop or not. However, if you define that all of your Stan runs are stopped at the end of warmup and then you restart as you say, then you can say that things are correctly restarted, yes - it just won’t be the same result as a run without the stop in the middle.
You can run CmdStan with “adapt_engaged=0” (false) and “num_warmup=0” plus step size, mass matrix, and param inits, but you won’t be at the same place in the RNG. You could use the same seed and use the ‘id’ as a way to advance it many skips down the road… hack hack hack
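Something like the following could assemble that restart command. It is only a sketch: the function and its parameters are hypothetical, and the argument spellings (`adapt engaged=0`, `metric_file=`, `stepsize=`, `id=`) are written from memory of CmdStan's nested argument syntax, so compare them with the help output of the installed version before relying on them.

```python
def restart_command(exe, data_file, init_file, metric_file, step_size,
                    seed, chain_id, output_file, num_samples=1000):
    """Build a CmdStan argument list for sampling with no warmup or adaptation."""
    return [
        exe, "sample",
        f"num_samples={num_samples}",
        "num_warmup=0",                # warmup already happened in the first job
        "adapt", "engaged=0",          # keep step size and metric fixed
        "algorithm=hmc",
        "metric=diag_e",
        f"metric_file={metric_file}",  # inverse metric saved from the first job
        f"stepsize={step_size}",
        f"id={chain_id}",              # same seed, different id moves the RNG on
        "data", f"file={data_file}",
        f"init={init_file}",           # last draw of the first job
        "random", f"seed={seed}",
        "output", f"file={output_file}",
    ]
```

The list can be handed to `subprocess.run` in the second batch job. Reusing the seed with a new `id` gives a different RNG stream, not the one an uninterrupted run would have continued with.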