An HPC system I’m using has a walltime limit of 12 hours on batch jobs, which is barely enough for a model I’m working on to get out of warmup. I’d like to checkpoint the CmdStan process and restart it in a second job, but I’m not sure what the best approach is. Some alternatives I’ve thought of:
Use a generic checkpoint & restore utility, but this requires a more recent Linux kernel than is available on the systems in question
Dump Stan’s internal state to a file and read it back (not implemented AFAIK)
Reformat the last sample in the output CSV to rdump format and pass it as init=
The last option is easiest, but it does not carry over the valuable information obtained by NUTS during warmup, such as the mass matrix, and CmdStan doesn’t expose options to dump or load those. Could this be compensated for by a short warmup step in the second job?
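To make the last option concrete, here is a minimal Python sketch of the CSV-to-rdump step. It assumes the usual Stan CSV layout: comment lines starting with `#`, one header row, sampler columns whose names end in `__`, and container elements named like `theta.1` or `S.2.1`. The function name `last_draw_to_rdump` is made up for illustration. Note that the CSV also holds transformed parameters and generated quantities, which this sketch does not filter out.

```python
import csv
import io
from itertools import product


def last_draw_to_rdump(csv_text):
    """Convert the last draw in Stan CSV text to rdump-format text (sketch)."""
    # Drop blank lines and '#' comment lines, then parse the rest as CSV.
    data_lines = (
        line for line in io.StringIO(csv_text)
        if line.strip() and not line.startswith("#")
    )
    rows = list(csv.reader(data_lines))
    header, last = rows[0], rows[-1]

    # Group columns by variable: "S.2.1" -> name "S", index (2, 1).
    variables = {}
    for col, val in zip(header, last):
        name, *idx = col.split(".")
        if name.endswith("__"):  # lp__, stepsize__, ... are not parameters
            continue
        variables.setdefault(name, {})[tuple(int(i) for i in idx)] = val

    lines = []
    for name, cells in variables.items():
        if () in cells:  # scalar: no index suffix
            lines.append(f"{name} <- {cells[()]}")
            continue
        ndim = len(next(iter(cells)))
        dims = [max(i[d] for i in cells) for d in range(ndim)]
        # Emit values in column-major order (first index varies fastest).
        order = [
            tuple(reversed(t))
            for t in product(*(range(1, n + 1) for n in reversed(dims)))
        ]
        flat = ", ".join(cells[i] for i in order)
        if ndim == 1:
            lines.append(f"{name} <- c({flat})")
        else:
            dim_str = ", ".join(str(n) for n in dims)
            lines.append(
                f"{name} <- structure(c({flat}), .Dim = c({dim_str}))"
            )
    return "\n".join(lines) + "\n"
```

The returned text would be written to a file and that file passed as the init= value in the second job. As noted above, this restores only the parameter values, not the adapted step size or mass matrix.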