Resume sampling after interruption

I’m using cmdstanpy and would like to be able to resume sampling after losing my session. This could help save costs by allowing slower models to run on AWS spot instances. Since I can’t find any concrete examples out there, I’d appreciate some feedback on how this should be done.

This seems to work for the Bernoulli example in the docs:

import cmdstanpy
from pathlib import Path
import json

seed = 98386501

stan_file = Path(cmdstanpy.cmdstan_path()) / 'examples' / 'bernoulli' / 'bernoulli.stan'
model = cmdstanpy.CmdStanModel(stan_file=stan_file)
data = {"N": 10, "y": [0,1,0,0,0,0,0,0,0,1]}

sample = model.sample(data=data, iter_warmup=500, iter_sampling=2, chains=1, seed=seed)

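# Adapted inverse metric and step size from the first run.
# With one chain and one parameter (theta), each reduces to a single-element list.
metric = sample.metric.T[0].tolist()
step_size = sample.stepsize.tolist()
# The last draw becomes the starting point of the resumed run.
inits = sample.get_drawset().iloc[-1].to_dict()

# CmdStan reads the metric from a JSON file with an "inv_metric" entry.
metric_file = "metric.json"
with open(metric_file, 'w') as f:
    json.dump(dict(inv_metric=[metric[0]]), f)

# Skip warmup and adaptation; reuse the step size, metric and last draw instead.
magic = {
    "step_size": step_size[0],
    "metric": metric_file,
    "adapt_engaged": False,
    "iter_warmup": 0,
    "inits": inits
}

# Reuse the compiled model and change the seed so the resumed run does not repeat the same random stream.
sample_resumed = model.sample(data=data, **magic, iter_sampling=1000, chains=1, seed=seed+1)
sample_resumed.summary()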

The results are reasonable and look like they fit to the data correctly.

My questions:

  1. Is this doing what I think it’s doing? I.e. continuing where the first sample left off?
  2. The sample gives us a metric and step size per chain. How can I run model.sample again with a different metric and step size for each chain? Are the JSON files supposed to be separate for each chain?
  3. The above assumes the warmup can be completed fully. Is there any way to also resume an incomplete warmup?

Operating System: Ubuntu 19.10
Interface: cmdstanpy 0.9.5

It’s starting off from something that finished sampling.

It looks reasonable, but there are a couple of problems:

  1. If warmup hadn’t finished, then you wouldn’t have a metric.

  2. I don’t know if there’s any way to cleanly stop a cmdstanpy job if things get interrupted partway through (I don’t know how spot instances work).

  3. I’m scared partial output might break this process.

There’s a checkpointing thread over here that has some info: Current state of checkpointing in Stan

There is something in cmdstan called the diagnostic file (I think recently [last few days] it has been renamed the latent dynamics file, but I don’t know if that’s made it to cmdstanpy yet) that could be used to do this.

@mitzimorris is there a way to provide a stepsize per-chain in cmdstanpy?

I would call sample from a try-except block and “manually” save the partial results. It would then probably make sense to handle the partial results as one-chain samples in different threads (manually doing what we do for multiple chains).
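
As a rough illustration of salvaging partial results, here is a minimal sketch that reads a possibly truncated Stan CSV output file with only the standard library. The function names read_partial_stan_csv and draw_to_inits are made up for this example and are not part of cmdstanpy. It assumes the usual Stan CSV layout: comment lines start with "#", the adapted step size appears as "# Step size = ...", and a diagonal metric appears on the comment line after "# Diagonal elements of inverse mass matrix:". Dense metrics are not handled, and a line without a trailing newline is treated as cut off mid-write.

```python
import csv


def read_partial_stan_csv(csv_path):
    """Return (step_size, inv_metric_diag, last_draw) from a Stan CSV file.

    Each element is None if the file was cut off before it was written.
    """
    step_size, inv_metric, header, last_draw = None, None, None, None
    metric_follows = False
    with open(csv_path, newline="") as f:
        for raw in f:
            if not raw.endswith("\n"):
                break  # incomplete final line: the writer was interrupted
            line = raw.strip()
            if not line:
                continue
            if line.startswith("#"):
                text = line.lstrip("#").strip()
                if metric_follows and text:
                    inv_metric = [float(x) for x in text.split(",")]
                    metric_follows = False
                elif text.startswith("Step size"):
                    step_size = float(text.split("=")[1])
                elif text.startswith("Diagonal elements of inverse mass matrix"):
                    metric_follows = True
                continue
            fields = next(csv.reader([line]))
            if header is None:
                header = fields  # first non-comment line holds the column names
            elif len(fields) == len(header):
                last_draw = dict(zip(header, map(float, fields)))
    return step_size, inv_metric, last_draw


def draw_to_inits(draw):
    """Drop sampler columns such as lp__; only valid for scalar parameters."""
    return {name: value for name, value in draw.items() if not name.endswith("__")}
```

If step_size or inv_metric comes back as None, adaptation had not finished when the file was cut off, so there is nothing to resume from except the last draw. Vector or matrix parameters are spread over several columns and would need to be reassembled before being passed as inits.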

We should probably create some examples.

yes - https://cmdstanpy.readthedocs.io/en/latest/api.html#cmdstanpy.CmdStanModel.sample

  • step_size – Initial stepsize for HMC sampler. The value is either a single number or a list of numbers which will be used as the global or per-chain initial step_size, respectively. The length of the list of step sizes must match the number of chains.

there are also accessor functions to retrieve the step size and mass matrix from the resulting CmdStanMCMC object.
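
For the per-chain case, a minimal sketch of writing one metric file per chain might look like the following. The helper write_per_chain_metrics and the metric-<chain>.json file names are invented for this example. It assumes a diagonal metric with one row per chain, and that the metric argument of sample accepts a list of file paths whose length matches the number of chains, which is how I read the API docs.

```python
import json
from pathlib import Path


def write_per_chain_metrics(inv_metrics, out_dir="."):
    """Write one JSON file per chain, each with that chain's diagonal inv_metric.

    inv_metrics: iterable with one row of diagonal elements per chain.
    Returns the list of file paths, in chain order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for chain, diag in enumerate(inv_metrics, start=1):
        path = out_dir / f"metric-{chain}.json"  # hypothetical naming scheme
        with open(path, "w") as f:
            json.dump({"inv_metric": [float(x) for x in diag]}, f)
        paths.append(str(path))
    return paths


# Intended use (not run here; assumes `sample` came from a finished warmup
# and that sample.metric has shape (chains, parameters)):
#
#   metric_files = write_per_chain_metrics(sample.metric)
#   resumed = model.sample(
#       data=data,
#       chains=len(metric_files),
#       metric=metric_files,                 # one file per chain
#       step_size=sample.stepsize.tolist(),  # one step size per chain
#       adapt_engaged=False,
#       iter_warmup=0,
#   )
```

Separate files keep each chain's adaptation result distinct, which matters if the chains adapted to noticeably different metrics.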

this API is still in beta and we’re still getting the names right - as of the 1.0 release metric will be inv_metric and the CmdStanMCMC property name stepsize will be step_size. apologies in advance for the disruption.

yes, this is available in CmdStanPy - arg to sample method:

  • save_diagnostics – Whether or not to save diagnostics. If True, csv output files are written to <basename>-diagnostic-<chain_id>.csv, where <basename> is set with csv_basename.