Saving & reusing adaptation in cmdstanr

@ahartikainen would it be possible for you to put your cmdstanpy stuff in a loop and check a bunch of seeds to see if there’s a difference in the divergence rates between all-in-one and warmup-then-sampling approaches? If not, then the culprit would still have to be cmdstanr rather than cmdstan itself.

Oh, and if you don’t mind, can you use the non-centered schools model? That way we have the expectation of fewer divergences.

Mean (and standard deviation of the) number of divergences per chain, 5000 draws per chain, 300 × 4 chains:

group: (mean, sd)
warmup then sampling (two-stage): (3.40, 10.18)
warmup + sampling (single run): (3.57, 13.23)

# Compare divergence rates: two-stage runs (warmup-only, then sampling-only with the
# saved adaptation) vs. single warmup+sampling runs, non-centered 8-schools model.
import json
from gc import collect

import arviz as az
import cmdstanpy
import matplotlib.pyplot as plt
import pandas as pd


def get_inits(fit, checkpoint_name="init_checkpoint"):
    # Write the final (warmup) draw of each chain to a per-chain JSON file
    # that can be passed back to CmdStan as inits.
    params_bool = [not item.endswith("__") for item in fit.column_names]
    params = [item for item in fit.column_names if not item.endswith("__")]
    names = []
    for i, init in enumerate([item[params_bool].tolist() for item in fit.draws(inc_warmup=True)[-1]], 1):
        names.append(checkpoint_name + f"_chain_{i}.json")
        with open(names[-1], "w") as f:
            json.dump(dict(zip(params, init)), f)
    return names


def get_stepsize(fit):
    # Adapted step size for each chain.
    return fit.stepsize.tolist()


def get_metric(fit, checkpoint_name="metric_checkpoint"):
    # Write each chain's adapted inverse metric to a per-chain JSON file.
    names = []
    for i, metric in enumerate([item.tolist() for item in fit.metric], 1):
        names.append(checkpoint_name + f"_chain_{i}.json")
        with open(names[-1], "w") as f:
            json.dump({"inv_metric": metric}, f)
    return names


schools_code = """
data {
  int<lower=0> J;         // number of schools
  real y[J];              // estimated treatment effects
  real<lower=0> sigma[J]; // standard error of effect estimates
}
parameters {
  real mu;                // population treatment effect
  real<lower=0> tau;      // standard deviation in treatment effects
  vector[J] eta;          // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;        // school treatment effects
}
model {
  target += normal_lpdf(eta | 0, 1);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
}
"""


with open("schools_code.stan", "w") as f:
    print(schools_code, file=f)


model = cmdstanpy.CmdStanModel(stan_file="schools_code.stan", exe_file="schools_code.exe")


schools_data = {"J": 8,
                "y": [28,  8, -3,  7, -1,  1, 18, 12],
                "sigma": [15, 10, 16, 11,  9, 11, 10, 18]
               }


# Two-stage runs: warmup only, then sampling only, reusing the saved step size,
# inverse metric, and final warmup draw as inits.
total_sum_divergences = []
summaries = []
for seed in range(300):
    fit_warmup = model.sample(
        iter_warmup=1000,
        seed=seed+1,
        iter_sampling=0,
        data=schools_data,
        save_warmup=True,
        show_progress="notebook",
    )

    inits = get_inits(fit_warmup, checkpoint_name=f"init_checkpoint_round_{seed}")
    stepsize = get_stepsize(fit_warmup)
    metric = get_metric(fit_warmup, checkpoint_name=f"metric_checkpoint_round_{seed}")

    fit_samples = model.sample(
        data=schools_data,
        seed=seed+2,
        inits=inits,
        step_size=stepsize,
        metric=metric,
        iter_warmup=0,
        adapt_engaged=False,
        iter_sampling=5000,
    )

    idata = az.from_cmdstanpy(fit_samples)
    sum_divergences = dict(enumerate(idata.sample_stats.diverging.sum(dim="draw").values.tolist()))

    total_sum_divergences.append(sum_divergences)
    summaries.append(az.summary(idata, var_names=["mu", "tau"]))
    del fit_samples
    del idata
    collect()


# Range (max - min) of each posterior summary statistic across the 300 two-stage runs.
summary_min_max = pd.concat(summaries).reset_index().groupby("index").apply(lambda x: x.max(numeric_only=True) - x.min(numeric_only=True))

# Reference runs: warmup and sampling together in a single call.
total_sum_divergences_all = []
summaries_all = []
for seed in range(300):
    fit = model.sample(
        data=schools_data,
        seed=seed+1,
        iter_warmup=1000,
        save_warmup=True,
        adapt_engaged=True,
        iter_sampling=5000,
    )

    idata = az.from_cmdstanpy(fit)
    sum_divergences = dict(enumerate(idata.sample_stats.diverging.sum(dim="draw").values.tolist()))

    total_sum_divergences_all.append(sum_divergences)
    summaries_all.append(az.summary(idata, var_names=["mu", "tau"]))
    del fit
    del idata
    collect()

# Mean and sd of per-chain divergence counts: two-stage runs.
print(
    pd.DataFrame(total_sum_divergences).values.ravel().mean(),
    pd.DataFrame(total_sum_divergences).values.ravel().std()
)


# Mean and sd of per-chain divergence counts: single warmup+sampling runs.
print(
    pd.DataFrame(total_sum_divergences_all).values.ravel().mean(),
    pd.DataFrame(total_sum_divergences_all).values.ravel().std()
)


plt.figure(figsize=(10,4), dpi=100)
plt.hist(pd.DataFrame(total_sum_divergences).values.ravel(), bins=100, label="fit: pre-warm sample");
plt.hist(pd.DataFrame(total_sum_divergences_all).values.ravel(), bins=100, histtype="step", label="fit: with warmup");
plt.yscale("log")
plt.xscale("log")
plt.legend()
plt.title("Number of divergences, 5000 draws per chain, 8-schools non-centered (300 x 4 chains)")
for loc, spine in plt.gca().spines.items():
    if loc in ["top", "right"]:
        spine.set_visible(False)  # hide top and right spines
plt.savefig("restart_divergences", dpi=300, bbox_inches="tight")


# Same range computation for the single-run (warmup+sampling) fits.
summary_min_max_all = pd.concat(summaries_all).reset_index().groupby("index").apply(lambda x: x.max(numeric_only=True) - x.min(numeric_only=True))

print(summary_min_max)

print(summary_min_max_all)

Thanks!

Hm, you’re getting substantially more divergences than me; I can see you have the non-centered version in your code, though. When I retool my loop to run single chains as you do, with 5000 post-warmup samples, I get:

[image: divergence results from the retooled single-chain runs]

So, even though these don’t match your results in the overall number of divergences, this latest run of mine shows a far less dramatic difference than I observed previously.

I wonder if it’s related to running chains in parallel vs not somehow? Checking on that now…


@ahartikainen could you try with this data & model? I’m getting much more dramatic results with that set.

(Note: I did install cmdstanpy and tried to do it myself, but when I replace all instances of data=schools_data with data='data_for_stan.json', the warmup-only run works fine but the sampling-only run yields an error complaining that the value of the data argument needs to be a string or dict, which it obviously already is. Note also that this error goes away if I comment out supplying an init value! Another bug, possibly?)

Yes, that is a bug.

Can you read json to dict?

import json
with open("path/to/file.json") as f:
    schools_data = json.load(f)

I will try later today.

I used this line to convert the RDS to JSON:

write(toJSON(data, pretty = TRUE, auto_unbox = TRUE, digits=16), file="data_for_stan.json")

How did you transform that RDS to JSON?

On Windows, I had the first chain stuck at warmup. Now trying with Ubuntu.

I did run this on Windows. I need to test the code again with Ubuntu.

I think I can create a GitHub repo and use GitHub workflows to test long runs on different OSes.

edit.

Ok, looks like it fails with CmdStanPy too.

I will still check pystan.

edit. PyStan had fewer divergences, which is weird.


@mitzimorris could you elaborate on why this isn’t a bug? I don’t understand your abbreviated comment closing my bug report. If the issue were simply about the first sampling iteration using the initial values, then what am I doing wrong in setting them, and wouldn’t you expect only the first few iterations to be divergent? I observe a very high proportion of samples going divergent in the second data & model example I provided.


First off, apologies for the overly terse message. I tried to find a better explanation than what I offered. Perhaps the discussion here would be useful: Current state of checkpointing in Stan

This issue comes up a lot, and it is indeed something we need to figure out.

Hm, as I understand it, checkpointing is about achieving capture/reinstatement of all RNG states, which, if achieved, would yield identical samples between a run with warmup and sampling together and two runs, one with warmup and one with sampling. However, absent such checkpointing, I expected that capture/reinstatement of at least the inv_metric and step_size, plus use of the final warmup draw as inits, would at least place the sampling run in the typical set with HMC parameters that should yield not-identical-but-roughly-equivalent performance in sampling, as measured by the rate of encountering divergences, ESS, Rhat, etc. But what I observe, particularly with the second data/model I posted on the issue, is that while the combined warmup+sampling runs almost never encounter divergences, the split warmup-then-sampling runs nearly always do, and when they do, they have divergences for the majority of samples.
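Concretely, the two-stage procedure I expected to be roughly equivalent to a single warmup+sampling run is the one in the cmdstanpy script earlier in this thread, boiled down to its core here (reusing the model, schools_data, and helper functions defined in that script):

# Stage 1: warmup only, keeping the warmup draws so the final one can serve as inits.
fit_warmup = model.sample(data=schools_data, iter_warmup=1000, iter_sampling=0,
                          save_warmup=True)

# Stage 2: sampling only, reusing the adapted step size, inverse metric,
# and the final warmup draw, with adaptation switched off.
fit_samples = model.sample(data=schools_data,
                           inits=get_inits(fit_warmup),
                           step_size=get_stepsize(fit_warmup),
                           metric=get_metric(fit_warmup),
                           iter_warmup=0, adapt_engaged=False,
                           iter_sampling=5000)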

Where is my understanding faulty?

The sampler takes the initial parameter values and tries to use them, but if for some reason those inits fail to meet the constraints, it will fall back to other values, and therefore it starts in a bad place. Since you don’t do any further adaptation, it continues to have problems. Perhaps @bbbales2 has more insight as to what’s going on in these examples.

This seems like a bug to me. I looked at the scripts earlier and I definitely think it should work, and it even seems to have worked a couple of times earlier in this thread (here and here). Something is going on. Hopefully @mike-lawrence and @ahartikainen figure it out!

But the inits are coming from the final draw of the warmup, so aren’t those guaranteed to meet the model constraints?

If it sometimes works and sometimes doesn’t, then it sounds like it’s hit a difficult model/data combination. There’s a loss of precision when you dump out the draw and then read it back in again; parameter values very close to one are particularly problematic.
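A small illustration of that kind of precision loss (values invented; this assumes the default 6-significant-figure CSV output, which the sig_figs argument raises): rounding a row of a correlation-matrix Cholesky factor whose leading entry is very close to 1 can break the unit-row-norm constraint, so the written-out draw is no longer a valid cholesky_factor_corr init.

import math

# A row of a 2x2 Cholesky factor of a correlation matrix; rows must have unit norm.
rho = 0.9999997
row = [rho, math.sqrt(1 - rho**2)]
print(sum(x * x for x in row))      # 1.0 (up to floating point)

# Round each entry to 6 significant figures, as a CSV written with the
# default output precision would; rho rounds up to exactly 1.
rounded = [float(f"{x:.6g}") for x in row]
print(sum(x * x for x in rounded))  # > 1, so the constraint check can fail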

For the problematic example, has the chain properly converged during the warmup phase?

I’m using sig_figs=18 during the warmup run

There is a cholesky_factor_corr parameter that, when I look at the summary from a run with warmup and sampling done together, has a constant value of 1 in the [1,1] entry, and when I look at the inits and test whether they are precisely equal to one, the init value for that entry is indeed exactly 1.

I think so? If it weren’t, I’d expect more issues with the runs where I do warmup and sampling together. Also, I still observe the same behaviour when I bump the warmup iterations up to 1e4.

Here’s a gist containing the Stan file, the code to generate data, and the code to loop over seeds, generating new data and attempting to sample in the two ways for each seed.

The data and model are a standard hierarchical setup: 10 subjects, each observed 10 times in each of 2 conditions with common Gaussian measurement noise, and the subjects’ intercepts and condition effects modelled as multivariate normal.
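For reference, a rough sketch of that data-generating structure (all variable names, hyperparameter values, and the condition coding here are my assumptions, not taken from the gist):

import numpy as np

rng = np.random.default_rng(1)
n_subj, n_rep = 10, 10                      # 10 subjects, 10 observations per condition
mu = np.array([0.0, 0.5])                   # population intercept and condition effect (assumed)
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])  # between-subject covariance (assumed)
coefs = rng.multivariate_normal(mu, Sigma, size=n_subj)  # per-subject intercept & effect
condition = np.tile([-0.5, 0.5], n_rep)                  # 2 conditions, contrast-coded (assumed)
y = np.concatenate([
    rng.normal(b0 + b1 * condition, 1.0)    # common Gaussian measurement noise, sd assumed 1
    for b0, b1 in coefs
])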

I just ran it for 100 seeds: only about 5% of the traditional runs encountered divergences, whereas 96% of the two-stage runs had them, and lots of them:

[image: divergence counts across 100 seeds, two-stage vs. warmup+sampling runs]

In R I used cmdstanr::write_stan_json()

I tried this and got the same error (and just updated the cmdstanpy bug report to reflect this).

@ahartikainen you can ignore at least the part of this thread where I had trouble using cmdstanpy; it turned out I was still using an old version, and the latest version runs my examples fine. I’m starting a loop to confirm that the core issue of increased divergences manifests in cmdstanpy as it does with cmdstanr…

Yup, same behaviour in cmdstanpy as cmdstanr, so it seems to be something core to cmdstan itself.