Rewriting example models with SBC

I rewrote the Rats model (the first example model in Volume I of the BUGS examples) to get a feel for what doing it the right way looks like. I’m trying to get it to pass simulation-based calibration (SBC) as written; SBC is the revision of the Cook-Gelman-Rubin diagnostics.

There are three problems I see.

  1. The model itself is so ugly as to be unreadable with all the variable scaling. The multiplier/offset thing’s going to save a lot of headaches.

  2. It takes hours to run Stan with tight enough settings and with enough iterations that it passes SBC as written. This was with 1000 saved iterations at thinning 5 and 256 simulations. I set up chi-squared tests for uniformity.

    • I could use some guidance on iterations, thinning, and tests for uniformity here. The Talts et al. arXiv paper just has you eyeball histograms! I need something that can be automated (see the sketch after this list for one candidate). This is going to have to be put on a cluster if we want to run them all in some kind of development cycle.
    • My R code’s terrible, but all the time’s spent in Stan fitting. This relatively simple hierarchical model takes a minute or two to fit on my notebook when I crank adapt_delta up to 0.99. Part of the problem is probably the super-wide priors. I tightened them considerably from BUGS, but didn’t want to go below normal(0, 2 * theta) for a weakly informative prior on theta. The signs aren’t controlled on any of the variables.
  3. It’s going to wind up taking at least a couple hours per model to revise, find the right settings, and validate. There are hundreds of example models, so this is something like a 500 hour project to revise all the models. That’s 1/4 of a full-time year doing nothing else. So nobody should be holding their breath for this.

    • It may be better to start from zero and add models that work rather than trying to deal with an unvalidated, sprawling set of 200 examples that we don’t even know whether Stan fits properly with NUTS.
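
Here’s the kind of automated check I have in mind for the uniformity question in item 2 (a sketch only; the bin count and the 0.05 threshold are arbitrary choices, not recommendations): bin the ranks and run a plain chi-squared test of the bin counts against equal expected counts.

# Sketch: automated uniformity check for SBC ranks.
# ranks_k: integer vector of length M with values in 1:(L + 1), where L is the
# number of (thinned) posterior draws kept per fit; ideally choose L and B so
# that B divides L + 1 exactly.
sbc_rank_pvalue <- function(ranks_k, L, B = 20) {
  stopifnot(all(ranks_k >= 1 & ranks_k <= L + 1))
  bins <- cut(ranks_k, breaks = seq(0.5, L + 1.5, length.out = B + 1))
  observed <- table(bins)   # counts per bin, zero-count bins included
  chisq.test(as.vector(observed), p = rep(1 / B, B))$p.value
}

# e.g., flag a parameter whose ranks look non-uniform at an (arbitrary) 0.05 level
# if (sbc_rank_pvalue(ranks[1, ], L = 1000) < 0.05) warning("mu_alpha: non-uniform ranks")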

The “good” Rats program.

data {
  int<lower=0> N;
  int<lower=0> T;
  vector[T] x;
  matrix[N, T] y;
}
transformed data {
  real xbar = mean(x);
  vector[T] x_ctr = (x - xbar);
}
parameters {
  vector[N] alpha_std;
  real mu_alpha_std;
  real<lower = 0> sigma_alpha_std;

  vector[N] beta_std;
  real mu_beta_std;  // beta.c in original
  real<lower = 0> sigma_beta;

  real<lower = 0> sigma_y_std;
}
transformed parameters {
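  // rescale the standardized parameters back to their natural scales (non-centered parameterization)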
  real mu_alpha = mu_alpha_std * 500;
  real sigma_alpha = sigma_alpha_std * 50;
  real mu_beta = mu_beta_std * 12;
  real sigma_y = sigma_y_std * 12;
  vector[N] alpha = mu_alpha + sigma_alpha * alpha_std;
  vector[N] beta = mu_beta + sigma_beta * beta_std;
}
model {
  mu_alpha_std ~ normal(0, 1);
  sigma_alpha_std ~ normal(0, 1);
  alpha_std ~ normal(0, 1);

  mu_beta_std ~ normal(0, 1);
  sigma_beta ~ normal(0, 1);
  beta_std ~ normal(0, 1);

  sigma_y_std ~ normal(0, 1);

  for (t in 1:T)
    y[ , t] ~ normal(alpha + beta * x_ctr[t], sigma_y);
}
generated quantities {
  real alpha0 = mu_alpha - xbar * mu_beta;
}

Not good. Now here’s my SBC code (spoiler alert: also not good). I don’t know how to automate the variable names to check here, so I just pulled out the five they talked about in the BUGS writeup and monitored them by hand.

I just copied the data over—it should be sourced into scope or even better, put in an encapsulated scope rather than polluting the top level.

library(rstan)

N <- 30
T <- 5
x <- c(8, 15, 22, 29, 36)
xbar <- 22
x_ctr <- x - xbar

rats_model <- stan_model("rats.stan")

M <- 256
K <- 5
ranks <- matrix(NA, K, M)
names <- c("mu_alpha", "sigma_alpha", "mu_beta", "sigma_beta", "sigma_y")

for (m in 1:M) {
  print(sprintf("m = %d", m), quote = FALSE)
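
  # draw the parameters from their priors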
  mu_alpha <- rnorm(1, 0, 500)
  sigma_alpha <- abs(rnorm(1, 0, 50))
  alpha <- rnorm(N, mu_alpha, sigma_alpha)

  mu_beta <- rnorm(1, 0, 12)
  sigma_beta <- abs(rnorm(1, 0, 1))
  beta <- rnorm(N, mu_beta, sigma_beta)

  sigma_y <- abs(rnorm(1, 0, 12))

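  # simulate data from the prior predictive distribution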
  y <- matrix(NA, N, T)
  for (t in 1:T)
    y[1:N, t] <- rnorm(N, alpha + beta * x_ctr[t], sigma_y)

  data_sim <- list(N = N, T = T, x = x, y = y)

  fit <- sampling(rats_model, data = data_sim,
                  chains = 1, warmup = 2000, iter = 7000,
                  thin = 5,
                  control = list(adapt_delta = 0.99),
                  seed = 1234, refresh = 0)
  fit_ss <- extract(fit)
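  # rank of each prior draw among its 1000 posterior draws (values in 1:1001)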
  ranks[1, m] <- rank(c(mu_alpha, fit_ss$mu_alpha))[1]
  ranks[2, m] <- rank(c(sigma_alpha, fit_ss$sigma_alpha))[1]
  ranks[3, m] <- rank(c(mu_beta, fit_ss$mu_beta))[1]
  ranks[4, m] <- rank(c(sigma_beta, fit_ss$sigma_beta))[1]
  ranks[5, m] <- rank(c(sigma_y, fit_ss$sigma_y))[1]
}

library(spgs)

for (k in 1:K) {
  p_value <- chisq.unif.test(ranks[k, 1:M] / 1001)$p.value
  print(sprintf("%s:  p value = %10.8f",
                names[k], p_value),
        quote = FALSE)
}

Also, I used the library spgs, which was the first Google hit for a chi-squared uniformity test in R that worked on my data.

Here’s the output:

[1] mu_alpha:  p value = 0.83118519
[1] sigma_alpha:  p value = 0.28289316
[1] mu_beta:  p value = 0.06181241
[1] sigma_beta:  p value = 0.93033231
[1] sigma_y:  p value = 0.87538122

So it all looks OK.
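
On not knowing how to automate the variable names: something like this sketch, run inside the simulation loop in place of the five hard-coded rank lines, would pull whatever scalar parameters extract() reports rather than naming them by hand. Here theta_true is a hypothetical named list holding the current simulation’s prior draws (e.g., theta_true$mu_alpha); it isn’t defined in the code above.

# Sketch: compute ranks for every scalar parameter the fit reports instead of
# hard-coding the five names; `theta_true` is a hypothetical named list of the
# prior draws for the current simulation, and lp__ is dropped by name.
fit_ss <- extract(fit)
is_scalar <- sapply(fit_ss, function(d) is.null(dim(d)) || length(dim(d)) == 1)
scalar_pars <- intersect(setdiff(names(fit_ss)[is_scalar], "lp__"), names(theta_true))
ranks_m <- sapply(scalar_pars, function(p) rank(c(theta_true[[p]], fit_ss[[p]]))[1])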

P.S. Here’s the Rats program revised with multiplier/offset scaling. The variable declarations become clunky and hard to read, but the model block is much more readable.

data {
  int<lower=0> N;
  int<lower=0> T;
  vector[T] x;
  matrix[N, T] y;
}
transformed data {
  real xbar = mean(x);
  vector[T] x_ctr = (x - xbar);
}
parameters {
  real<multiplier = 500> mu_alpha;
  real<lower = 0><multiplier = 50> sigma_alpha;
  vector<offset = mu_alpha, multiplier = sigma_alpha>[N] alpha;

  real<multiplier = 12> mu_beta;
  real<lower = 0><multiplier = 1> sigma_beta;
  vector<offset = mu_beta, multiplier = sigma_beta>[N] beta;

  real<lower = 0><multiplier = 12> sigma_y;
}
model {
  mu_alpha ~ normal(0, 500);
  sigma_alpha ~ normal(0, 50);
  alpha ~ normal(mu_alpha, sigma_alpha);

  mu_beta ~ normal(0, 12);
  sigma_beta ~ normal(0, 1);
  beta ~ normal(mu_beta, sigma_beta);

  sigma_y ~ normal(0, 12);

  for (t in 1:T)
    y[ , t] ~ normal(alpha + beta * x_ctr[t], sigma_y);
}
generated quantities {
  real alpha0 = mu_alpha - xbar * mu_beta;
}

I have just coded up SBC, which I use as an example for demonstrating the R package batchtools. The benefit of that package is that you can easily set up the simulation-and-fitting pipeline on your computer and then migrate to a cluster to do the real work. I am happy to share that part.
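
Roughly, the pattern is the following (a sketch, not my actual code; the registry directory and the fit_one_sbc() helper are placeholders for “simulate parameters and data, fit with Stan, return the ranks”):

library(batchtools)

# Sketch only: one SBC replication per job; fit_one_sbc(m) is a placeholder.
reg <- makeRegistry(file.dir = "sbc_registry", seed = 1)
batchMap(fun = fit_one_sbc, m = 1:256, reg = reg)

# Runs locally by default; pointing the registry at a cluster backend
# (e.g. Slurm) is a configuration change, not a code change.
submitJobs(reg = reg)
waitForJobs(reg = reg)
ranks <- reduceResultsList(reg = reg)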

… BTW, doing so I discovered that rstanarm rejects binomial data with no responses… need to file an issue.

I completely agree about the awkward state of the example models. Not only are they written with the parameters all on varied scales, the priors are in general way too diffuse, and there has been no systematic verification that the models are reasonably well identified and amenable to fitting. I think the most useful step forward is to go through one or two case studies demonstrating the process of analyzing and updating an example model, and hope that they provide enough guidance for enterprising members of the community to emulate. Then we can transfer models to a “verified” folder as they’re updated.

Regarding implementation, the SBC code I use in my courses is the one in my workflow case studies, https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html (RStan) and https://github.com/betanalpha/jupyter_case_studies/blob/master/principled_bayesian_workflow/principled_bayesian_workflow.ipynb (PyStan). Because the example model is small, the startup cost of parallelizing the four chains within each fit ends up being prohibitive, so I serialized the chains and parallelized over the simulations instead. I think the optimal parallelization strategy for larger models is a giant open question.

A robust thinning strategy is going to be hard and will probably require dynamically thinning each fit within each simulation to avoid having to rerun the entire ensemble over and over. @avehtari has updated recommendations for thinning strategies in the latest draft of the paper, but as far as I know all of the SBC examples in the paper and in my material use heuristic thinning.

Finally, we intentionally didn’t include a numerical diagnostic of uniformity in the paper. All of the usual tests either suffer from super high false discovery rates or super low true discovery rates, and none were effective at isolating the systematic structure that’s obvious in the visualized histograms. This is still an open problem, but for now I’ve been sticking to the histograms.


Yeah, if you look at the histograms in the paper for 8 schools NCP without thinning, it’s hard to find a numerical test that fails on them, and I guess we think that case should fail.

What’s the goal with SBC here? Is it to flag bad models or to aid in fixing them? If it’s to flag bad models, we could use a really coarse-grained approach, something like 10 posterior samples, 50 replications, and then a chi-squared uniformity test or something similar. I think generalized recommendations are tough, but if you don’t care about MCMC error, you could play around with 8 schools NCP vs. 8 schools CP and see how few you need to tell that the latter is a bad model. Worth noting that divergences also often identify models that are that bad.
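
To make that concrete, the screening test itself could be as small as the sketch below, where coarse_ranks stands in for the length-50 vector of ranks (each in 1:11) produced by fits that keep only 10 posterior draws:

# Sketch: coarse SBC screen with L = 10 kept posterior draws and M = 50 replications.
# `coarse_ranks` is a placeholder for the integer vector of 50 ranks in 1:11.
screen_pvalue <- function(coarse_ranks, L = 10) {
  observed <- tabulate(coarse_ranks, nbins = L + 1)
  # simulated p-value because the expected counts (50 / 11) are small
  chisq.test(observed, p = rep(1 / (L + 1), L + 1), simulate.p.value = TRUE)$p.value
}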

In case anyone wants to see it, the original parallelized R and Python code is in the paper repo: https://github.com/seantalts/simulation-based-calibration. I like the Python code more or less (though I sort of cheated and used a helper package for parallelization called pathos), but the R code is a mess, sorry about that.


Hi, I have a comment and a question.

The comment is that I do think the model looks much cleaner with the scaled parameterization, and it will look even cleaner once we can fold the priors into the declarations. Right now, one reason that even the cleaner code looks messy is that there is duplication (the “500” that’s in the declaration of mu_alpha and also in the prior; the “50” that’s in both places for sigma_alpha, the “12” that’s in both places for mu_beta, etc.). So we should talk about how to best set up the language to code these sorts of things cleanly. Beyond making code easier to read, I’m sure it will dramatically reduce the incidence of bugs in this sort of code.

My question is why this particular model is so difficult to run. I’ve never looked at this example, so I don’t really have a sense of what’s going on here.

I, for one, would prefer if everyone else would adhere to my conventions, which are

  1. Draw from the prior distributions in the transformed data block of the Stan program, appending a trailing underscore to each symbol name
  2. Draw from the prior predictive distribution in the transformed data block rather than passing the outcome into the data block
  3. Declare the parameters (or transformed parameters if applicable) in the same order as the declarations in the transformed data block
  4. Calculate the inequalities in the generated quantities block under the symbol name ranks_ (open to a different name, since at this point these are just 0s and 1s rather than ranks)

That would currently look like

data {
  int<lower=0> N;
  int<lower=0> T;
  vector[T] x;
}
transformed data {
  real xbar = mean(x);
  vector[T] x_ctr = x - xbar;

  // parameter realizations
  real sigma_beta_ = fabs(normal_rng(0, 1));
  real mu_alpha_ = normal_rng(0, 500);
  real sigma_alpha_ = fabs(normal_rng(0, 50));
  real mu_beta_ = normal_rng(0, 12);
  real sigma_y_ = fabs(normal_rng(0, 12));
  matrix[N, N] I = diag_matrix(rep_vector(1, N));
  vector[N] zeros = rep_vector(0, N);
  vector[N] alpha_ = mu_alpha_ + sigma_alpha_ * multi_normal_cholesky_rng(zeros, I);
  vector[N] beta_ = mu_beta_ + sigma_beta_ * multi_normal_cholesky_rng(zeros, I);
  matrix[N, T] y; // prior predictive realizations
  for (n in 1:N) for (t in 1:T) 
    y[n, t] = normal_rng(alpha_[n] + beta_[n] * x_ctr[t], sigma_y_);
}
parameters {
  vector[N] alpha_std;
  real mu_alpha_std;
  real<lower = 0> sigma_alpha_std;

  vector[N] beta_std;
  real mu_beta_std;  // beta.c in original
  real<lower = 0> sigma_beta;

  real<lower = 0> sigma_y_std;
}
transformed parameters {
  real mu_alpha = mu_alpha_std * 500;
  real sigma_alpha = sigma_alpha_std * 50;
  real mu_beta = mu_beta_std * 12;
  real sigma_y = sigma_y_std * 12;
  vector[N] alpha = mu_alpha + sigma_alpha * alpha_std;
  vector[N] beta = mu_beta + sigma_beta * beta_std;
}
model {
  mu_alpha_std ~ normal(0, 1);
  sigma_alpha_std ~ normal(0, 1);
  alpha_std ~ normal(0, 1);

  mu_beta_std ~ normal(0, 1);
  sigma_beta ~ normal(0, 1);
  beta_std ~ normal(0, 1);

  sigma_y_std ~ normal(0, 1);

  for (t in 1:T)
    y[ , t] ~ normal(alpha + beta * x_ctr[t], sigma_y);
}
generated quantities {
  real alpha0 = mu_alpha - xbar * mu_beta;
  int ranks_[5 + 2 * N];
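  // 0/1 indicators per draw; summing over (thinned) draws gives each realization's rank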
  ranks_[1] = sigma_beta > sigma_beta_;
  ranks_[2] = mu_alpha > mu_alpha_;
  ranks_[3] = sigma_alpha > sigma_alpha_;
  ranks_[4] = mu_beta > mu_beta_;
  ranks_[5] = sigma_y > sigma_y_;
  for (n in 1:N) {
    ranks_[5 + n] = alpha[n] > alpha_[n];
    ranks_[5 + n + N] = beta[n] > beta_[n];
  }
}

Then it is fairly easy to calculate / graph from any interface besides CmdStan. In R, it would be

library(rstan)
library(parallel)
sm <- stan_model("rats.stan")
dat <- list(N = 30, T = 5, x = c(8, 15, 22, 29, 36))
ranks_ <- mclapply(1:256, mc.cores = detectCores(), FUN = function(m) {
  extract(sampling(sm, data = dat, seed = m, chains = 1L, refresh = 0L), 
          pars = "ranks_", permuted = FALSE)[,1,]
  }) # seed = m to get reproducibility w/o duplicated parameter realizations
thinner <- seq(from = 1, to = 1000, by = 5) # can tweak
u <- t(sapply(ranks_, FUN = function(r) 1 + colSums(r[thinner,])))
colnames(u) <- c("sigma_beta", "mu_alpha", "sigma_alpha", "mu_beta", "sigma_y",
                 paste0("alpha_", 1:dat$N), paste0("beta_", 1:dat$N))

par(mfrow = c(5, 13), mar = c(4,2,1,1) + .1, las = 1)
for (j in colnames(u)) hist(u[,j], main = "", xlab = j, ylab = "", ylim = c(0,35))

When the model is good, comment out those parts of transformed data and generated quantities, put matrix[N, T] y back into the data block, and push.


The design Sean and I used is not incidental. It allows the prior predictive draws to be saved and then inspected for prior verification after the fact. If the only goal is SBC then the process can be streamlined by using the transformed data block, but given the poor priors on these models I argue that prior examination through analysis of the prior predictive distribution would be helpful in going through the example models.

I’m pretty agnostic on how to do this aside from some concerns about efficiency.

It’d be nice if we could base it all on reusable SBC code. But that code needs to provide some kind of automated checks; the answer can’t be to dump out 20,000 histograms for human inspection. I’d rather have weak tests (that fail to reject in some situations) than ones with too many false alarms.

I think the plan is for this all to go into some big automated test procedure that might find some problems. Lack of failure != acceptance, as usual.


I made a PR to rstan to add an SBC function and a vignette. @betanalpha makes a good point that you may want to look at the prior and/or the prior predictive distribution while you are establishing whether the Stan program is well calibrated, so I added to my conventions:

  • Realizations of unknown parameters in the transformed data block get copied in the generated quantities block to a vector named pars_
  • Realizations of the prior predictive distribution in the transformed data block can get copied in the generated quantities block to an object with the same dimensions named y_

Go ahead and push changes that you all want to see onto that branch. Also, @jonah should improve the plot method.
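
A rough sketch of the intended usage (the exact argument names may still shift before the PR is merged):

# Rough usage sketch for the proposed rstan SBC function; argument names are provisional.
library(rstan)
sm <- stan_model("rats.stan")  # written with the conventions above
dat <- list(N = 30, T = 5, x = c(8, 15, 22, 29, 36))
out <- sbc(sm, data = dat, M = 256)  # M = number of simulations
plot(out)  # rank histograms, via the plot method mentioned above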
