Using narrower priors for SBC

ADMIN EDIT: Moved this to a new topic, as I believe this is of general interest (by @martinmodrak)

I have always wanted to discuss the need for narrower SBC priors compared to general modeling situations. This need arises because SBC does not include any real-world data that could mitigate the effect of the prior.

Do you think this is a disadvantage of SBC or a measure that makes calibration conservative but more robust?

Also, what is your opinion on leaving out fits with bad diagnostics? In most cases these would correspond to extreme prior parameter draws and the simulated y values they generate. It seems somewhat practical, but I am worried this intervention is too subjective and we do not have a clear standard for which fits to leave out.


Each statistical computation benchmark is a specific posterior distribution with validated posterior expectation values, not a generic model.

SBC tests the consistency of a computational method over an entire ensemble of posterior distributions generated from prior predictive data, which (usually, although not always) requires that the computational method fit all of those posteriors well. If the priors are wide then more and more posterior behaviors will have to be accommodated, which may frustrate the given computational method even if some of the individual posterior distributions fit fine.
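
As a concrete illustration of this procedure, here is a minimal SBC loop in Python for a toy conjugate normal model (my own example, not taken from the thread). Conjugacy lets us draw exact posterior samples, so the resulting ranks should be uniform:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (an assumed example): mu ~ Normal(0, 2), y_i ~ Normal(mu, 1).
# The exact conjugate posterior stands in for a fitted MCMC posterior.
N, n_sims, n_post = 10, 1000, 99
prior_sd, obs_sd = 2.0, 1.0

ranks = []
for _ in range(n_sims):
    mu_tilde = rng.normal(0, prior_sd)               # draw from the prior
    y_tilde = rng.normal(mu_tilde, obs_sd, size=N)   # prior predictive dataset
    # Exact conjugate posterior for mu given y_tilde:
    post_prec = 1 / prior_sd**2 + N / obs_sd**2
    post_mean = (y_tilde.sum() / obs_sd**2) / post_prec
    post_draws = rng.normal(post_mean, post_prec**-0.5, size=n_post)
    # SBC rank statistic: position of the prior draw among posterior draws.
    ranks.append(int(np.sum(post_draws < mu_tilde)))

# For a calibrated computation, ranks are uniform on {0, ..., n_post}.
```

With a real model, the exact posterior draws would be replaced by (thinned) MCMC draws, and any miscalibration of the sampler would show up as a non-uniform rank histogram.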

I think this is an important issue in using SBC in practical applications.

In my current thinking, the underlying problem is that we are unable to express good priors. In most cases when I see a dataset in my SBC that generates divergences or has other issues, I can say “this is a priori not a reasonable dataset” (e.g. all responses being 0 in a Poisson model). In this sense my prior does not pass a prior predictive check.

However, when I try to enforce that only “good” datasets are generated by tightening the priors on parameters, the priors often end up unrealistically narrow. This hints that my prior information cannot be expressed by putting independent priors on each parameter. And while I know that some people are working on methods for expressing joint priors on all parameters, I think this is currently pretty hard to do well. So I know my prior is bad, but I also don’t know how to easily make it better. I am also usually quite sure that the wider prior, despite giving trouble to SBC, is enough regularization to fit real datasets. What I am left with are heuristics and shortcuts, which I don’t think is bad, but one has to be aware that they are no longer on such solid ground.

There are multiple available heuristics; I’ve personally tried:

  • Use the wider prior for SBC and ignore the divergences. If most fits are OK and the issues are not very serious (e.g. Rhat and ESS are good, but I have a single-digit number of divergences), I usually just ignore it. Even those small problems will almost certainly introduce a slight bias in the SBC, but my hunch is that the bias will be much smaller than the resolution I have for any realistic number of simulated datasets I am able to fit. However, when the problems are more common/serious, the SBC may be heavily affected.
  • Use an unrealistically narrow prior for the SBC that makes the problems in fitting go away. I have to rely on the assumption that a good computation with narrow prior translates to good computation with wider priors (at least for the cases where my diagnostics are good). Usually I don’t think this is such a problematic assumption - many bugs/bad parametrizations/… are likely to manifest even with a narrower prior. Also passing SBC with the narrow prior is a much stronger guarantee of the model working than not doing SBC at all.
  • Rejection sampling. I formulate my a priori knowledge about what “good” data looks like as a binary condition (e.g. that both minimum and maximum categories are observed in an ordinal model, or a minimal variance of the total response). I then reject all generated datasets that fail the check and generate new ones. This procedure can be understood as adding a (probably weird) joint prior on all the parameters. My original worry was that the Stan model then has a different prior than the one I use to generate the data, so the SBC should fail given enough runs; EDIT: it turns out this does not invalidate SBC (see @nhuurre’s post below). In any case, when the rejections are not frequent, I’ve found that the resulting SBC often looks very good. Anyway, tracking how many rejections you need to generate a single dataset is a good diagnostic.
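
The rejection-sampling heuristic can be sketched as follows (a toy Poisson example of my own, echoing the “all responses being 0” case above; the prior and check are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def acceptable(y):
    # Hypothetical a priori check: an all-zero Poisson response is
    # deemed "not a reasonable dataset".
    return y.sum() > 0

def generate_dataset(n_obs=20, max_tries=10_000):
    """Draw (lambda, y) from the prior predictive, rejecting datasets
    that fail the check. Also return the number of rejections, which
    is worth tracking as a diagnostic."""
    n_rejected = 0
    for _ in range(max_tries):
        lam = rng.lognormal(mean=-1.0, sigma=1.0)  # assumed prior on the rate
        y = rng.poisson(lam, size=n_obs)
        if acceptable(y):
            return lam, y, n_rejected
        n_rejected += 1
    raise RuntimeError("prior is in strong tension with the acceptance check")

lam, y, n_rejected = generate_dataset()
```

If `n_rejected` is routinely large, that is itself a sign that the prior and the acceptance condition are in strong tension.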

It actually never occurred to me to just drop the non-converging runs from the final diagnostics, as @hyunji.moon suggests. That could also be sensible (especially with the new visualisations that Aki mentioned recently that don’t rely on binning), and as @nhuurre points out below, it should not invalidate SBC for the converging subset.

In practice, I’ve often found myself going through all of the steps: adding rejection sampling, then narrowing the prior somewhat to avoid an unreasonable number of rejections, and then tolerating a few percent of fits that don’t converge.

Very interested in what other SBC practitioners have to say about this.


This is something that has also concerned me. Generally I have rejected fits with < 0.01 effective samples per sample, simply because even though they could be used (i.e. with more samples and a corresponding thinning factor), the required computational resources make them impractical. However, as @martinmodrak points out, this kind of rejection sampling modifies the priors used in SBC, so I agree that it should not be something that happens too frequently.

Thanks for the clarification; it now makes sense to me why narrower priors are needed for the models even though the fits are well-behaved for specific posteriors.

Problem-causing fits usually start from extreme values of \tilde{y}. The following plot from this issue shows that simulated datasets (green, y_sim#7, and orange, y_sim#20) are sometimes at least a hundred times bigger than the real-world dataset (red).

Rejecting them is computationally beneficial, as I observed that those extreme fits (y_sim#7 for instance) could take 100 times longer, but I think we should keep in mind that this additional action hurts the theoretical support from the SBC uniformity proof.


Rejection sampling doesn’t have to be a problem.
Let f\left(y\right) be the probability that the simulated dataset y is kept, i.e. passes your check (usually a 0-1 function if you have a clear idea of what a “bad” dataset looks like, but it could be probabilistic if you’re relying on finicky diagnostics). The important quantities are the acceptance probability for parameter \theta

L\left(\theta\right)=\int f\left(y\right)\pi\left(y|\theta\right)\mathrm{d}y

and the overall acceptance rate under the prior

R=\iint f\left(y\right)\pi\left(y|\theta\right)\pi\left(\theta\right)\mathrm{d}y\mathrm{d}\theta=\int L\left(\theta\right)\pi\left(\theta\right)\mathrm{d}\theta
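
Both integrals are easy to approximate by Monte Carlo. Here is a sketch for a toy normal model of my own choosing, with the dataset kept iff sum(y) >= 0 (all names and numbers here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: mu ~ Normal(0, 2), y_i ~ Normal(mu, 1), N = 10,
# with f(y) = 1 if sum(y) >= 0 else 0 (the dataset is kept iff f(y) = 1).
N, M = 10, 5000

def L_hat(mu):
    """Monte Carlo estimate of L(mu) = E[f(y) | mu]."""
    y = rng.normal(mu, 1, size=(M, N))
    return float(np.mean(y.sum(axis=1) >= 0))

# By symmetry L(0) should be about 0.5, and L(mu) -> 1 for large mu.
L0 = L_hat(0.0)
L3 = L_hat(3.0)

# R = E[L(mu)] under the prior, estimated by nesting the two samplers.
mus = rng.normal(0, 2, size=1000)
R_hat = float(np.mean([L_hat(mu) for mu in mus]))
# By symmetry of this toy model, R should also be about 0.5.
```
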

Rejecting the parameter draw when it generates a “bad” dataset effectively distorts the prior

\tilde{\pi}\left(\theta\right)=\frac{L\left(\theta\right)}{R}\pi\left(\theta\right)

and of course rejections change the generating distribution

\tilde{\pi}\left(y|\theta\right)=\frac{f\left(y\right)}{L\left(\theta\right)}\pi\left(y|\theta\right)

but crucially these changes cancel out when computing the posterior

\tilde{\pi}\left(\theta|y\right)\propto\tilde{\pi}\left(y|\theta\right)\tilde{\pi}\left(\theta\right)=\frac{f\left(y\right)}{L\left(\theta\right)}\pi\left(y|\theta\right)\frac{L\left(\theta\right)}{R}\pi\left(\theta\right)=\frac{f\left(y\right)}{R}\pi\left(y|\theta\right)\pi\left(\theta\right)\propto\pi\left(\theta|y\right)

Despite the rejections the fits target the correct posteriors and you can still expect the rank histogram to be uniform.
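
To make the cancellation concrete, here is a small Python analogue (my own construction, mirroring the R/Stan example later in the thread): datasets with sum(y) < 0 are discarded, the fit still uses the original prior, and the ranks remain uniform:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model as in the Stan example later in the thread:
# mu ~ Normal(0, 2), y_i ~ Normal(mu, 1), N = 10.
# Reject any simulated dataset with sum(y) < 0, but fit with the
# ORIGINAL prior (exact conjugate posterior, so no MCMC error).
N, n_sims, n_post = 10, 2000, 99
prior_prec, obs_prec = 1 / 2**2, 1.0

ranks = []
for _ in range(n_sims):
    while True:
        mu_tilde = rng.normal(0, 2)
        y = rng.normal(mu_tilde, 1, size=N)
        if y.sum() >= 0:   # f(y) = 1: keep this dataset
            break
    post_prec = prior_prec + N * obs_prec
    post_mean = obs_prec * y.sum() / post_prec
    post = rng.normal(post_mean, post_prec**-0.5, size=n_post)
    ranks.append(int(np.sum(post < mu_tilde)))

# Roughly half of all simulations are rejected, yet the ranks are
# still uniform on {0, ..., n_post}: the two distortions cancel.
```
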


I admit that this felt wrong to me. So just to check my understanding is right:

Before rejections we have:

\pi(\theta | y) \propto \pi(y | \theta) \pi(\theta)

After rejections we have

\pi(\theta | y) \propto \frac{L(\theta)}{R} \pi(y | \theta) \frac{f(y)}{L(\theta)} \pi(\theta) = \frac{f(y)}{R} \pi(y | \theta) \pi(\theta)

And since \frac{f(y)}{R} does not depend on \theta, the overall posterior is the same whether we take rejection into account or not.

Also I did check empirically with a simple example and indeed, despite heavy rejections the ranks are uniform… So my intuition was once again proven bad :-) Now I can’t explain why I thought it was a problem in some of my SBC runs, where heavy rejections were accompanied by bad histograms…

Anyway, thanks for clarifying!

Little example to test

stan_code <- "
data {
   int<lower=0> N;
   real y[N];
}
parameters {
   real mu;
}
model {
   mu ~ normal(0, 2);
   y ~ normal(mu, 1);
}
generated quantities {
   real mu_ = normal_rng(0, 2);
   real y_[N];
   for(n in 1:N) {
      y_[n] = normal_rng(mu, 1);
   }
}
"
library(cmdstanr)  # for write_stan_file() and cmdstan_model()
# SBCModel, sample_theta_tilde() etc. come from the SBC package used in this thread

n_datasets <- 1000  # total number of simulated parameter sets from theta
N <- 10
thin <- 3

data <- list(N = N, y = as.vector(1:N))

model <- cmdstan_model(write_stan_file(stan_code))

sbc_obj <- SBCModel$new(name = "simple_test", stan_model = model)

hyperpriors <- list("mu"=function(){rnorm(1, 0, 2)})

theta_prior <- sbc_obj$sample_theta_tilde(list("mu"), n_datasets, hyperpriors)

sampled_y <- sbc_obj$sample_y_tilde(theta_prior , data=data)
theta_post <- sbc_obj$sample_theta_bar_y(sampled_y, data=data, pars=list("mu"), fit_iter = 200 * thin)
rank <- calculate_rank(theta_prior, theta_post, thin = thin)  # thinning factor of 3
plot_hist(rank, "mu")

reject <- rowSums(sampled_y) < 0

rank_reject <- calculate_rank(theta_prior[!reject, , drop = FALSE ], theta_post[!reject,,, drop = FALSE ], thin = thin)  # thinning factor of 3
plot_hist(rank_reject, "mu")

SBC is a special type of Bayesian calibration which considers posterior behavior over the entire prior predictive distribution. If the prior predictive distribution includes extreme values that you don’t think are reasonable then some part of your model is in tension with your domain expertise (otherwise you wouldn’t be able to say that anything is “extreme”). In this case the tension could arise from an overly-diffuse prior model that admits extreme model configurations or it could arise from an overly-diffuse observational model.

Truncating the observational model at a fixed threshold would remove the extreme observations and introduce only a modified normalizing constant, which can largely be ignored, including in SBC. But truncations are a rather crude way to modify the observational model and can obfuscate the important reasons why such extreme observations were supported in the first place.

Long story short – “extreme simulations” are not a problem inherent to SBC but rather to modeling. There’s no point in evaluating Bayesian calibrations with respect to a model until you’ve investigated the consequences of the assumptions encoded in that model. This is why prior checking comes first in my recommended workflow, Towards A Principled Bayesian Workflow.
