Using narrower priors for SBC

ADMIN EDIT: Moved this to a new topic, as I believe this is of general interest (by @martinmodrak)

I have always wanted to discuss the need for narrower SBC priors compared to general modeling situations. This results from the fact that SBC does not include any real-world data that could mitigate the effect of the prior.

Do you think this is a disadvantage of SBC or a measure that makes calibration conservative but more robust?

Also, what is your opinion on leaving out fits with bad diagnostics? These would in most cases correspond to extreme prior parameter draws and, simultaneously, extreme y samples. It seems somewhat practical, but I am worried this intervention is too subjective and we do not have a clear standard for which fits to leave out.


Each statistical computation benchmark is a specific posterior distribution with validated posterior expectation values, not a generic model.

SBC tests the consistency of a computational method over an entire ensemble of posterior distributions generated from prior predictive data, which (usually, although not always) requires that the computational method fit all of those posteriors well. If the priors are wide then more and more posterior behaviors will have to be accommodated, which may frustrate the given computational method even if some of the posterior distributions fit fine.

I think this is an important issue in using SBC in practical applications.

In my current thinking, the underlying problem is that we are unable to express good priors. In most cases when I see a dataset in my SBC that generates divergences or has other issues, I can say “this is a priori not a reasonable dataset” (e.g. all responses being 0 in a Poisson model). In this sense my prior does not pass a prior predictive check.

However, when I try to enforce that only “good” datasets are generated by tightening the priors on parameters, the priors often end up unrealistically narrow. This hints that my prior information cannot be expressed by putting independent priors on each parameter. And while I know that some people are working on methods for expressing joint priors on all parameters, I think this is currently pretty hard to do well. So I know my prior is bad, but I also don’t know how to easily make it better. At the same time, I am usually quite sure that the wider prior - despite giving trouble to SBC - is enough regularization to fit real datasets. What I am left with are heuristics and shortcuts - which I don’t think is bad, but one has to be aware that they are no longer on such solid ground.

There are multiple available heuristics; I’ve personally tried:

  • Use the wider prior for SBC and ignore the divergences. If most fits are OK and the issues are not very serious (e.g. Rhat and ESS are good, but I have a single-digit number of divergences), I usually just ignore it. Even those small problems will almost certainly introduce slight bias in the SBC, but my hunch is that the bias will be much smaller than the resolution I have for any realistic number of simulated datasets I am able to fit. However, when the problems are more common/serious, the SBC may be heavily affected.
  • Use an unrealistically narrow prior for the SBC that makes the problems in fitting go away. I have to rely on the assumption that a good computation with narrow prior translates to good computation with wider priors (at least for the cases where my diagnostics are good). Usually I don’t think this is such a problematic assumption - many bugs/bad parametrizations/… are likely to manifest even with a narrower prior. Also passing SBC with the narrow prior is a much stronger guarantee of the model working than not doing SBC at all.
  • Rejection sampling. I formulate my a priori knowledge about what “good” data looks like as a binary condition (e.g. that both the minimum and maximum categories are observed in an ordinal model, or a minimal variance of the total response). I then reject all generated datasets that fail the check and generate new ones. This procedure can be understood as adding a (probably weird) joint prior on all the parameters. I originally believed that since my Stan model now has a different prior than the one used to generate the data, the SBC should fail given enough runs (EDIT: it turns out this does not invalidate SBC; see @nhuurre’s post below). In practice, when the rejections are not frequent, I’ve found that the resulting SBC often looks very good. In any case, tracking how many rejections you need to generate a single dataset is a good diagnostic.
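The rejection-sampling heuristic above can be sketched as follows. This is my own minimal illustration (not code from the thread), assuming a toy Poisson model with a normal(0, 2) prior on the log-rate and a prior predictive check that rejects all-zero datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_dataset(n_obs=20):
    """Draw one dataset from the prior predictive of a toy Poisson model.

    The normal(0, 2) prior on the log-rate is wide enough that
    all-zero datasets occur with non-negligible probability.
    """
    log_lambda = rng.normal(0.0, 2.0)
    y = rng.poisson(np.exp(log_lambda), size=n_obs)
    return log_lambda, y

def generate_accepted_dataset(max_tries=1000):
    """Rejection-sample until the dataset passes the prior predictive check.

    Returns the accepted draw and the number of rejections it took,
    which is worth tracking as a diagnostic.
    """
    for rejections in range(max_tries):
        theta, y = generate_dataset()
        if y.sum() > 0:  # "good" dataset: not all zeroes
            return theta, y, rejections
    raise RuntimeError("prior predictive check almost never passes")

theta, y, rejections = generate_accepted_dataset()
```

The accepted draws implicitly come from a reweighted joint prior, which is exactly the distortion discussed below in the thread.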

It actually never occurred to me to just drop the non-converging runs from the final diagnostics, as @hyunji.moon suggests - that could also be sensible (especially with the new visualisations Aki mentioned recently that don’t rely on binning). (And as @nhuurre points out below, this should not invalidate SBC for the converging subset.)

In practice, I’ve often found myself going through all of the steps: adding rejection sampling, then narrowing down the prior somewhat to avoid an unreasonable number of rejections, and then tolerating a few percent of fits that don’t converge.

Very interested in what other SBC practitioners have to say about this.


This is something that has also concerned me. Generally I have rejected fits with fewer than 0.01 effective samples per sample, simply because even though they could be used (i.e. with more samples and a corresponding thinning factor), the required computational resources make it impractical. However, as @martinmodrak points out, this kind of rejection sampling modifies the priors used in SBC, so I agree that it should not be something that happens too frequently.

Thanks for the clarification, it now makes sense to me why narrower priors are needed for the models even though the fits are well-behaved for specific posteriors.

Problem-causing fits usually start from extreme values of \tilde{y}. The following plot from this issue shows that some simulated datasets (green, y_sim#7 and orange, y_sim#20) are at least a hundred times bigger than the real-world dataset (red).

Rejecting them is computationally beneficial, as I observed that those extreme fits (y_sim#7, for instance) could take 100 times longer, but I think we should keep in mind that this additional action hurts the theoretical support from the SBC uniformity proof.


Rejection sampling doesn’t have to be a problem.
Let f\left(y\right) be the probability that the simulated dataset y is rejected (usually a 0-1 function if you have a clear idea what a “bad” dataset looks like, but could be probabilistic if you’re relying on finicky diagnostics). The important numbers are the probability of rejection for parameter \theta

L\left(\theta\right)=\int f\left(y\right)\pi\left(y|\theta\right)\mathrm{d}y

and the total rate of rejections from the prior

R=\iint f\left(y\right)\pi\left(y|\theta\right)\pi\left(\theta\right)\mathrm{d}y\mathrm{d}\theta=\int L\left(\theta\right)\pi\left(\theta\right)\mathrm{d}\theta
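To make these quantities concrete, here is a small Monte Carlo check (my own sketch, not from the post) for a toy model with prior \theta \sim normal(0, 2), a single observation y \sim normal(\theta, 1), and the rule “reject whenever y < 0”. By symmetry the total rejection rate is R = 1/2, and L(\theta) = \Phi(-\theta):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def Phi(x):
    # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n_sims = 200_000
theta = rng.normal(0.0, 2.0, size=n_sims)  # prior draws
y = rng.normal(theta, 1.0)                 # one observation per draw
rejected = y < 0                           # f(y) = 1 when y < 0

# Total rejection rate R; for this symmetric setup the exact value is 0.5
R_hat = rejected.mean()

# L(theta) = Phi(-theta): compare with the empirical rejection
# frequency in a narrow window around theta = 1
window = np.abs(theta - 1.0) < 0.05
L_hat_at_1 = rejected[window].mean()

print(R_hat, L_hat_at_1, Phi(-1.0))
```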

Rejecting the parameter draw when it generates a “bad” dataset effectively distorts the prior

\tilde{\pi}\left(\theta\right)=\frac{1-L\left(\theta\right)}{1-R}\pi\left(\theta\right)

and of course rejections change the generating distribution

\tilde{\pi}\left(y|\theta\right)=\frac{1-f\left(y\right)}{1-L\left(\theta\right)}\pi\left(y|\theta\right)

but crucially these changes cancel out when computing the posterior

\tilde{\pi}\left(\theta|y\right)\propto\tilde{\pi}\left(y|\theta\right)\tilde{\pi}\left(\theta\right)=\frac{1-f\left(y\right)}{1-R}\pi\left(y|\theta\right)\pi\left(\theta\right)\propto\pi\left(y|\theta\right)\pi\left(\theta\right)

Despite the rejections the fits target the correct posteriors and you can still expect the rank histogram to be uniform.


I admit that this felt wrong to me. So just to check my understanding is right:

Before rejections we have:

\pi(\theta | y) \propto \pi(y | \theta) \pi(\theta)

After rejections we have

\pi(\theta | y) \propto \frac{1-L(\theta)}{1-R} \pi(y | \theta) \frac{1-f(y)}{1-L(\theta)} \pi(\theta) = \frac{1-f(y)}{1-R} \pi(y | \theta) \pi(\theta)

And since \frac{1-f(y)}{1-R} does not depend on \theta, the overall posterior is the same whether we take rejection into account or not.

Also, I checked empirically with a simple example and indeed, despite heavy rejections the ranks are uniform… So my intuition was once again proven wrong :-) Now I can’t explain why I thought it was a problem in some of my SBC runs, where heavy rejections were accompanied by bad histograms…

Anyway, thanks for clarifying!

A little example to test:

library(cmdstanr)
# SBCModel, calculate_rank and plot_hist come from an early version of the SBC package
library(SBC)

stan_code <- "
data {
   int<lower=0> N;
   real y[N];
}
parameters {
   real mu;
}
model {
   mu ~ normal(0, 2);
   y ~ normal(mu, 1);
}
generated quantities {
   real mu_ = normal_rng(0, 2);
   real y_[N];
   for(n in 1:N) {
      y_[n] = normal_rng(mu, 1);
   }
}
"
n_datasets = 1000 # total number of simulated parameter sets from theta
N = 10
thin = 3

data <- list(N=N, y=as.vector(1:N))

model <- cmdstan_model(write_stan_file(stan_code))

sbc_obj <- SBCModel$new(name = "simple_test", stan_model = model)

hyperpriors <- list("mu"=function(){rnorm(1, 0, 2)})

theta_prior <- sbc_obj$sample_theta_tilde(list("mu"), n_datasets, hyperpriors)

sampled_y <- sbc_obj$sample_y_tilde(theta_prior , data=data)
theta_post <- sbc_obj$sample_theta_bar_y(sampled_y, data=data, pars=list("mu"), fit_iter = 200 * thin)
rank <- calculate_rank(theta_prior, theta_post, thin = thin)  # thinning factor of 3
plot_hist(rank, "mu")

reject <- rowSums(sampled_y) < 0

rank_reject <- calculate_rank(theta_prior[!reject, , drop = FALSE ], theta_post[!reject,,, drop = FALSE ], thin = thin)  # thinning factor of 3
plot_hist(rank_reject, "mu")

SBC is a special type of Bayesian calibration which considers posterior behavior over the entire prior predictive distribution. If the prior predictive distribution includes extreme values that you don’t think are reasonable then some part of your model is in tension with your domain expertise (otherwise you wouldn’t be able to say that anything is “extreme”). In this case the tension could arise from an overly-diffuse prior model that admits extreme model configurations or it could arise from an overly-diffuse observational model.

Truncating the observational model at a fixed threshold would remove the extreme observations and introduce only a modified normalizing constant, which can largely be ignored, including in SBC. But truncations are a rather crude way to modify the observational model and can obfuscate the important reasons why such extreme observations were supported in the first place.

Long story short – “extreme simulations” are not a problem inherent to SBC but rather to modeling. There’s no point in evaluating Bayesian calibrations with respect to a model until you’ve investigated the consequences of the assumptions encoded in that model. This is why prior checking comes early in my recommended workflow, Towards A Principled Bayesian Workflow.


A limitation of SBC is the requirement of a generative model: the prior π(θ) should be a proper distribution and one should be able to sample from the joint model π(θ, y). These conditions apply in many but not all Bayesian models in practice. Another concern is the behavior when the prior is very weak relative to the likelihood, so that a huge number of replications would be needed to test the computation in the relevant region of parameter space. This could cause problems if the computation works well for some parameter values but not for others.

As discussed in our workflow paper, it is common in practice to use weak priors, out of some combination of respect for non-Bayesian tradition and concern about missing important data features by oversmoothing. When priors are very weak, prior predictive simulation can result in datasets that result in computational challenges (for example, discrete data that are nearly all zeroes) which in turn can cause problems with SBC even if the algorithm in question would perform well on more realistic datasets. More generally, there is a tension between the goal of using SBC—or any validation procedure—to test a general algorithm, and the goal of checking that it works well for a particular example.


I’m quite late to this and my question is probably elementary, but what are typical models for which “too wide” priors lead to computational problems (for some but not all simulations)? Or what is the simplest one?

I’m sure there are better and more nuanced answers, but one answer is “literally any model”. Any model will eventually encounter numerical issues if the prior is too wide.

Here’s a quick example: Poisson regression where there’s a wide prior on the intercept. Depending on the data and the other parts of the model, if the intercept is low enough, the model will with high probability produce data that are all zeroes, and then it’s hard to estimate anything.
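A quick way to see this effect (my own sketch, with an assumed normal(0, 5) prior on the intercept): simulate the intercept-only version of such a model from the prior predictive and count how often the dataset comes out all zeroes.

```python
import numpy as np

rng = np.random.default_rng(2)

n_sims, n_obs = 10_000, 20
alpha = rng.normal(0.0, 5.0, size=n_sims)  # wide prior on the intercept
lam = np.exp(alpha)                        # Poisson rate via the log link
y = rng.poisson(lam[:, None], size=(n_sims, n_obs))

# A substantial fraction of prior predictive datasets is all zeroes,
# and fitting to such a dataset leaves the intercept barely identified
all_zero = (y.sum(axis=1) == 0).mean()
print(f"fraction of all-zero datasets: {all_zero:.2f}")
```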


Just realized one thing that is probably worth adding to the thread for people reading it in the future, despite it possibly being obvious for many:

Rejection sampling is OK (i.e. it only changes the normalizing constant of the posterior) if the probability of rejection f(y) depends solely on the observed data y. If there is a dependence on the parameters \theta (beyond the indirect influence \theta has on y), the change to the posterior is no longer just a constant (and SBC will signal an issue).
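To spell this out in the notation of @nhuurre’s post above: with a \theta-dependent rejection probability f(y, \theta), the distorted prior and generating distribution become

\tilde{\pi}\left(\theta\right)=\frac{1-L\left(\theta\right)}{1-R}\pi\left(\theta\right), \qquad \tilde{\pi}\left(y|\theta\right)=\frac{1-f\left(y,\theta\right)}{1-L\left(\theta\right)}\pi\left(y|\theta\right)

so that

\tilde{\pi}\left(\theta|y\right)\propto\frac{1-f\left(y,\theta\right)}{1-R}\pi\left(y|\theta\right)\pi\left(\theta\right)

and the factor 1-f\left(y,\theta\right) is no longer constant in \theta, so it does not cancel: the posterior you fit differs from the distorted generative process, and the rank histogram will deviate from uniformity.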


This is very useful info!!! One of the biggest challenges for me with SBC is that the weak priors generate many “near-outlier” data values that NUTS struggles with, but inference is perfectly fine with more reasonable data values.

Just to double check, I can run Simulation Based Calibration by

  1. Running N simulations
  2. Rejecting any simulations based on any arbitrary standard f(y) => accept/reject, eg reject any simulation with y < 0 or y > 10.
  3. Run inference on each of the remaining, non-rejected simulations
  4. Analyze the resulting rank histograms among the non-rejected simulations
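Under the assumptions above, the four steps can be checked end-to-end on a conjugate toy model where the posterior can be sampled exactly (my own sketch, not the thread’s code): \theta \sim normal(0, 2), y_i \sim normal(\theta, 1), rejecting any simulation with a response below 0 or above 10.

```python
import numpy as np

rng = np.random.default_rng(3)

n_sims, n_obs, n_post = 2000, 5, 99  # 99 posterior draws -> ranks 0..99
prior_sd = 2.0

ranks = []
while len(ranks) < n_sims:
    # 1. simulate parameter and dataset from the prior predictive
    theta = rng.normal(0.0, prior_sd)
    y = rng.normal(theta, 1.0, size=n_obs)
    # 2. reject on a condition that depends only on y
    if y.min() < 0.0 or y.max() > 10.0:
        continue
    # 3. "fit": the posterior is conjugate normal, so sample it exactly
    post_prec = 1.0 / prior_sd**2 + n_obs
    post_mean = y.sum() / post_prec
    draws = rng.normal(post_mean, np.sqrt(1.0 / post_prec), size=n_post)
    # 4. rank of the simulating theta among the posterior draws
    ranks.append((draws < theta).sum())

ranks = np.array(ranks)
# For a calibrated posterior the ranks are uniform on 0..n_post,
# even though many simulations were rejected
print(ranks.mean() / n_post)
```

Despite heavy rejection based on y, the mean rank fraction stays near 0.5, consistent with @nhuurre’s argument.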

I’m curious - the original SBC paper said:

A popular alternative… is to define a ground truth \tilde{\theta}, simulate data from that ground truth, \tilde{y} \sim \pi(y | \tilde{\theta}), and then quantify how well the computed posterior recovers the ground truth in some way. Unfortunately this approach is flawed…

Just to make sure I understand the subtleties here… selecting simulations based on θ is BAD and invalidates SBC, but selecting simulations on y is OK and does not invalidate the SBC results?

I assume that limiting simulations to a subset of y values limits the assurance of computational accuracy to just that subset of y.



That is IMHO reasonable. I’m currently working with others on a revamp of the SBC package for R to make all of this easier (expect some announcements soon).

Yes. Although I think the original quote is concerned not with selecting simulations based on \theta, but with the fact that simulating datasets for any particular \theta value gives only limited information, and that to get useful guarantees one needs to average over the whole prior for \theta (e.g. simulate from the prior as in SBC).

Yes, exactly. A practical corollary is that you can in fact reject simulations that produced divergences (or other fitting problems), as the probability of divergence is only a function of y. This way, you’ll still get a valid calibration check for all datasets that do not produce divergences - which is often all the datasets we care about, because if you get a divergence, you know you can’t really trust the estimates anyway.
