Best practices for Simulation Based Calibration with hierarchical models

Hello everybody,

I’m currently studying Simulation Based Calibration and I have a few questions for which, as far as I can tell, I haven’t found clear answers.
I haven’t included my code since I’m just using well-known models (8 schools, centered/non-centered) and I wanted to keep my post manageable, as it’s already quite heavy.

If after all my results you feel like it’d be relevant to have more details about my implementation, you can find the notebook I’m currently using here:

I’m comparing the results obtained through Simulation Based Calibration for the centered and non-centered 8 schools models. I’m assessing them by checking the uniformity of the ranks using plots and also \chi^2 tests.
I compute the \chi^2 of each dimension of each parameter separately, as advised in the Cook, Gelman, and Rubin paper, and in the end I average the \chi^2 obtained per dimension to get a global \chi^2 for each parameter.
My long-term objective is to compare other methods (like SMC samplers, for example) against NUTS using Simulation Based Calibration, but I want to make a fair judgment by using the best possible setting of Stan’s NUTS with SBC. The idea is to generalise the evaluation using SBC to models that may involve high dimensions in the future.
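For concreteness, the per-parameter \chi^2 check described above might look like the sketch below (all names are illustrative, and the toy rank array stands in for ranks computed from the real 8-schools fits; note that choosing the number of bins so that it divides the number of possible rank values keeps the expected bin counts equal):

```python
import numpy as np
from scipy import stats

def chi2_uniformity(ranks, n_draws, n_bins=20):
    """Chi-square test of SBC rank uniformity for one scalar parameter.

    ranks   : array of SBC ranks, one per simulated data set, each in
              {0, ..., n_draws}.
    n_draws : posterior draws per fit; (n_draws + 1) should be divisible
              by n_bins so each bin covers equally many rank values.
    """
    edges = np.linspace(0, n_draws + 1, n_bins + 1)
    counts, _ = np.histogram(ranks, bins=edges)
    expected = np.full(n_bins, len(ranks) / n_bins)
    return stats.chisquare(counts, expected)

# Toy example: ranks drawn exactly uniformly on {0, ..., 99}.
rng = np.random.default_rng(0)
ranks = rng.integers(0, 100, size=1000)
stat, p = chi2_uniformity(ranks, n_draws=99, n_bins=20)
```

With genuinely uniform ranks the p-value should usually be large, while a degenerate rank distribution drives it toward zero.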


\chi^2 evaluation between simulated and inferred parameters (1 = perfect correlation)

  • nl = number of simulated likelihoods
  • ns = number of samples (MCMC steps \times number of chains) per simulated likelihood


Centered 8 schools:

| Parameter | 500 ns, 200 nl | 5000 ns, 200 nl | 500 ns, 1000 nl | 5000 ns, 1000 nl |
| --- | --- | --- | --- | --- |
| \mu | 0.53 | 0.35 | 0.012 | 0.31 |
| \tau | 0.21 | 0.10 | 0.0007 | 0.012 |
| \theta | 0.45 | 0.48 | 0.55 | 0.73 |


Non-centered 8 schools:

| Parameter | 500 ns, 200 nl | 5000 ns, 200 nl | 500 ns, 1000 nl | 5000 ns, 1000 nl |
| --- | --- | --- | --- | --- |
| \mu | 0.97 | 0.18 | 0.70 | 0.02 |
| \tau | 0.90 | 0.72 | 0.52 | 0.64 |
| \theta | 0.39 | 0.49 | 0.56 | 0.58 |

Big concern:
In the aforementioned paper they show interpretations of histograms, for example how to identify a “too narrow” variance of a certain prior. But I haven’t really found a rule for how many simulated likelihoods / samples should be used.
Also, I don’t even know if these results make sense, because it seems odd that the results get worse as I increase the number of samples.

Questions related to the parameters:
I don’t understand why, as I change the number of samples or of simulated likelihoods, the \chi^2 values of the parameters do not always move in the same direction. For example, in the non-centered 8 schools, by increasing the number of samples I obtained a poorer \mu and \tau BUT a better \theta.

Questions related to informative data:
I’ve read in Centered vs. non-centered parameterizations that:
“centered actually works better when you have informative data (large N relative to σ) for a particular group, while non centered is better for uninformative data (small N relative to σ) for a particular group.”
Is this true, and if so, why am I not observing those results?

Questions related to thinning:
All these results were obtained using NUTS without thinning. The reason is that when I applied thinning, I obtained spikes at the boundaries (suggesting correlation between samples), which doesn’t make sense to me. Does that mean I must have an issue in my thinning process?
And is it really relevant to apply thinning with NUTS, since I assume there is less correlation than in a regular MCMC?
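For what it’s worth, the mechanical part of thinning is just a stride along the iteration axis, as in this minimal sketch (names illustrative); one thing worth checking is whether the rank computation downstream still assumes the pre-thinning number of draws, since that mismatch can distort the histogram:

```python
import numpy as np

def thin_draws(draws, thin):
    """Keep every `thin`-th draw along the iteration axis.

    draws : array of shape (n_chains, n_iterations, ...)
    Returns roughly n_iterations / thin draws per chain.
    """
    return draws[:, ::thin]

# 2 chains of 12 iterations, labeled 0..23 for easy inspection.
chains = np.arange(24).reshape(2, 12)
thinned = thin_draws(chains, thin=4)  # keeps iterations 0, 4, 8
```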

Thank you very much to all the people who will spend time reading this.
I’m writing this post because, after reading a lot of Stan documentation, I still have the feeling of swimming in an ocean of weirdness with these results.


The \chi^{2} test in the Cook, Gelman, and Rubin paper has some fundamental mathematical flaws. Many of your questions are addressed in the Simulation-Based Calibration paper, which introduces a more robust approach that resolves those issues.


Thank you so much for your quick response!

You’re right, the majority of my concerns related to the choice of the different ratios of samples/steps are addressed, thank you!

Indeed, I suspected that the problem might come from the \chi^2 test…

Do you think the main bottleneck of its application coupled with MCMC is the number of samples needed to obtain a “good enough” estimation of the true values (asymptotically approached by the empirical CDF values), or the autocorrelations?

From what I understand so far, there is then no good “metric” to assess the model (other than the histograms), which can be problematic for high-dimensional applications.

It appears that my implementation was already using the average ranking (proposed in the reference you mention at the end of your conclusion), which seems to be consistent, but the bottleneck remains the need for a global summary.

Would you be aware of any material investigating interesting summaries that identify a deviation from uniformity?


I have seen several cases of misleading global summaries, but if one is needed, I think more than one test should be used (e.g. the Kolmogorov–Smirnov test). I am not sure whether an absolute standard that dictates pass or fail can exist, but it might be a good approach to compare the distance to the uniform distribution under diverse metrics such as Wasserstein or maximum deviation.

Generally, graphical inspections are recommended, including the ECDF plots from the SBC paper. I recommend the SBC part of the Stan manual, especially this section, which shows the analysis of both an ill-specified model and a sampler suffering from difficult posterior geometry.


Here is the code that reports the p-value, maximum difference, and Wasserstein metric.
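A sketch of how these three summaries might be computed (assuming `bin_counts` holds the rank-histogram counts; this is an illustration, not the original implementation):

```python
import numpy as np
from scipy import stats

def uniformity_summaries(bin_counts):
    """p-value, maximum CDF difference, and a Wasserstein-style
    distance between normalized bin counts and the discrete uniform."""
    bin_counts = np.asarray(bin_counts, dtype=float)
    n_bins = len(bin_counts)
    total = bin_counts.sum()

    # Chi-square p-value against equal expected counts per bin.
    expected = np.full(n_bins, total / n_bins)
    _, p_value = stats.chisquare(bin_counts, expected)

    # Maximum deviation between the empirical and uniform CDFs.
    ecdf = np.cumsum(bin_counts) / total
    uniform_cdf = np.arange(1, n_bins + 1) / n_bins
    max_diff = np.abs(ecdf - uniform_cdf).max()

    # L1 distance between the two CDFs (Wasserstein-1 on bin
    # indices, up to a constant scale factor).
    wasserstein = np.abs(ecdf - uniform_cdf).sum()

    return p_value, max_diff, wasserstein
```

Perfectly uniform counts give a p-value of 1 and zero for both distances, while a degenerate histogram maximizes all three signals of deviation.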

Two things for discussion:

  1. Is the name “Wasserstein” appropriate for comparing two discrete distributions, the normalized bin_counts and the uniform? For distributions on \mathbb{R}^1 there is the identity
    W_1(P, Q) = \int_{-\infty}^{\infty} |F(x) - G(x)| \, dx,
    which is distinct from the total variation distance \sup_{B \in \mathfrak{B}} |P(B) - Q(B)|.

  2. Could the p-value be transformed into a more proper distance measure? Would this attempt be meaningful? If so, what conditions should be satisfied (e.g. the triangle inequality)?
    I’m not sure whether it’s relevant, but this paper suggests deriving a distance matrix from p-values based on multidimensional scaling. A similar question has been raised.
    The plot illustrates that higher p-values might be more deviated (it seems to me) from uniformity.
    Tagging @betanalpha, @torkar, @bnicenboim, @Dashadower, who might be interested.


SBC doesn’t require that the estimators are exact, just that they’re unbiased. Consequently, the only challenge in implementing SBC for Markov chain Monte Carlo is removing the autocorrelations as much as possible. Increasing the number of samples just makes the SBC assessment more sensitive to potential problems.
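To make the independence requirement concrete, here is a toy SBC loop using a conjugate normal model where the posterior draws are exact and independent by construction (all names and the model are illustrative); with MCMC output, `post_draws` would be autocorrelated, which is what thinning is meant to mitigate before ranking:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_draws = 500, 100

ranks = []
for _ in range(n_sims):
    # Prior: mu ~ Normal(0, 1); one observation: y ~ Normal(mu, 1).
    mu_true = rng.normal(0.0, 1.0)
    y = rng.normal(mu_true, 1.0)
    # Exact conjugate posterior: Normal(y / 2, sqrt(1 / 2)),
    # so these draws are independent by construction.
    post_draws = rng.normal(y / 2.0, np.sqrt(0.5), size=n_draws)
    # SBC rank: position of the prior draw among the posterior draws.
    ranks.append(int(np.sum(post_draws < mu_true)))

ranks = np.asarray(ranks)  # should be approximately uniform on {0, ..., 100}
```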

To be clear, SBC assesses the accuracy of a sampling procedure, and implicitly of posterior expectation value estimation, within the context of a specified model, not the model itself. The method says nothing about whether the specified model is useful in any given application.

The problem is that there is no single deviation from uniformity. The rank histogram can deviate in many different ways, and each distinct deviation says something different about the nature of the problem. Even if a single summary/test is designed to capture interpretable deviations, such as those discussed in the paper, it will largely ignore other possible deviations.

Many uniformity tests, for example Kolmogorov-Smirnov, are based on statistics that don’t correspond to any particularly interpretable deviation in the SBC case, and hence aren’t all that useful for automated testing. We considered trying to come up with template-like tests to capture the smiles/frowns/tilts discussed in the paper but ultimately the rank histogram was the most information dense way of presenting the results.

At the same time recall that even in high-dimensional models there are often only a few parameters/summaries of interest to the final application, and SBC is much more productive when those parameters are prioritized instead of trying to test every parameter at once.

In my opinion Wasserstein is most usefully interpreted as an integral probability metric that bounds differences in the expectation values of certain sets of functions instead of just differences in probabilities. See for example the discussion in


I think there is a misunderstanding here due to the fact that I haven’t provided enough information about the context. My long-term objective is to compare methods applied to hierarchical models.
I understand that it can be misleading since I’m deviating from the original aim of SBC; my objective is not really to tune the parameters based on their calibration on the posterior, but rather to evaluate the methods used with these parameters on the model.

This is why I’m not very concerned about having interpretable deviations; as a result, the benefit of using rank histograms is dramatically reduced.

I will try to use several integral probability metrics and look at their consistency with my averaged rank histograms to assess their relevance.
Thank you for sharing this interesting case study. However, even though I agree with you that Wasserstein bounds differences of expectation values rather than probabilities, I think it could still remain a viable metric even between probabilities (or here ‘ranks’), or is there a key factor that I’m missing here (assuming the MCMC satisfies a CLT)?

Unfortunately I don’t understand your goal so I can’t offer much constructive feedback.

Integral probability metrics define a sense of distance between distributions, in the particular context of Bayesian computation the distance between an approximate posterior distribution and the exact posterior distribution. Integral probability metrics can be interpreted in multiple ways, in some cases as deterministic bounds on differences in probabilities and in others as bounds on differences in expectation values. In most cases these two perspectives are dual to each other, but the expectation value perspective is often the most closely related to practice.

Central limit theorems bound how the evolving approximate posterior distribution constructed from a Markov chain converges to the exact posterior distribution as the length of the chain increases (at least in nice circumstances). In contrast to integral probability metrics these bounds are stochastic and not deterministic and they’re not uniform over a set of functions but rather vary from function to function.

Simulation based calibration doesn’t define a distance between an approximate posterior distribution and an exact posterior distribution. Instead it compares an ensemble of approximate posterior distributions and an ensemble of exact posterior distributions derived from a common prior model. Consistency is guaranteed if each paired comparison is close, but technically the ensemble might appear consistent even if some paired comparisons are not. Consequently rank uniformity doesn’t immediately tie into bounds on any of the individual paired comparisons.

If by “methods” you mean different computational algorithms, then you have to consider what kinds of comparisons are possible. Explicit bounds on most integral probability metrics for a given target distribution can’t be derived outside of very simple cases, and they typically aren’t amenable to numerical calculation either. One can construct empirical errors if enough exact expectation values are known, but that too is often unrealistic. The advantage of simulation-based calibration is that it doesn’t require knowing the structure of the true posterior distribution; the disadvantage is that the results hold only within the scope of a certain class of distributions and the results themselves are more qualitative than quantitative.

For some more discussion on the general challenges of validating probabilistic computational algorithms see

Thanks for this note. Some questions on this figure! @betanalpha

  1. Could you please give some examples of the three: theoretical validation, empirical validation, and empirical extension of theoretical validation? For theoretical, you mentioned a simple target distribution with a provable error bound. The only example I could think of was the normal distribution and Laplace approximation as a distribution-algorithm pair, which might lead to zero error. Is this a valid example? And I wish to make sure that viewing the Laplace approximation (and other approximation schemes, including VI) as an algorithm is OK.

  2. Could @avehtari’s comment on the posteriordb project be extended as an attempt to widen the territory of empirical validation, in terms of crowdsourcing the fit of posterior and algorithm?

  3. Is SBC under the category of empirical extension of theoretical validation? For example, based on the theoretical validation provided by the uniformity proof from the SBC paper under strict conditions (e.g. conditional independence of posterior and prior samples given the data), are we exploring the space by perturbing the conditions one by one? I hope there could be some measurable and continuous axis along which the space could be explored, both in terms of target distribution and algorithm. The combination of a prior parameter with its shape was quite chaotic, for one; have there been any attempts to order prior sets? If a distribution that could represent all CDFs exists, it might be reasonable to fix the distribution and perturb only its parameter.

Theoretical would include explicit integral probability metric bounds (for example total variation distance or Wasserstein-2 distance) on the convergence between a Markov chain’s N-step distribution and the target distribution. It would also include explicit verification that a central limit theorem exists for a given Markov transition and target distribution pair. Typically results like these are limited to low-dimensional target distributions or target distributions specified by strongly log-concave density functions.

Empirical would include the stat_comp_benchmark repository,

Empirical extension of theoretical validation would include working out what conditions obstruct a Markov chain Monte Carlo central limit theorem and how those conditions manifest in sampler behavior, so that they can be turned into empirical diagnostics. This is, for example, how the Hamiltonian Monte Carlo diagnostics are motivated. For work on pushing variational Bayes in this direction see

More generally a theoretical analysis would try to bound the finite error for more complicated target distributions, although in practice the theory can support only so much complexity.

Yes, any probabilistic computational algorithm would fit into this scheme, including Monte Carlo, importance sampling, Laplace approximation, variational Bayes, and more.

In theory yes, although the crowdsourcing aspect can be problematic. The problem is that for empirical validation to be useful the true values have to be accurate, and constructing problems where accurate answers can either be worked out analytically or estimated numerically with sufficient guarantees is challenging.

A unified place where practitioners can find target distributions with validated expectation values is a very good idea, but if the validation of those expectation values is not consistent then users will either end up being led astray by suspect results or waste time searching for the exceptional entries with good-enough validation. The hard work isn’t building a database but building a carefully validated database that users can trust.

I would classify SBC as empirical validation because it applies only to a fixed family of models and does not adapt to a given target distribution. For example even if an algorithm can be validated probabilistically for every posterior \pi(\theta \mid \tilde{y}) for all \tilde{y} \sim \int \mathrm{d} \theta \, \pi(y, \theta) \, \pi(\theta) there are no guarantees that the algorithm will return faithful results for \pi(\theta \mid \tilde{y}') where \tilde{y}' is not sampled from the prior predictive distribution.

One of the goals of a useful workflow is to build a model consistent with the observed data, in which case the corresponding SBC would provide as close of a validation of the observed posterior as possible, but the validation is still based on assuming that the target distribution is close enough to the scope of the empirical validation.


This is just an intuition, but I think that in hierarchical models, hyperparameters (\tau, \mu) might not need to be checked as strictly as parameters (\theta). Hyperparameters being well recovered would be a necessary condition for good parameter recovery, and propositions such as “all parameters passing the SBC test guarantees good results for hyperparameters” could be made.

So if your aim is to test the overall model, not how well a specific hyperparameter is recovered, it might be economical to test only the parameters.

I wonder what other SBC people might think about this!