One of the challenges in discussions like these is the heterogeneity of terminology, including fundamental concepts like “inference” and “Bayesian”, let alone less established terms like “calibration” or “simulation-based calibration”, the latter of which seems to be interpreted differently by everyone.

Let me try to put everything into a common perspective and address some of the terminology variation along the way. I’ll largely be following Probabilistic Modeling and Statistical Inference and Section 1.3 of Towards A Principled Bayesian Workflow, and you can consult those documents for further details.

Generally “inference” refers to quantifying which model configurations are “consistent” with the observed data in some way. There is no canonical definition of “consistent”, however, and so there is no canonical definition of “inference”; every different definition suggests a different approach to inference.

# Inference

Frequentist inference doesn’t actually make any claim about how to define “consistent”. Instead it buries that choice in the choice of *estimator* \hat{\theta}, a deterministic function that takes in observed data and returns an *estimate* that picks out some set of model configurations. For example a point estimator returns a single “consistent” model configuration while an interval estimator returns an entire interval of “consistent” model configurations.

Bayesian inference takes a stronger stand, applying probability theory to define “consistency”. This immediately results in the use of Bayes’ Theorem and the use of posterior distributions as “inferences”.

Given an observation we are then, in some sense, done. We can plug the observed data into an estimator to get an estimate, or plug the observed data into a Bayesian model to get an entire posterior distribution.

# Frequentist “Calibration”

Frequentist inference, however, doesn’t stop there. With so many possible estimators one has to consider some criteria to narrow down the possibilities to something particularly useful. Frequentist methods quantify the performance of an estimator by considering how it behaves over a range of hypothetical observations drawn from some observational model. In other words they consider how frequently the estimator does well and how frequently it does poorly, hence the name “frequentist”.

More formally we need to define some observational model \pi(y; \theta) from which to draw data, hopefully one that well-approximates the true data generating process responsible for actual observations, and a utility function U(\hat{\theta}, \theta) that quantifies how well an estimator performs. Then for fixed “true” \theta we can define the performance as the expectation value

\bar{U}(\theta) = \int \mathrm{d} y \, \pi(y ; \theta) \, U( \hat{\theta}(y), \theta).

In terms of simulations we could generate

\tilde{y}_{s} \sim \pi(y ; \theta)
\\
\tilde{U}_{s} = U( \hat{\theta}(\tilde{y}_{s}), \theta),

and then average the \tilde{U}_{s} to approximate \bar{U}(\theta).
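As a concrete sketch of this simulation, consider a hypothetical setup (not from the original text) where the data are drawn from a normal observational model, the estimator is the sample mean, and the utility is negative squared error, so that \bar{U}(\theta) = -\sigma^{2} / n can be checked analytically:

```python
import random
import statistics

# Hypothetical example: y_1, ..., y_n drawn iid from normal(theta, sigma),
# estimator = sample mean, utility = negative squared error.
def estimator(y):
    return statistics.mean(y)

def utility(theta_hat, theta):
    return -(theta_hat - theta) ** 2

def calibrate(theta, sigma=1.0, n_obs=10, n_sims=5000, seed=8675309):
    # Average the simulated utilities to approximate U-bar(theta).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        y_sim = [rng.gauss(theta, sigma) for _ in range(n_obs)]
        total += utility(estimator(y_sim), theta)
    return total / n_sims

# For the sample mean this should be close to -sigma**2 / n_obs = -0.1.
u_bar_est = calibrate(theta=1.0)
```

The particular model, utility, and simulation sizes here are all illustrative choices; the point is only the structure of the simulate-then-average loop.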

If there is more than one possible value for \theta then things become a bit more complicated because we have to find a way of aggregating the \bar{U}(\theta) into a single value without averaging, which would be equivalent to applying probability theory to the parameters. The typical approach is to look at extremes, reporting the worst case behavior

U^{*} = \min_{\theta} \bar{U}(\theta).

For example if we take U(\hat{\theta}, \theta) = \hat{\theta} - \theta then

\bar{U}(\theta) = \int \mathrm{d} y \, \pi(y ; \theta) \, \big( \hat{\theta}(y) - \theta \big)

would define the estimator “bias” for each possible “truth” \theta. An unbiased estimator would satisfy \bar{U}(\theta) = 0 for all \theta which is possible only for relatively simple models \pi(y ; \theta).

Once the U^{*} or the individual \bar{U}(\theta) have been evaluated then they can also be used to choose some “best” estimator. Alternatively they can be used to evaluate different choices that lead to different observation models \pi(y ; \theta), such as the number of observations or the kinds of observations. This approach of using hypothetical, ensemble behavior to tune the configuration of a measurement is often referred to as *experimental design*.
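To make the experimental-design idea concrete, here is a hypothetical sketch (again with an invented normal model, sample-mean estimator, and negative-squared-error utility) that evaluates the worst-case utility U^{*} over a grid of “true” \theta values for different numbers of observations:

```python
import random
import statistics

# Hypothetical experimental-design sketch: for each candidate number of
# observations, approximate U-bar(theta) over a grid of "true" theta
# values and report the worst case U*.
def u_bar(theta, n_obs, n_sims=2000, seed=0):
    # Negative squared error of the sample mean under normal(theta, 1) data.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        y_sim = [rng.gauss(theta, 1.0) for _ in range(n_obs)]
        total += -(statistics.mean(y_sim) - theta) ** 2
    return total / n_sims

def u_star(n_obs, thetas=(-2.0, -1.0, 0.0, 1.0, 2.0)):
    # Aggregate across the possible truths by taking the worst case.
    return min(u_bar(theta, n_obs) for theta in thetas)

# Comparing u_star(5) to u_star(40) shows how more observations
# improve the worst-case utility.
```

Here the grid of \theta values stands in for whatever range of truths is relevant in a given application.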

This procedure is so ingrained into frequentist methodology that it’s often taken for granted. Consequently precise terminology for the process of using hypothetical, ensemble behavior to quantify estimator performance is often non-existent. I refer to this procedure as “calibration”, although some reserve “calibration” for the process of tuning estimators based on the output of the procedure. I’ve struggled to find a less ambiguous term for this procedure but so far no luck.

# Bayesian “Calibration”

Bayesian inference doesn’t require the choice of an estimator, but we might still be curious how the behavior of the posterior distribution varies across possible observations. To be clear some are philosophically opposed to this, restricting all consideration to only real observations, but this kind of analysis is ingrained all over industry and the sciences, and I am personally very much in favor of it.

The posterior distribution could be dropped right into the frequentist evaluation framework if we could come up with some utility function that consumes not a single estimate but rather an entire probability distribution. For example one might define something that integrates over the entire posterior distribution,

U(\pi(\theta \mid \tilde{y}), \theta) = \int_{\theta}^{\infty} \mathrm{d} \theta' \, \pi(\theta' \mid \tilde{y}).

Alternatively one could first reduce the posterior distribution to a single estimate, such as the posterior mean or median, and then drop that into the frequentist framework.
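Both options can be sketched in a hypothetical conjugate example (not from the original text) where the posterior is available in closed form: prior \theta \sim \text{normal}(0, \tau), one observation y \sim \text{normal}(\theta, \sigma). One utility consumes the full posterior via the exceedance probability above, the other reduces the posterior to its mean first:

```python
import math

# Hypothetical conjugate sketch: theta ~ normal(0, tau), y ~ normal(theta, sigma),
# so the posterior is normal(mu_post, sd_post) with known formulas.
def posterior(y, sigma=1.0, tau=1.0):
    var_post = 1.0 / (1.0 / tau ** 2 + 1.0 / sigma ** 2)
    mu_post = var_post * y / sigma ** 2
    return mu_post, math.sqrt(var_post)

def exceedance_utility(y, theta):
    # U = posterior mass above the true value, integrating over the
    # entire posterior distribution.
    mu, sd = posterior(y)
    return 1.0 - 0.5 * (1.0 + math.erf((theta - mu) / (sd * math.sqrt(2.0))))

def point_utility(y, theta):
    # Reduce the posterior to its mean, then apply a frequentist-style
    # negative squared error utility to that point estimate.
    mu, _ = posterior(y)
    return -(mu - theta) ** 2
```

The closed-form posterior is just a convenience here; in practice these utilities would be computed from posterior samples.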

One huge advantage of the Bayesian approach is that we have the full power of probability theory on our side. Once we construct \bar{U}(\theta) then we can average the values together instead of just looking at extremes. Averaging, however, requires a choice of probability distribution over \theta and a natural choice is the prior model,

\bar{U}(\theta) = \int \mathrm{d} y \, \pi(y ; \theta) \, U( \pi(\theta \mid y), \theta)
\\
\bar{U} = \int \mathrm{d} \theta \, \pi(\theta) \, \bar{U}(\theta),

or altogether

\begin{align*}
\bar{U}
&= \int \mathrm{d} \theta \, \pi(\theta) \int \mathrm{d} y \, \pi(y ; \theta) \, U( \pi(\theta \mid y), \theta)
\\
&= \int \mathrm{d} \theta \, \mathrm{d} y \, \pi(y ; \theta) \, \pi(\theta) \, U( \pi(\theta \mid y), \theta)
\\
&= \int \mathrm{d} \theta \, \mathrm{d} y \, \pi(y, \theta) \, U( \pi(\theta \mid y), \theta).
\end{align*}

We can estimate this integral using the Monte Carlo method and simulations of model configurations from the prior and data from the corresponding data generating process,

\tilde{\theta}_{s} \sim \pi(\theta)
\\
\tilde{y}_{s} \sim \pi(y ; \tilde{\theta}_{s})
\\
\tilde{U}_{s} = U( \pi(\theta \mid \tilde{y}_{s}), \tilde{\theta}_{s} ).

In fact, instead of looking at just the average of the \tilde{U}_{s}, we can use the samples to quantify the entire *distribution* of utilities. This can be much more informative. For example by correlating the \tilde{U}_{s} and \tilde{y}_{s} one can identify the kinds of observations that lead to particularly pathological posterior behavior.
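The full prior-predictive loop can be sketched with the same hypothetical conjugate normal model as before (\theta \sim \text{normal}(0, 1), y \sim \text{normal}(\theta, 1), exact posterior \text{normal}(y / 2, \sqrt{1/2})), keeping the whole distribution of utilities rather than just its average:

```python
import math
import random

# Hypothetical Bayesian calibration sketch: sample from the joint
# distribution, evaluate a utility that consumes the full (closed-form)
# posterior, and collect the resulting distribution of utilities.
def simulate_utilities(n_sims=4000, seed=1):
    rng = random.Random(seed)
    utilities = []
    for _ in range(n_sims):
        theta_sim = rng.gauss(0.0, 1.0)       # theta_s ~ pi(theta)
        y_sim = rng.gauss(theta_sim, 1.0)     # y_s ~ pi(y ; theta_s)
        mu, sd = y_sim / 2.0, math.sqrt(0.5)  # exact posterior normal(mu, sd)
        # Utility: posterior mass above the simulated "truth".
        u = 1.0 - 0.5 * (1.0 + math.erf((theta_sim - mu) / (sd * math.sqrt(2.0))))
        utilities.append(u)
    return utilities

# For this utility and an exact posterior the utilities are uniform on
# [0, 1]; their histogram, not just their mean, is informative.
us = simulate_utilities()
```

Plotting `us` against the corresponding `y_sim` values would be the kind of correlation analysis described above.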

That said the prior model isn’t a necessity! One can average over any distribution of model configurations, for example distributions that concentrate on certain model configurations of interest or encode certain adversarial behaviors.

To summarize: quantifying inferential behavior over an ensemble of hypothetical observations is straightforward in Bayesian inference once we determine a utility function that consumes the entire posterior distribution. Alternatively if we reduce the entire posterior distribution to a point or interval estimate then we can just drop that into a frequentist evaluation. The huge benefit of Bayesian calibration is that it also allows us to incorporate variation in \theta and not be limited to worst case behavior.

# “Simulation-Based Calibration”

“Simulation-Based Calibration”, or “SBC”, was introduced in [1804.06788] Validating Bayesian Inference Algorithms with Simulation-Based Calibration and is only awkwardly related to the above analyses. Unfortunately the name is terribly broad for what the method actually does, which has led to no end of confusion.

The actual “SBC” method takes advantage of a Bayesian ensemble self-consistency condition that holds for *any* model,

\pi(\theta') = \int \mathrm{d} y \, \mathrm{d} \theta \, \pi(\theta' \mid y) \, \pi(y ; \theta) \, \pi(\theta).

In terms of simulations

\tilde{\theta}_{s} \sim \pi(\theta)
\\
\tilde{y}_{s} \sim \pi(y ; \tilde{\theta}_{s})
\\
\tilde{\theta}'_{s} \sim \pi(\theta \mid \tilde{y}_{s})

this self-consistency condition implies that the \tilde{\theta}'_{s} should be indistinguishable from prior samples. The actual method is a bit more complicated, but essentially it compares the \tilde{\theta}'_{s} to the \tilde{\theta}_{s} in a careful way to construct a histogram that will always be uniform.

That is unless the simulations aren’t generated correctly. For example if the posterior samples \tilde{\theta}'_{s} \sim \pi(\theta \mid \tilde{y}) are erroneous then the SBC histogram will be skewed away from uniform. This provides a way to check for the accuracy of the computational method that generates posterior samples.
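The core of the comparison can be sketched with the rank statistic from the SBC paper, again in a hypothetical conjugate normal model where exact posterior samples are available in closed form; the rank of each prior draw among its posterior draws should be uniform:

```python
import math
import random

# Hypothetical SBC sketch: theta ~ normal(0, 1), y ~ normal(theta, 1),
# with exact posterior samples from normal(y / 2, sqrt(1 / 2)).  The rank
# of theta_s among its posterior draws is uniform on {0, ..., n_post}.
def sbc_ranks(n_sims=2000, n_post=9, sd_error=0.0, seed=2):
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta_sim = rng.gauss(0.0, 1.0)
        y_sim = rng.gauss(theta_sim, 1.0)
        # sd_error != 0 mimics an erroneous posterior sampler.
        mu, sd = y_sim / 2.0, math.sqrt(0.5) + sd_error
        post = [rng.gauss(mu, sd) for _ in range(n_post)]
        ranks.append(sum(p < theta_sim for p in post))
    return ranks

# With exact posterior samples each rank in {0, ..., 9} appears roughly
# n_sims / 10 times; setting sd_error != 0 skews the histogram.
ranks = sbc_ranks()
```

The `sd_error` knob is an invented stand-in for a computational bug; with it set to zero the rank histogram is uniform no matter how informative or uninformative the posteriors are, which is exactly the point made below.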

Critically, if the posterior computation is accurate then the SBC method will return a null result *no matter the ensemble behavior of the posterior distributions*. SBC as introduced in that paper gives no consideration to the inferential performance of the model.

To avoid this confusion I refer to studies of the hypothetical ensemble behavior of an estimator or posterior distribution as “inferential calibration” and ensemble studies sensitive to computational problems as “algorithmic calibration”, although I’m pretty sure that no one else uses this terminology, which leaves the term “calibration” horrendously overloaded.

This is especially true in practice. Note that the simulations

\tilde{\theta}_{s} \sim \pi(\theta)
\\
\tilde{y}_{s} \sim \pi(y ; \tilde{\theta}_{s})

are used in both the algorithmic calibration of “SBC” and the frequentist/Bayesian inferential calibrations that we discussed above. Both consider hypothetical ensemble behavior, but they do it for very different purposes.

#
Long Story Short

If you want to compare a Bayesian method to mostly frequentist methods then it’s probably easiest to just reduce the posterior distribution to a point estimate and apply the same frequentist calibration to all of them.

If you want to be a bit more sophisticated about the parameter dependence of the frequentist calibration then you could average the \bar{U}(\theta) over some relevant distribution, such as the prior model.

If you just want to compare different Bayesian models then a utility function that takes into account the entire posterior distribution can be especially useful.