How do I evaluate my model, in terms of bias, coverage, etc.?

Ok, I’ve spent a few days looking for the best practice of evaluating my model in this case.

With a frequentist MLE model, what I do is:

  1. simulate N datasets with known \theta,
  2. find the MLE for \theta.
  3. Calculate biases, variance, MSE, coverage…

Which I can then use to say something about how well my model behaves given parameter \theta.

Now, I tried replicating this procedure with a Bayesian model, but a few things I notice are:

  1. I need to find a prior which can be fit automatically, but simply using the \theta as prior feels like cheating because in the real world I wouldn’t know the actual \theta. I thought of fitting a simple glm in an attempt to replicate having prior knowledge on \theta, and using that as a prior distribution mean.
  2. MCMC produces a distribution, and not an actual point estimate.

So my workflow currently is:

  1. simulate N datasets with known \theta,
  2. find a prior for \theta using GLM
  3. Use the prior in stan, produce posterior distribution
  4. Calculate biases, variance, and MSE based on posterior mean. Calculate coverage based on posterior quantiles.

But I don’t see this workflow mentioned anywhere I looked, and it feels like there should be a more efficient way of doing this.

I read a bit on posterior predictive check, but it seems like it doesn’t do quite what I want, which is ultimately answering the question: “Given N data sets - how does this model perform in actually finding \theta?”

One thing I thought about doing is using stan’s optimization, which will produce a point estimate, and I can then calculate the hessian and get the SEs to use for coverage - Is this a reasonable approach?

1 Like

I think the most comprehensive reference here would be the Bayesian Workflow paper: [2011.01808] Bayesian Workflow

1 Like

Reading this, it seems I am interested in part 4 of the workflow - working with fake data.

What I am doing is basically an SBC, but instead of sampling from a prior distribution - I’m sampling from a single point prior. For example, a simulation would be something like:

sim <- replicate(1000, rpois(N, lambda)

Where I have the same lambda throughout all simulated data. If I understand correctly, an SCB will look more like:

lambda <- rnorm(1, lambda_mean, prior_sd) # defining a normal distribution prior just for simplicity here
sim <- rep(1000, rpois(N, lambda)

I understand how this better connects the posterior distribution with simulations, as the latter are derived from a distribution as well. And reading the Stan user guide, it seems like every iteration of the MCMC both creates fake data and fits the data, so this is way (way) more efficient than what I did previously.

So, my new Stan code will take this form?

transformed data {
 data generation process



generated quantities {
calculate bias/ etc.

Now it’s a question of my prior choice… What I am trying to do, is find the parameters that define the unobserved grey curve which is used from data generation.

My methodology now is using a glm to find the parameters for the red curve, and use these a priors. It’s not perfect but that’s the only way I can think of to simulate a real world scenario. So let’s say \beta is the estimated parameter from the red curve.

The issue I have is that, with not enough data, I will get posterior estimates which are simply my model priors, and hence in past iterations of this work, with real value \theta I got biases of 0 (when using \theta as model prior), \theta (when using 0 as prior mean), and lastly \beta - \theta (I think, need to double check) when using \beta as prior.

Ok, I translated the DGP to Stan and follow the instructions here:

functions {
  array [,] int simulate_introduction_rng(int N, int B_total, array [] int sampling_vector, real b0_sim, real b1_sim){
    array[N,7] int out_mat;
    vector[N] rate_sim;
    vector[N] prob_invasive;
    array[N] int samples;
    array[N] int yearly_introduced;
    array[N] int detected_i_species; 
    array[N] int undetected_i_before;
    array[N] int undetected_i_after;
    array[N] int detected_i_so_far;
    array[N] int detected_n_species;
    array[N] int undetected_n_before;
    array[N] int undetected_n_after;
    array[N] int detected_n_so_far;
    array[N] int t;
    for (i in 1:N) {
      t[i] = i;
      rate_sim[i] = exp(b0_sim + b1_sim * i);
    yearly_introduced = poisson_rng(rate_sim);
    detected_i_species  = rep_array(0, N);
    undetected_i_before = rep_array(0, N);
    undetected_i_after  = rep_array(0, N);
    detected_i_so_far   = rep_array(0, N);
    detected_n_species  = rep_array(0, N);
    undetected_n_before = rep_array(0, N);
    undetected_n_after  = rep_array(0, N);
    detected_n_so_far   = rep_array(0, N);
    undetected_i_before[1] = yearly_introduced[1];
    undetected_n_before[1] = B_total;
    prob_invasive[1] = 1.0 * undetected_i_before[1] / (undetected_n_before[1] + undetected_i_before[1]);
    samples[1] = binomial_rng(sampling_vector[1], prob_invasive[1]);
    detected_i_species[1] = samples[1];
    detected_n_species[1] = sampling_vector[1] - samples[1];
    undetected_i_after[1] = undetected_i_before[1] - detected_i_species[1];
    undetected_n_after[1] = undetected_n_before[1] - detected_n_species[1];
    detected_i_so_far[1] = cumulative_sum(detected_i_species)[1];
    detected_n_so_far[1] = cumulative_sum(detected_n_species)[1];
    for (i in 2:N){
      undetected_i_before[i] = yearly_introduced[i] + undetected_i_after[i-1];
      undetected_n_before[i] = undetected_n_after[i-1];
      prob_invasive[i] =  1.0 * undetected_i_before[i] / (undetected_n_before[i] + undetected_i_before[i]);
      samples[i] = binomial_rng(sampling_vector[i], prob_invasive[i]);
      detected_i_species[i] = min(samples[i], undetected_i_before[i]);
      detected_n_species[i] = min((sampling_vector[i] - samples[i]), undetected_n_before[i]);
      undetected_i_after[i] = undetected_i_before[i] - detected_i_species[i];
      undetected_n_after[i] = undetected_n_before[i] - detected_n_species[i];
      detected_i_so_far[i] = cumulative_sum(detected_i_species)[i];
      detected_n_so_far[i] = cumulative_sum(detected_n_species)[i];
    out_mat[,1] = t;
    out_mat[,2] = yearly_introduced;
    out_mat[,3] = detected_i_species;
    out_mat[,4] = detected_i_so_far;
    out_mat[,5] = undetected_n_before;
    out_mat[,6] = detected_n_species;
    out_mat[,7] = detected_n_so_far;
    return out_mat;

data {
  int <lower = 1> N;  // number of rows in the data
  int <lower = 1> B_total; // assumed number of ind in group B
  array[N] int <lower = 1> sampling_vector; // number of 
  real b0_mu; // prior for betas
  real b1_mu; // prior for betas

transformed data {
  real b0_sim = normal_rng(b0_mu, 0.1); // params for data creation
  real b1_sim = normal_rng(b1_mu, 0.001);   // params for data creation
  array[N,7] int <lower = 0> sampling_mat;
  sampling_mat = simulate_introduction_rng(N, B_total, sampling_vector, b0_sim, b1_sim);

parameters {
  real b0;
  real b1;

transformed parameters {
  vector<lower = 0>[N] unrecorded_invasives;
  vector[N] rate;
  for (i in 1:N){
    rate[i] = exp(b0 + b1 * i);
    unrecorded_invasives[i] = cumulative_sum(rate)[i] - sampling_mat[i,4];

  for (i in 1:N){
    sampling_mat[i,3] ~ beta_binomial(sampling_vector[i], unrecorded_invasives[i], sampling_mat[i,5]);
  b0  ~ normal(b0_mu, 0.1);
  b1  ~ normal(b1_mu , 0.001);

generated quantities {
  array[2] int <lower = 0, upper = 1> lt_sim
      = { b0 < b0_sim , b1 < b1_sim };

I’m left with a rank for this specific run. So, the next step would be something like this:

data_scb <- list(
  N = 150, 
  sampling_vector = rpois(150, 10),
  B_total = 1000,
  b0_mu = 1,
  b1_mu = 0.01

n_trials = 30
ranks <- matrix(NA, n_trials , 2)
for (n in 1:n_trials ){
  samples <- scb_model$sample(data = data_scb, 
                        chains = 3, 
                        parallel_chains = 3, 
                        iter_warmup = 200,
                        iter_sampling = 1000, thin = 2,
                        show_messages = F, refresh = 0)
  specific_rank <- as_draws_rvars(samples$draws(variables = "lt_sim"))
  specific_rank <- draws_of(specific_rank $lt_sim)
  ranks[n,] <- colSums(specific_rank )

If I understand, then I am hoping to get a uniform distribution for the ranks, between 0 and number of iterations. That indicates that the model “works”.

But still, it seems like it doesn’t answer my question, does it?

But the more I think about it, these simulations should assume my prior is correct…

@martinmodrak and @hyunji.moon might have insights for you.


I think at least part of the confusion is the difference in goals for the frequentist and Bayes approaches. In freq world you can construct many different estimators for the same quantity and those will have different properties (bias, variance, …). So you care both whether your estimator is good and whether it’s implemented correctly. In a Bayesian analysis, you seek the posterior distribution, which is uniquely defined by your model. So you “only” care whether the posterior is computed correctly, which is what SBC is good at.

To run SBC in R, I encourage you to look at the SBC R package which I co-develop. An IMHO good intro to the how and why of SBC is also in the introduction to the SBC and test quantities preprint.

Still, even as a Bayesian, you might often want to use the posterior to make a decision. Given a utility function, you can evaluate the expectation of the utility across the decisions reached over whole prior (or just some selected values). You can recover the counterparts to common freq. properties with correct choice of decision + utility (e.g. if the decision is to pick a point estimate, you may take the posterior mean and take negative distance from true value as utility and you get something like bias). But you can choose the decision space and utility to match your actual real world problem. You can also define action space and utility and then find the decision rule that maximizes expected utility.

Does that make sense?


Prior-posterior, global-local, hardware-software are all relative concept, so I would just say don’t feel bad about the above feeling.

Be generous in what you perceive as data, by including the byproduct of your workflow (misfit of managerial, statistical and computational model) which informs developing family of models through iterative update.

If there is an existing model A and I propose a new model B, and if I want to compare the performance of the two models, can I use SBC? As far as I know, the most common practice is to simulate N dataset with known parameters θ and use the two models to get the estimate of θ (for example, posterior mean), and then calculate bias and MSE. The model with smaller bias and MSE is prefered.

It did until the end. I am not yet familiar with these concepts.

My end goal is to examine when my model works and when it doesn’t, by simulating many datasets with known properties. For example, how does the length of a time series affect bias or coverage. So I would simulate different time series of varying lengths, set a utility function coverage as the proportion of times the true value is in the posterior distribution, or bias as you defined it.

I ran some SBC using the SBC package (which is very intuitive to use, thank you), for a short and long time series:

So that would mean that the model is correctly calibrated?

1 Like

Those are concepts from decision theory.

This means it is not strongly miscalibrated. It looks like you have run about 100 simulations, which is very likely to uncover major problems, but some minor miscalibrations may still be hiding somewhere - you can use empirical_coverage and/or plot_coverage / plot_coverage_diff to see how big miscalibrations are still consistent with your simulation runs.

I think there might be some residual misunderstanding: if you are simulating data exactly as the model assumes, the only source of miscalibration or overly low/high coverage is a problem in computation or implementation of your model (this is unlike the frequentist case where there could be other problems).

If you are simulating data in a different way than the model assumes, than you are almost guaranteed that there will be a miscalibration. Still, it might be possible the miscalibration due to mismatch between the way you simulate the data and your model is very minor and so you have a sort of robustness check (“even if the assumptions of the model are not completely met, I get good calibration”).

I am not sure I understand what your goal here is. If the computation is correct and your simulation process matches the model, you are guaranteed to get nominal coverage (i.e. that x% interval contains the true value _0_x% of the time for any x). My talk about decision theory was meant to highlight that even when this is the case, you might be unable to make good actual decisions (e.g. because your data do not inform your parameters well enough). So for example if the time series represented price of a commodity, you could use the model to predict the future price and then evaluate the expected utility of buying now vs. buying at some later point. There is no need to do decision analysis like that, I just wanted to highlight the possibility.

Bias as I tried to define it would be something different than what you write - it will be the expected difference between the posterior mean and the true value.

At this point I am really not sure if my comments clarify more than confuse - feel free to ask more specific questions if you think it would help :-)

Not really - and I think this hinges on the difference between a “model” and an “estimator” (in the frequentist sense). In Bayesian statistics, you just need a model, basically a probabilistic description of how you believe the data arise from some unobserved parameters. You can use SBC to validate that A and/or B is implemented correctly and poses no problem for your sampler (e.g. Stan’s adaptive Hamiltonian Monte Carlo). If your computation works, than the posterior distribution is in many ways the optimal way to capture your uncertainty about the model parameters. You want to minimize MSE? Use the posterior mean. You want to minimize MAD? Use the posterior median.

Unsurprisingly, if you generate data under the assumptions of model A, you will get better performance fitting model A than model B (and vice versa). SBC (and simulations in general) however cannot tell you which of those two models is better for a given real dataset, for that you would usually need to use actual real data.

An exception would be in cases where i) you know that all models you could feasibly implement do not really capture the real data well and ii) you have access to a simulator that is believed to create something closer to real data than the actual model you implemented (e.g. some clever resampling scheme of a real dataset). Then it might make sense to have two models - both known to be imperfect - to compete on such simulated data.

Contrast that with the frequentist case: under the same statistical model, you can still derive multiple different estimators which will generally have different properties (e.g. a maximum-likelihood estimator vs. Firth’s bias corrected estimator for logistic regression). It then makes sense to observe and evaluate the behaviour of those estimators on data simulated under the model. In my (limited) understanding of frequentist statistics this is due to the fact that actually most frequentist tasks are hard and although in theory there would often exist an estimator that is optimal in some sense, this estimator is not tractable and most frequentist computation is only approximate. (happy to be corrected if this view is wrong/incomplete)


I tend to define Simulation-based calibration broadly: every checking that is based on self-consistency introduced in sec.1.1 of this sbc paper. A diagram summary of sbc’s role in bayesian computation can be found on p.17-23 of this slide I used for sbc talk.

There are bunch of precision hyperparameters that affects sbc results under the surface, such as S (number of prior draws), M (number of posterior draws), N (number of data, mostly time series), … See Figure 1 from this Bayesian taxanomy paper (Algorithm’s “setting”).

Is “known parameter” prior draws from a distribution or fixed scalar value? Fixed scalar value makes bias and MSE calculation easy, but less robust than prior draws from a known distribution. The problem with the prior draws from a known distribution is, first you run into a problem of precision (described above, how many prior draws is enough), second there is no fixed distance metric you can use to measure model performance. Performance can differ depending on the chosen metric, but Wasserstein is recommended in this post.

Thanks, I should state what my goal is to avoid confusion.

I am proposing a model to estimate a coefficient \theta and examine it against other (frequentist) estimators.

So for the frequentist estimators, I ran simulations with different scenarios where I knew the real \theta and examined bias: namely different simulated values of \theta and different length of simulated data N (e.g 1000 simulations will have \theta = 0.05 and N = 100, another 1000 simulations will have \theta = 0 and N = 50). I can then examine when that estimator is suitable for use (for example, high bias for N = 50 will indicate that the estimator is not a good choice when you only have 50 data points).

My understanding of bayesian inference is that I can have a probabilistic model that explains the data, and I will get the posterior distribution of \theta_{estimated} given the data and the prior distribution. While that posterior distribution is not exactly the same as the estimator for \theta in the frequentist sense, it still “answers” the question “Given that this is what I observe, what is the underlying \theta?”

So, if I propose a model that estimates \theta, I want to provide whoever uses that model with the tools to decide when to use that model.

Let’s say N = 50 and I see that the frequentist estimator is biased. Ideally, I could say “for N = 50, use my model”. So the question is, how do I corroborate that statement?

1 Like

The SBC perspective, and Bayesian perspective generally, is that we don’t know \theta. Thus, to know whether to expect a model to yield estimate of \theta that is unbiased and has good coverage properties, we need to check this across the range of plausible values for \theta. In Bayesian inference, we usually include in our model some explicit statement about the range of plausible values; this is the prior. So SBC is designed to check for good estimation across the entire range of the prior.

You’re right, however, that the performance of Stan’s estimators can vary with sample size (though we hope to generally catch problems via Stan’s diagnostics; for example, as we change the sample size problems might manifest as divergences). Thus, if I were in your position, what I would want to do is to run full SBC across a range of sample sizes, and to assert that your implementation of your model works well across the range of sample sizes where you pass SBC. If within this range you can find combinations of sample size and \theta where the frequentist estimators perform poorly, then it would be reasonable to follow up and make sure that SBC didn’t miss anything pathological that happens in that particular neighborhood of parameter space (particularly if \theta is high-dimensional, in which case the exploration of parameter space in SBC is bound to be sparse).


One of the challenges in discussion like these is the heterogeneity in terminology, including fundamental concepts like “inference” and “Bayesian” let alone less established terms like “calibration” or “Simulation-based calibration” the latter of which seems to be interpreted differently by everyone.

Let try to put everything into a common perspective and address some of the terminology variation along the way. I’ll largely be following along Probabilistic Modeling and Statistical Inference and Section 1.3 of Towards A Principled Bayesian Workflow, and you can consult those documents for further details.

Generally “inference” refers to quantifying some model configurations that are “consistent” with the observed data is some way. Now there is no canonical definition of “consistent” and so there is no canonical definition of “inference”; every different definition suggests a different approach to inference.


Frequentist inference doesn’t actually make any claim how to define “consistent”. Instead it buries that choice in the choice of estimator \hat{\theta}, a deterministic function that takes in observed data and returns an estimate that contains some choice of model configurations. For example a point estimator returns a single “consistent” model configuration and an interval estimator returns an entire interval of “consistent” model configurations.

Bayesian inference takes a stronger stand, applying probability theory to define “consistency”. This immediately results in the use of Bayes’ Theorem and the use of posterior distributions as “inferences”.

Given an observation then we are in some sense done. We can plug the observed data into an estimator to get an estimate or plug the observed data into a Bayesian model to get an entire posterior distribution.

Frequentist “Calibration”

Frequentist inference, however, doesn’t stop there. When there are so many choices of estimators one has to consider some criteria to narrow down the possibilities to something particularly useful. Frequentist methods quantify the performance of an estimator by considering how it behaves over a range of hypothetical observations drawn from some observational model. In other words it considers the how frequently the estimator does well and how frequently the estimator does poorly, hence the name “frequentist”.

More formally we need to define some observational model \pi(y; \theta) from which to draw data, hopefully one that well-approximates the true data generating process responsible for actual observations, and a utility function U(\hat{\theta}, \theta) that quantifies how well an estimator performs. Then for fixed “true” \theta we can define the performance as the expectation value

\bar{U}(\theta) = \int \mathrm{d} y \, \pi(y ; \theta) \, U( \hat{\theta}(y), \theta).

In terms of simulations we could generate

\tilde{y}_{s} \sim \pi(y ; \theta) \\ \tilde{U}_{s} = U( \hat{\theta}(\tilde{y}_{s}), \theta),

and then average the \tilde{U}_{s} to approximate \bar{U}(\theta).

If there is more than one possible value for \theta then things become a bit more complicated because we have to find a way of aggregating the \bar{U}(\theta) into a single value without averaging which would be equivalent to using probability theory on the parameters. The typical approach is to look at extremes, reporting the worst case behavior

U^{*} = \min_{\theta} \bar{U}(\theta).

For example if we take U(\hat{\theta}, \theta) = \hat{\theta} - \theta then

\bar{U}(\theta) = \int \mathrm{d} y \, \pi(y ; \theta) \, \big( \hat{\theta} - \theta \big)

would define the estimator “bias” for each possible “truth” \theta. An unbiased estimator would satisfy \bar{U}(\theta) = 0 for all \theta which is possible only for relatively simple models \pi(y ; \theta).

Once the U^{*} or the individual \bar{U}(\theta) have been evaluated then they can also be used to choose some “best” estimator. Alternatively they can be used to evaluate different choices that lead to different observation models \pi(y ; \theta), such as the number of observations or the kinds of observations. This approach of using hypothetical, ensemble behavior to tune the configuration of a measurement is often referred to as experimental design.

This procedure is so ingrained into frequentist methodology that it’s often taken for granted. Consequently precise terminology for the process of using hypothetical, ensemble behavior to quantify estimator performance is often non-existent. I refer to this procedure as “calibration”, although some reserve “calibration” for the process of tuning estimators based on the output of the procedure. I’ve struggled to find a less ambiguous term for this procedure but so far no luck.

Bayesian “Calibration”

Bayesian inference doesn’t require the choice of estimator, but we might still be curious how the behavior of the posterior distribution varies across possible observations. To be clear some are philosophically opposed to this, restricting all consideration to only real observations, but this kind of analysis is ingrained all over industry and the sciences and I personally am very for it.

The posterior distribution could be dropped right into the frequentist evaluation framework if we could come up with some utility function that consumes not a single estimate but rather an entire probability distribution. For example one might define something that integrates over the entire posterior distribution,

U(\pi(\theta \mid \tilde{y}), \theta) = \int_{\theta}^{\infty} \mathrm{d} \theta \, \pi(\theta \mid \tilde{y}).

Alternatively one could first reduce the posterior distribution to a single estimate, such as the posterior mean or median, and then drop that into the frequentist framework.

One huge advantage of the Bayesian approach is that we have the full power of probability theory on our side. Once we construct \bar{U}(\theta) then we can average the values together instead of just looking at extremes. Averaging, however, requires a choice of probability distribution over \theta and a natural choice is the prior model,

\bar{U}(\theta) = \int \mathrm{d} y \, \pi(y ; \theta) \, U( \pi(\theta \mid y), \theta) \\ \bar{U} = \int \mathrm{d} \theta \, \pi(\theta) \, \bar{U}(\theta),

or altogether

\begin{align*} \bar{U} &= \int \mathrm{d} \theta \, \pi(\theta) \int \mathrm{d} y \, \pi(y ; \theta) \, U( \pi(\theta \mid y), \theta) \\ &= \int \mathrm{d} \theta \, \mathrm{d} y \, \pi(y ; \theta) \, \pi(\theta) \, U( \pi(\theta \mid y), \theta) \\ &= \int \mathrm{d} \theta \, \mathrm{d} y \, \pi(y, \theta) \, U( \pi(\theta \mid y), \theta). \end{align*}

We can estimate this integral using the Monte Carlo method and simulations of model configurations from the prior and data from the corresponding data generating process,

\tilde{\theta}_{s} \sim \pi(\theta) \\ \tilde{y}_{s} \sim \pi(y ; \tilde{\theta}_{s}) \\ \tilde{U}_{s} = U( \hat{\theta}(\tilde{y}_{s}), \tilde{\theta}_{s}).

In fact instead of looking at the average of the U_{s} we can use the samples to quantify the entire distribution of utilities. This can be much more informative. For example by correlating \tilde{U}_{s} and \tilde{y}_{s} one can identify kinds of observations that lead to particularly pathological posterior behavior.

That said the prior model isn’t a necessity! One can average over any distribution of model configurations, for example distributions that concentrate on certain model configurations of interest or encode certain adversarial behaviors.

To summarize: quantifying inferential behavior over an ensemble of hypothetical observations is straightforward in Bayesian inference once we determine a utility function that consumes the entire posterior distribution. Alternatively if we reduce the entire posterior distribution to a point or interval estimate then we can just drop that into a frequentist evaluation. The huge benefit of Bayesian calibration is that it also allows us to incorporate variation in \theta and not be limited to worst case behavior.

“Simulated-Based Calibration”

“Simulation-Based Calibration” or “SBC” was introduced in [1804.06788] Validating Bayesian Inference Algorithms with Simulation-Based Calibration and is only awkwardly related to the above analyses. Unfortunately the name is terribly broad for what the method actually does, which has lead to no end of confusion.

The actual “SBC” method takes advantage of a Bayesian ensemble self-consistency condition that holds for any model,

\pi(\theta) = \int \mathrm{d} \theta' \, \mathrm{d} y \, \mathrm{d} \theta \, \pi(\theta' \mid y) \, \pi(y \mid theta) \, \pi(\theta).

In terms of simulations

\tilde{\theta}_{s} \sim \pi(\theta) \\ \tilde{y}_{s} \sim \pi(y ; \tilde{\theta}_{s}) \\ \tilde{\theta}'_{s} \sim \pi(\theta \mid \tilde{y})

this self-consistently condition implies that the \tilde{\theta}'_{s} should be indistinguishable from prior samples. The actual method is a bit more complicated but essentially it compares the \tilde{\theta}'_{s} to the \tilde{\theta}_{s} in a careful way to construct a histogram that will always be uniform.

That is unless the simulations aren’t generated correctly. For example if the posterior samples \tilde{\theta}'_{s} \sim \pi(\theta \mid \tilde{y}) are erroneous then the SBC histogram will be skewed away from uniform. This provides a way to check for the accuracy of the computational method that generates posterior samples.

Critically if the posterior computation is good then the SBC method will return a null result no matter the ensemble behavior of the posterior distributions. SBC as introduced in that paper has no consideration as to the inferential performance of the model.

To avoid this confusion I refer to studies of the hypothetical ensemble behavior of an estimator or posterior distribution as “inferential calibration” and ensemble studies sensitive to computational problems as “algorithmic calibration”, although I’m pretty sure that no one else uses this terminology which leads to the term “calibration” being horrendously overloaded.

This is especially true in practice. Note that the simulations

\tilde{\theta}_{s} \sim \pi(\theta) \\ \tilde{y}_{s} \sim \pi(y ; \tilde{\theta}_{s})

are used in both the algorithmic calibration “SBC” and the frequentist/Bayesian inferential calibrations that we discussed above. Both consider hypothetical ensemble behavior, but they do it for every different purposes.

Long Story Short

If you want to compare a Bayesian method to mostly frequentist methods then it’s probably easiest to just reduce the posterior distribution to a point estimator and apply the same frequentist calibration to all.

If you want to be a bit more sophisticated about the parameter dependence of the frequentist calibration then you could average the \bar{U}(\theta) over some relevant distribution, such as the prior model.

If you just want to compare different Bayesian models then a utility function that takes into account the entire posterior distribution can be especially useful.


Thanks for the exhaustive treatment! However a minor clarification:

This statement is repeated throughout the literature on SBC, but is incorrect. One can in fact construct procedures that satisfy this condition (that data-averaged posterior equals prior) but fail SBC and procedures that satisfy SBC as commonly used but where data-averaged posterior is not equal to prior.

Instead, SBC relies on the fact that conditional on any specific y, we have \theta and \theta' exchangeable. Since this is mostly tangential to OPs question, I won’t go into detail here and refer those interested to our preprint where we discuss this and some other properties of SBC: (Introduction lays out the different criteria in unified language; high-level discussion of SBC and data-averaged posterior in section 3.5; specific counterexamples in Appendix B, Example 2; associated simulation results at Simple Bernoulli Examples)


Thank you very much for this thorough answer!

1 Like

Just to clarify, the simulation-based calibration idea came from Samantha Cook’s Ph.D. thesis and is described in this article:
The Talts et al. paper has some corrections and improvements to the idea, and we and others (although not Cook, unfortunately) are doing more work in the area.

In particular, this new paper by Martin Modrák, Angie Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Aki Vehtari, and myself, which Martin discussed on the forum here: New preprint on model/algorithm validation with SBC - It is stronger than you thought!

1 Like