ADMIN EDIT: Moved this to a new topic, as I believe this is of general interest (by @martinmodrak)
I always wanted to discuss the need for a narrower prior in SBC compared to general modeling situations. This results from the fact that SBC does not include any real-world data that could mitigate the effect of the prior.
Do you think this is a disadvantage of SBC or a measure that makes calibration conservative but more robust?
Also, what is your opinion on leaving out fits with bad diagnostics? These would correspond to extreme prior parameter draws and simulated y samples in most cases. It seems somewhat practical, but I am worried this intervention is too subjective and we do not have a clear standard for which fits to leave out.
Each statistical computation benchmark is a specific posterior distribution with validated posterior expectation values, not a generic model.
SBC tests the consistency of a computational method over an entire ensemble of posterior distributions generated from prior predictive data, which (usually, although not always) requires that the computational method fits all of those posteriors well. If the priors are wide, then more and more posterior behaviors will have to be accommodated, which may frustrate the given computational method even if some of the posterior distributions fit fine.
I think this is an important issue in using SBC in practical applications.
In my current thinking, the underlying problem is that we are unable to express good priors. In most cases when I see a dataset in my SBC that generates divergences or has other issues, I can say "this is a priori not a reasonable dataset" (e.g. all responses being 0 in a Poisson model). In this sense my prior does not pass a prior predictive check.
However, when I try to enforce that only "good" datasets are generated by tightening the priors on parameters, the priors often end up unrealistically narrow. This hints that my prior information cannot be expressed by putting independent priors on each parameter. And while I know that some people are working on methods for expressing joint priors on all parameters, I think this is currently pretty hard to do well. So I know my prior is bad, but I also don't know how to easily make it better. I am also often quite sure that the wider prior - despite giving trouble to SBC - is enough regularization to fit real datasets. What I am left with are heuristics and shortcuts - which I don't think is bad, but one has to be aware that they are no longer on such solid ground.
There are multiple available heuristics; I've personally tried:
Use the wider prior for SBC and ignore the divergences. If most fits are OK and the issues are not very serious (e.g. Rhat and ESS are good, but I have a single-digit number of divergences), I usually just ignore it. Even those small problems will most certainly introduce slight bias in the SBC, but my hunch is that the bias will be much smaller than the resolution I have for any realistic number of simulated datasets I am able to fit. However, when the problems are more common or serious, the SBC may be heavily affected.
Use an unrealistically narrow prior for the SBC that makes the problems in fitting go away. I have to rely on the assumption that good computation with the narrow prior translates to good computation with wider priors (at least for the cases where my diagnostics are good). Usually I don't think this is such a problematic assumption - many bugs, bad parametrizations, etc. are likely to manifest even with a narrower prior. Also, passing SBC with the narrow prior is a much stronger guarantee of the model working than not doing SBC at all.
Rejection sampling. I formulate my a priori knowledge about what "good" data looks like as a binary condition (e.g. that both the minimum and maximum categories are observed in an ordinal model, or a minimal variance of the total response). I then reject all generated datasets that fail the check and generate new ones. This procedure can be understood as adding a (probably weird) joint prior on all the parameters. EDIT: Turns out this does not invalidate SBC (see @nhuurre's post below). My original worry was that the Stan model now has a different prior than the one used to generate the data, so the SBC should fail given enough runs; however, when the rejections are not frequent, I've found that the resulting SBC often looks very good. In any case, tracking how many rejections you need to generate a single dataset is a useful diagnostic (a minimal sketch of such a generator follows below).
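For concreteness, here is a minimal base-R sketch of this rejection-sampling heuristic for a toy Poisson model; the model, the prior width, and the is_good_dataset condition are made up for illustration and are not tied to any package:

# Toy generator: the log-rate has a wide prior; datasets that are all zeroes
# are considered a priori unreasonable and get rejected.
generate_one_dataset <- function(N = 20, max_tries = 1000) {
  is_good_dataset <- function(y) any(y > 0)   # the binary "good data" condition
  for (tries in 1:max_tries) {
    log_lambda <- rnorm(1, mean = 0, sd = 3)  # wide prior on the log rate
    y <- rpois(N, exp(log_lambda))            # prior predictive draw
    if (is_good_dataset(y)) {
      # n_rejections is a useful diagnostic: if it is large, the prior and
      # the "good data" condition are in strong tension.
      return(list(log_lambda = log_lambda, y = y, n_rejections = tries - 1))
    }
  }
  stop("Could not generate an acceptable dataset in ", max_tries, " tries")
}

set.seed(1234)
sims <- replicate(100, generate_one_dataset(), simplify = FALSE)
summary(sapply(sims, function(s) s$n_rejections))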
It actually never occurred to me to just drop the non-converging runs from the final diagnostics as @hyunji.moon suggests - that could also be sensible (especially with the new visualisations that Aki mentioned recently that don't rely on binning). (And as @nhuurre points out below, this should not invalidate SBC for the converging subset.)
In practice, I've often found myself going through all of the steps: adding rejection sampling, then narrowing down the prior somewhat to avoid an unreasonable number of rejections, and then tolerating a few percent of fits that don't converge.
Very interested in what other SBC practitioners have to say about this.
This is something that has also concerned me. Generally I have rejected fits with < 0.01 effective samples per sample, simply because even though they could be used (i.e. with more samples and a corresponding thinning factor), the required computational resources make it impractical. However, as @martinmodrak points out, this kind of rejection sampling modifies the priors used in SBC, so I agree that it should not be something that happens too frequently.
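For concreteness, a rough sketch of how such a cutoff could be checked with cmdstanr and the posterior package (assuming fit is a CmdStanMCMC object from model$sample(); the 0.01 threshold is just the one mentioned above):

library(cmdstanr)
library(posterior)

# Smallest bulk ESS across parameters, divided by the total number of
# post-warmup draws, i.e. "effective samples per sample".
ess_per_draw <- function(fit) {
  s <- fit$summary()  # includes an ess_bulk column via the posterior package
  min(s$ess_bulk, na.rm = TRUE) / ndraws(fit$draws())
}

# Flag (or reject) fits whose worst parameter mixes too slowly:
# keep_fit <- ess_per_draw(fit) >= 0.01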
Thanks for the clarification, it now makes sense to me why narrower priors are needed for the models even though the fits are well-behaved for specific posteriors.
Problem-causing fits usually start from extreme values of \tilde{y}. The following plot from this issue shows that simulated datasets (green, y_sim#7 and orange, y_sim#20) are sometimes at least a hundred times bigger than the real-world dataset (red).
Rejecting them is computationally beneficial, as I observed that those extreme fits (y_sim#7 for instance) could take 100 times longer, but I think we should keep in mind that this additional action hurts the theoretical support from the SBC uniformity proof.
Rejection sampling doesn't have to be a problem.
Let f(y) be the probability that the simulated dataset y is rejected (usually a 0-1 function if you have a clear idea what a "bad" dataset looks like, but could be probabilistic if you're relying on finicky diagnostics). The important numbers are the probability of rejection for parameter \theta,

R(\theta) = \int f(y) \, \pi(y \mid \theta) \, \mathrm{d}y,

and the overall rejection rate,

R = \int R(\theta) \, \pi(\theta) \, \mathrm{d}\theta.

After rejection, the joint density of the accepted pairs (\theta, y) is

\pi_{\mathrm{acc}}(\theta, y) = \frac{(1 - f(y)) \, \pi(y \mid \theta) \, \pi(\theta)}{1 - R},

so the posterior implied by an accepted dataset is proportional to (1 - f(y)) \, \pi(y \mid \theta) \, \pi(\theta) / (1 - R). And since the factor (1 - f(y)) / (1 - R) is a constant in \theta for a fixed y, the overall posterior is the same whether we take rejection into account or not.
Also I did check empirically with a simple example and indeed, despite heavy rejections the ranks are uniform… So my intuition was once again proven bad :-) Now I can't explain why I thought it was a problem in some of my SBC runs, where heavy rejections were accompanied by bad histograms…
Anyway, thanks for clarifying!
Little example to test
library(SBC)
library(cmdstanr)
stan_code <- "
data {
  int<lower=0> N;
  real y[N];
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 2);
  y ~ normal(mu, 1);
}
generated quantities {
  real mu_ = normal_rng(0, 2);
  real y_[N];
  for(n in 1:N) {
    y_[n] = normal_rng(mu, 1);
  }
}
"
n_datasets <- 1000  # total number of parameter values simulated from the prior
N <- 10
thin <- 3
data <- list(N = N, y = as.vector(1:N))
model <- cmdstan_model(write_stan_file(stan_code))
sbc_obj <- SBCModel$new(name = "simple_test", stan_model = model)
hyperpriors <- list("mu" = function() { rnorm(1, 0, 2) })
# draw mu from the prior, then simulate a dataset for each draw
theta_prior <- sbc_obj$sample_theta_tilde(list("mu"), n_datasets, hyperpriors)
sampled_y <- sbc_obj$sample_y_tilde(theta_prior, data = data)
# fit the posterior for each simulated dataset and compute SBC ranks
theta_post <- sbc_obj$sample_theta_bar_y(sampled_y, data = data, pars = list("mu"), fit_iter = 200 * thin)
rank <- calculate_rank(theta_prior, theta_post, thin = thin)  # thinning factor of 3
plot_hist(rank, "mu")
SBC is a special type of Bayesian calibration which considers posterior behavior over the entire prior predictive distribution. If the prior predictive distribution includes extreme values that you don't think are reasonable, then some part of your model is in tension with your domain expertise (otherwise you wouldn't be able to say that anything is "extreme"). In this case the tension could arise from an overly-diffuse prior model that admits extreme model configurations or it could arise from an overly-diffuse observational model.
Truncating the observational model at a fixed threshold would remove the extreme observations and introduce only a modified normalizing constant, which can largely be ignored, including in SBC. But truncations are a rather crude way to modify the observational model and can obfuscate the important reasons why such extreme observations were supported in the first place.
Long story short: "extreme simulations" are not a problem inherent to SBC but rather to modeling. There's no point in evaluating Bayesian calibrations with respect to a model until you've investigated the consequences of the assumptions encoded in that model. This is why prior checking comes before the evaluation of Bayesian calibrations in my recommended workflow, Towards A Principled Bayesian Workflow.
A limitation of SBC is the requirement of a generative model: the prior \pi(\theta) should be a proper distribution and one should be able to sample from the joint model \pi(\theta, y). These conditions apply in many but not all Bayesian models in practice. Another concern is the behavior when the prior is very weak relative to the likelihood, so that a huge number of replications would be needed to test the computation in the relevant region of parameter space. This could cause problems if the computation works well for some parameter values but not for others.
As discussed in our workflow paper, it is common in practice to use weak priors, out of some combination of respect for non-Bayesian tradition and concern about missing important data features by oversmoothing. When priors are very weak, prior predictive simulation can result in datasets that result in computational challenges (for example, discrete data that are nearly all zeroes), which in turn can cause problems with SBC even if the algorithm in question would perform well on more realistic datasets. More generally, there is a tension between the goal of using SBC (or any validation procedure) to test a general algorithm, and the goal of checking that it works well for a particular example.
I'm quite late to this and my question is probably elementary, but what are typical models for which "too wide" priors lead to computational problems (for some but not all simulations)? Or what is the simplest one?
I'm sure there are better and more nuanced answers, but one answer is "literally any model". Any model will eventually encounter numerical issues if the prior is too wide.
Here's a quick example: Poisson regression where there's a wide prior on the intercept. Depending on the data and the other parts of the model, if the intercept is low enough, the model will with high probability produce data that are all zeroes, and then it's hard to estimate anything.
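To make this concrete, here is a small prior-predictive simulation in R; the normal(0, 5) prior on the intercept and N = 20 are illustrative choices, not taken from any particular model in this thread:

set.seed(42)
N <- 20          # observations per simulated dataset
n_sims <- 1000   # number of prior predictive draws

all_zero <- replicate(n_sims, {
  intercept <- rnorm(1, mean = 0, sd = 5)   # wide prior on the log-scale intercept
  y <- rpois(N, lambda = exp(intercept))    # intercept-only Poisson regression
  isTRUE(all(y == 0))
})

# Proportion of simulated datasets with no events at all; such datasets
# carry almost no information about the intercept beyond "lambda is small"
# and tend to cause trouble for the fits in an SBC run.
mean(all_zero)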
Just realized one thing that is probably worth adding to the thread for people reading it in the future, despite it possibly being obvious for many:
Rejection sampling is OK (i.e. it only changes the normalizing constant of the posterior) if the probability of rejection f(y) depends solely on the observed data y. If there is a dependence on the parameters \theta (beyond the indirect influence \theta has on y), the change to the posterior is no longer just a constant (and SBC will signal an issue).
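A small self-contained illustration of the contrast, using a conjugate normal model so exact posterior draws are available without MCMC; the model, acceptance rules, and sample sizes are chosen purely for illustration:

set.seed(123)

# Model: mu ~ normal(0, 2), y_i ~ normal(mu, 1), i = 1..N.
N <- 10; n_sims <- 2000; n_post <- 99

sbc_ranks <- function(accept) {
  ranks <- integer(0)
  while (length(ranks) < n_sims) {
    mu_tilde <- rnorm(1, 0, 2)             # draw from the prior
    y <- rnorm(N, mu_tilde, 1)             # prior predictive dataset
    if (!accept(mu_tilde, y)) next         # rejection step
    post_prec <- 1 / 2^2 + N               # exact conjugate posterior
    post_mean <- sum(y) / post_prec
    post_draws <- rnorm(n_post, post_mean, sqrt(1 / post_prec))
    ranks <- c(ranks, sum(post_draws < mu_tilde))
  }
  ranks
}

# Rejection based only on y: the rank histogram stays uniform.
ranks_y <- sbc_ranks(function(mu, y) mean(y) > 0)
# Rejection based on the parameter itself: the ranks are no longer uniform.
ranks_theta <- sbc_ranks(function(mu, y) mu > 0)

hist(ranks_y, breaks = 0:100 - 0.5)
hist(ranks_theta, breaks = 0:100 - 0.5)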
This is very useful info!!! One of the biggest challenges for me with SBC is that the weak priors generate many "near-outlier" data values that NUTS struggles with, but inference is perfectly fine with more reasonable data values.
Just to double check, I can run Simulation-Based Calibration by:
1. Running N simulations
2. Rejecting any simulations based on an arbitrary standard f(y) => accept/reject, e.g. rejecting any simulation with y < 0 or y > 10
3. Running inference on each of the remaining, non-rejected simulations
4. Analyzing the resulting rank histograms among the non-rejected simulations
I'm curious - the original SBC paper said:
A popular alternative… is to define a ground truth \tilde{\theta}, simulate data from that ground truth, \tilde{y} \sim \pi(y \mid \tilde{\theta}), and then quantify how well the computed posterior recovers the ground truth in some way. Unfortunately this approach is flawed…
Just to make sure I understand the subtleties here… selecting simulations based on \theta is BAD and invalidates SBC, but selecting simulations based on y is OK and does not invalidate the SBC results?
I assume that limiting simulations to a subset of y values limits the assurance of computational accuracy to just that subset of y.
That is IMHO reasonable. I am currently working with others on a revamp of the SBC package for R to make all of this easier (expect some announcements soon).
Yes. Although I think the original quote is concerned not with selecting simulations based on \theta, but with the fact that simulating datasets for any single \theta value gives only limited information, and to get useful guarantees one needs to average over the whole prior for \theta (e.g. simulate from the prior as in SBC).
Yes, exactly. A practical corollary is that you can in fact reject simulations that produced divergences (or other fitting problems), as the probability of a divergence is only a function of y. This way, you'll still get a valid calibration check for all the datasets that do not produce divergences - which is often all the datasets we care about, because if you get a divergence, you know you can't really trust the estimates anyway.
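For completeness, a tiny sketch of what this filtering looks like in R. The data frame below is entirely made up for illustration (hypothetical columns rank_mu and n_divergent standing in for real SBC ranks and sampler diagnostics); the point is only that the subsetting happens on a per-dataset diagnostic, not on the parameters.

# Hypothetical per-simulation summary: one row per simulated dataset,
# with the SBC rank for mu and the number of divergent transitions
# reported by the sampler (stand-in values, not real results).
sbc_summary <- data.frame(
  sim_id = 1:1000,
  rank_mu = sample(0:99, 1000, replace = TRUE),
  n_divergent = rpois(1000, 0.05)
)

# Keep only the simulations that fit cleanly; the remaining ranks are a
# valid calibration check for the subset of datasets without divergences.
clean <- subset(sbc_summary, n_divergent == 0)
hist(clean$rank_mu, breaks = 0:100 - 0.5)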