Simulation-based calibration case study with RStan


I created a standalone case study for SBC:

  • Simulation-Based Calibration with Stan and RStan. sbc.pdf (268.1 KB)

I’d love to get feedback before putting it up with the rest of our case studies. Particularly on my less-than-fluent R, on anything that’s not clear in the presentation, and on the hypothesis testing for uniformity of ranks so this can all be automated.

P.S. @avehtari tells me I should be using this confidence bands code, but even after a couple emails and trying to read the paper on which it’s based, I can’t figure out how to turn bands into p-values. All I need are p-values for a bag of ranks for the null hypothesis that the bag of ranks is uniformly distributed.
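For the simplest version of that null-hypothesis test, a chi-squared goodness-of-fit test over binned rank counts gives exactly such a p-value. This is a generic sketch with simulated stand-in ranks, not the confidence-bands code mentioned above:

```r
# Generic sketch (not the bands code referenced above): chi-squared
# goodness-of-fit p-value for the null that a bag of ranks is uniform
# on the integers {0, ..., L}. The ranks here are simulated stand-ins.
set.seed(1234)
L <- 7                                   # illustrative: rank bins are 0:L
N <- 1000                                # illustrative number of ranks
ranks <- sample(0:L, N, replace = TRUE)  # stand-in for real SBC ranks

# Observed counts per rank vs. the uniform expectation N / (L + 1)
counts <- tabulate(ranks + 1, nbins = L + 1)
test <- chisq.test(counts, p = rep(1 / (L + 1), L + 1))
test$p.value  # p-value under H0: ranks uniform on {0, ..., L}
```

Note the chi-squared test only sees bin counts, so it ignores the ordering of the bins; the ECDF-envelope approach discussed below is sensitive to that ordering as well.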


Here’s the link to the multidimensional uniformity test paper

I still don’t understand why you need p-values when you can directly control the false positive rate (in some sense of control); that is, you can choose the alpha level for your test. If you make more tests, you can include the multiplicity adjustment there. But if you really need individual p’s, I have time next week to write you code for the inversion.

Based on a quick look, it seems you are not thinning antithetic chains? Those can have high n_eff and high correlations, messing up uniformity. With the current dynamic HMC, it’s very likely that the chain for lp__ has thetic behavior, while for other parameters the chain may have highly antithetic behavior. For antithetic cases it would be good to thin with an odd lag.
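The odd-lag thinning suggested above amounts to keeping every thin-th draw starting from the first; a minimal sketch with simulated stand-in draws:

```r
# Sketch of odd-lag thinning (assumed setup: `draws` holds one chain's
# draws for a single parameter; these are simulated stand-ins).
set.seed(1)
draws <- rnorm(1000)  # stand-in for one chain's posterior draws
thin <- 3             # an odd lag, to break up antithetic correlation
thinned <- draws[seq(1, length(draws), by = thin)]
length(thinned)  # 334
```

An even lag would pair each kept draw with another at an even distance, where antithetic chains have their strongest negative correlation; an odd lag avoids lining up with that pattern.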


I don’t need the p-values if I can just flag ones that would have a p-value lower than a given alpha threshold. Is that what you’re saying I could do?

You’re right, I’m not.

By an odd lag, you mean something like thin = 3? I’ll have to rethink the logic to make the thinning odd, because I can’t just keep doubling it. But I think you’re saying that no matter what I do, I should always use an odd thinning to deal with the antithetic behavior. As a computer scientist, I like everything to be a power of two :-)


Yes. The code lets you define the alpha threshold for one test. You can adjust that, e.g., using a Bonferroni adjustment for multiple comparisons, and the code will compute the envelope and then just test whether any part of the ECDF (or ECDF difference) is outside the envelope.
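The Bonferroni adjustment mentioned above is just a division; a minimal sketch with illustrative numbers:

```r
# Sketch of the Bonferroni adjustment for running the envelope test on
# several quantities at once (numbers here are illustrative).
alpha <- 0.05    # desired overall false positive rate
n_tests <- 10    # e.g., rank uniformity tested for 10 parameters
alpha_adj <- alpha / n_tests
alpha_adj  # 0.005; pass this as the per-test level to the envelope code
```

Bonferroni is conservative when the tests are positively correlated, but it is safe and trivial to apply.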


Double + 1?


@Bob_Carpenter We have put a decent amount of work into the SBC.R function on GitHub to make it easier for people to do SBC and document the conventions that are needed to make it work. I asked everyone for suggestions on how to improve it if they disagreed with the conventions or the functions but no one said anything. I think it would be better if we could all get on the same page as to what people should be calling to do SBC.


Where precisely?


@avehtari, what I don’t understand about these bands derived from the 2013 paper by Sivan Aldor-Noiman: Aren’t they derived using the CDFs of order statistics of [0,1]-valued (continuous) uniform RVs? So for a fixed rank, say k, isn’t this proposing to use in the ECDF plot the CI for the k’th largest uniform random number out of n, where n is the total number of ranks? Am I missing a simple relationship here, or am I getting this completely wrong?


Yes, which we approximate with a beta distribution for large n. Do you have a better idea? An additional step in the envelope algorithm is to take into account the multiple comparisons, since we test for all k. For MCMC rank plots we need a small change. We’ll provide more info soon.


By no means! I’m just trying to understand.


I was already hoping for something better :)


My goal was explaining SBC, not writing a reference implementation. I found SBC.R here:

Your code’s beyond my R reading comprehension. Could you point out where you make sure you have enough iterations after thinning?

Why is there Pareto K stuff in there? Is that part of SBC or just some other diagnostic for convergence that’s layered onto SBC?


The idea is to output unthinned draws and thin when plotting the histogram. That way you can change the thin argument depending on whether the chain is antithetic or thetic. The Pareto k is optional, but if you put log_lik into the generated quantities of the Stan program, the print method will tell you the proportion greater than 0.7, etc.
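The "store unthinned, thin at analysis time" idea can be sketched as follows; the names and numbers are illustrative stand-ins, not taken from SBC.R:

```r
# Sketch of storing unthinned draws and choosing the thin lag only when
# computing the rank statistic (all values here are simulated stand-ins).
set.seed(2)
theta_draws <- rnorm(999)  # unthinned posterior draws for one dataset
theta_true <- 0.5          # the prior-simulated parameter value
thin <- 3                  # chosen at analysis time; odd for antithetic chains
kept <- theta_draws[seq(1, length(theta_draws), by = thin)]
rank_stat <- sum(kept < theta_true)  # SBC rank in 0:length(kept)
```

Because the thin lag is applied only here, re-running with a different lag costs nothing: no refitting is required.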


@avehtari, following up on the short discourse regarding CIs, just to make sure I see where the analytic “difficulty” is located:

If we denote by \mathcal{C}_k the number of (integer) ranks that are less than or equal to k (integer), with 0 \leq k \leq L, don’t we have, for integer \ell with 0\leq \ell \leq N, under the null hypothesis that the N ranks are i.i.d. uniform on the integers \{0,1,\dots, L\}:

\mathbb{P}[\mathcal{C}_k \leq \ell ] = I_{1-\frac{k+1}{L+1}}(N-\ell, 1+\ell)

where N is the number of synthetic datasets generated as part of the SBC algorithm and L the number of (uncorrelated) posterior samples generated, given a synthetic dataset. I is the regularized incomplete beta function. This follows from the cumulative distribution function of a Binomial with number of trials N and success probability (k+1)/(L+1), evaluated at \ell.

Now, the empirical cumulative distribution function (ECDF) at rank k, denoted by \mathcal{E}_k\in[0,1]\subseteq \mathbb{R}, is distributed as \mathcal{C}_k/N; thus

\mathbb{P}[\mathcal{E}_k\leq \frac{\ell}{N}] = \mathbb{P}[\mathcal{C}_k \leq \ell ] = I_{1-\frac{k+1}{L+1}}(N-\ell, 1+\ell)

In theory we could use this to get the intended CI under the null hypothesis that the N ranks are uniformly distributed over the integers \{0, 1,\dots,L\}.

For each 0\leq k \leq L we need to find integers \ell_k \geq \ell_k' such that (say)

\mathbb{P}[\frac{\ell'_k}{N}\leq \mathcal{E}_k] = 0.05 \Leftrightarrow \mathbb{P}[\frac{\ell'_k}{N}\geq \mathcal{E}_k] = 0.95


\mathbb{P}[\frac{\ell_k}{N}\geq \mathcal{E}_k] = 0.05

to get a 90\% CI, for each k. So for the first case, couldn’t we start at \ell_k'=N and stepwise decrease \ell_k' by one until we find that \mathbb{P}[\frac{\ell'_k}{N}\geq \mathcal{E}_k]\leq 0.95? Similarly, for the second case, we would start at \ell_k=0 and increase by one until the first time we have \mathbb{P}[\frac{\ell_k}{N}\geq \mathcal{E}_k] \geq 0.05?
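The stepwise search described above is straightforward to sketch in R for a single k, with the Binomial CDF playing the role of \mathbb{P}[\mathcal{C}_k \leq \ell] from the derivation (all numbers are illustrative):

```r
# Sketch of the stepwise search for the pointwise interval at one
# fixed k (illustrative numbers; pbinom is P[C_k <= l] under H0).
N <- 100; L <- 9; k <- 3
p <- (k + 1) / (L + 1)

# Upper limit: smallest l_up with P[C_k <= l_up] >= 0.95
l_up <- 0
while (pbinom(l_up, N, p) < 0.95) l_up <- l_up + 1

# Lower limit: smallest l_lo with P[C_k <= l_lo - 1] <= 0.05,
# i.e. P[C_k >= l_lo] >= 0.95, found by stepping down from N
l_lo <- N
while (l_lo > 0 && pbinom(l_lo - 1, N, p) > 0.05) l_lo <- l_lo - 1

c(l_lo, l_up)  # pointwise 90% interval for C_k (divide by N for E_k)
```

In practice the stepwise loops can be replaced by a direct call to the binomial quantile function qbinom, which does the same search internally.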


Thanks for the very clear presentation.

It seems you are not certain of this? I think what you present is right.

cumulative distribution function of the beta distribution

and the envelope uses the beta distribution, but we were just ignoring the discreteness of C_k/N.

The difficult part comes when want examine all k at the same time. Your derivation gives the interval for one k independently, but C_k’s are not independent. It would be great to have non-simulation based solution which would take into account the dependency between C_k’s. Due to the existence of simulation based approaches like in Aldor-Noiman paper, I’ve assumed that there is no easier solution.