Hi, I thought this might interest many Stan users: "Simulation-based calibration: Some challenges and directions for future research" on the Statistical Modeling, Causal Inference, and Social Science blog.

It’s following up on the recent StanConnect session organized by Martin Modrák, Hyunji Moon, and Shinyoung Kim.

I hope that the ongoing development of the SBC package helps with some of the concerns.

To be specific:

(b) Coming up with an interpretable measure of miscalibration rather than framing it as a test of the hypothesis of perfect calibration.

A simple approximate solution is to look at the observed coverage of all central posterior intervals. We can then get an approximate uncertainty around the coverage by treating all intervals as independent and using quantiles of the Beta distribution (assuming a uniform prior on coverage here). This is how this may look after running 8 simulations with SBC:

The black line is the observed coverage, the gray area is the 95% credible interval, and the blue line represents perfect calibration. We see there is huge uncertainty remaining: inspecting the underlying numerical values, we see that the 90% posterior interval for `theta` could (as far as we know) contain 40% - 93% of the true values.
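To make the Beta-based uncertainty concrete, here is a minimal sketch for a single interval width. It is not the SBC package's code: the function names are hypothetical, and it assumes you have already counted in how many simulations the central posterior interval contained the true value. With a uniform prior on coverage, `n_covered` hits out of `n_sims` simulations give a Beta(`n_covered` + 1, `n_sims` - `n_covered` + 1) posterior, and the quantiles of that distribution form the credible interval for the coverage.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a),
    so no external library is needed. Fine for a few hundred
    simulations; very large counts would need a different method.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile_int(p, a, b, tol=1e-10):
    """Quantile of Beta(a, b) found by bisection on the CDF."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def coverage_interval(n_covered, n_sims, prob=0.95):
    """Central credible interval for the true coverage.

    Assumes a uniform prior on coverage, so the posterior is
    Beta(n_covered + 1, n_sims - n_covered + 1).
    """
    a = n_covered + 1
    b = n_sims - n_covered + 1
    tail = (1 - prob) / 2
    return beta_quantile_int(tail, a, b), beta_quantile_int(1 - tail, a, b)


# Illustrative counts only (not the numbers behind the figures):
few = coverage_interval(n_covered=7, n_sims=8)
many = coverage_interval(n_covered=63, n_sims=72)
print(few, many)
```

The counts in the usage lines are made up for illustration. The point is only that the interval for the same observed proportion shrinks as the number of simulations grows.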

And this is the same model after running some 70 simulations:

Things have definitely improved!

And here’s the same plot for a different model that is slightly miscalibrated:

So it appears there is miscalibration, but it also isn’t completely terrible. Looking at the numerical values, we see that, for example, for the 90% central credible interval of `sigma` we would expect the actual coverage to be 74% - 85%.

(c) Incorporating this into workflow so that it’s convenient and not computationally expensive.

I just completed a vignette describing my current thoughts on using SBC (via the SBC package) in a model-building workflow. It is the "Small model implementation workflow" vignette on the SBC package site (the above figures are taken from that vignette and from the "Limits of SBC" vignette, which discusses some limitations of SBC as a diagnostic).

I would be very happy to get your feedback on both the coverage visualisation/computation and the workflow thoughts. We are currently hoping to slowly lure a couple of people into trying the SBC package out, so that we can iron out the biggest issues before trying to promote it more actively.