I read the wiki page and I am a bit confused. Suppose that in a given model, I have a parameter f, and I set it to a known value f₀, simulate data, and using that data I ran Stan (using the model) and get posterior draws fhat after burn-in. Then in

Δ = (fhat-E[f]) / MCMC-SE ~ N(0, 1)

should I use E[f] = f₀, or the sample average of fhat? Similarly, is MCMC-SE the standard error of fhat, calculated using the effective sample size?

Cook, Gelman and Rubin (2006) [Validation of Software for Bayesian Models…] propose using posterior quantiles of the true value, transformed into a χ² via a unit normal. Is that related, or complementary to the above approach?

Yes, MCMC-SE is the MCMC standard error, which is estimated by dividing the estimated posterior standard deviation by the square root of the estimated effective sample size.

@betanalpha will have to answer the rest about (1), but I can say the technique’s different than Cook et al., which looks at posterior quantiles relative to simulated values, and uses the inverse CDF for a normal to convert them back to what should be a standard normal and then tests whether that is true.

Regarding (1), no. The MCMC Central Limit Theorem that you quote makes no statement about how close the MCMC estimator is to the true value of the parameters.

Plugging data into the likelihood and adding the prior gives you a posterior distribution, and we extract information from that distribution by computing posterior expectations, E[f]. The MCMC Central Limit Theorem says that for certain well-behaving Markov chains the MCMC estimator of a given posterior expectation, f_hat, will be close to the true expectation E[f]. In fact it will be unbiased with standard error MCMC-SE relative to all possible realizations of the Markov chain. But if your model is poorly specified then those posterior expectations could be very far off from the true values and MCMC can’t fix a bad model.

Note that all of this assumes that the MCMC Central Limit Theorem holds, which is not guaranteed in practice. This is why I’m always yelling about divergences and other diagnostics that identify obstructions to the Central Limit Theorem holding in practice.

The only guarantees that we have about calibration of posterior expectations and true parameter values are over ensembles of experiments, which leads to (2). In other words, even if the model is well-specified then there is no guarantee that in any given analysis the posterior captures the true value. But it will over an ensemble of analyses following the procedure in Cook et al.

So effectively to use the heuristic integration testing procedure, I would have to sample from a distribution for which I know E[f]?

Also, what do you mean by

look at Delta = (f_hat - E[f]) / MCMC-SE and compare to the distribution N(0, 1)

since f_hat is a single value from each chain. Should the f_hats from multiple chains be compared to N(0, 1) with a test, say Kolmogorov-Smirnov? Or p-values for single f_hats?

Finally, where can I find these tests in the codebase? I found the test directory in stan-dev, but nothing like the example in the wiki.

The closest thing we have is this repo from Michael:

We’re still working getting this all better integrated into the overall workflow.

This should hold for all the quantities of interest over multiple simulations of data and fit. You could also run something like the Cook-Gelman-Rubin diagnostics.