Model Checking/Validating Best Practices

Hi All,

I want to get a discussion going on how Stan users on the forum check and validate their models, and on best practices for doing so. I know this can be case-specific and that information can be found in the various user manuals, but I’m still curious.


At a minimum, a good model should produce posterior estimates with \hat{R} \leq 1.01 and a large effective sample size (ESS). SE_mean (the Monte Carlo standard error of the posterior mean) should also ideally be near zero, indicating that the simulation has been run long enough. If not, the number of iterations per chain can be increased from the default of 2000 to, say, 2500.
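To make the \hat{R} check concrete, here is a minimal sketch of the classic split-\hat{R} computation in plain Python. It is an illustration only: recent Stan versions report a rank-normalized variant, and the function name `split_rhat` and the simulated chains are assumptions made for this example, not Stan output.

```python
import random
import statistics


def split_rhat(chains):
    """Classic split R-hat for a list of equal-length chains (lists of floats)."""
    # Split every chain in half so that drift within a chain is also detected.
    half = len(chains[0]) // 2
    halves = []
    for chain in chains:
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])

    n = half
    means = [statistics.fmean(h) for h in halves]
    variances = [statistics.variance(h) for h in halves]

    within = statistics.fmean(variances)        # W: mean within-chain variance
    between = n * statistics.variance(means)    # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


rng = random.Random(1)
# Four simulated chains that agree with each other.
good = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four simulated chains where one is stuck in a different region.
bad = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 0.0, 0.0, 3.0)]

print("R-hat, well-mixed chains:", round(split_rhat(good), 3))
print("R-hat, one chain shifted:", round(split_rhat(bad), 3))
```

The first value should sit very close to 1, while the second should be far above the 1.01 threshold.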

Chain mixing is checked visually via traceplots of parameter estimates. Plots should look like “fuzzy caterpillars”.

In this 2017 case study, Michael Betancourt walks through a robust statistical workflow using Stan:

which also includes examining aspects of the HMC sampler itself, such as the tree depth, E-BFMI (Energy Bayesian Fraction of Missing Information), and divergences.

Finally, model reparameterization may be needed to fix common issues.


Model validation typically consists of posterior predictive checks, in which simulated values are drawn from the fitted model and compared with the observed data. The adequacy of the priors is assessed analogously via prior predictive checks.
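As a rough sketch of what a posterior predictive check involves, the example below uses a toy Bernoulli model with a Beta(1, 1) prior, so the posterior is available in closed form and no sampler is needed. The data, the test statistic `n_switches`, and the function names are all made up for illustration.

```python
import random


def n_switches(seq):
    """Test statistic: number of changes between consecutive outcomes."""
    return sum(a != b for a, b in zip(seq, seq[1:]))


def posterior_predictive_pvalue(y, n_rep=2000, seed=0):
    """Fraction of replicated data sets whose statistic is >= the observed one."""
    rng = random.Random(seed)
    n, s = len(y), sum(y)
    t_obs = n_switches(y)
    exceed = 0
    for _ in range(n_rep):
        # Posterior draw under a Beta(1, 1) prior (conjugate update).
        theta = rng.betavariate(1 + s, 1 + n - s)
        # Replicated data set simulated from the model at that draw.
        y_rep = [1 if rng.random() < theta else 0 for _ in range(n)]
        if n_switches(y_rep) >= t_obs:
            exceed += 1
    return exceed / n_rep


# Made-up, strongly clustered binary data: an i.i.d. model should struggle here.
y = [0] * 8 + [1] * 7 + [0] * 5
p_value = posterior_predictive_pvalue(y)
print("observed switches:", n_switches(y))
print("posterior predictive p-value:", p_value)
```

A p-value very close to 0 or 1 says the replicated data rarely look like the observed data with respect to the chosen statistic. Here that would point at the independence assumption rather than at the estimate of the success probability.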

Simulation-based calibration (SBC) can further shed light on model performance.
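The core SBC loop can be sketched as follows, again with a toy Beta-Binomial model. Exact posterior draws stand in for the sampler being checked, which is an assumption made so the example stays self-contained; in a real check those draws would come from Stan. The name `sbc_ranks` and all settings are hypothetical.

```python
import random


def sbc_ranks(n_sims=1000, n_obs=10, n_draws=19, seed=0):
    """Rank of each prior draw among posterior draws fitted to its own simulated data."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta = rng.betavariate(1, 1)                          # 1. draw from the prior
        y = sum(rng.random() < theta for _ in range(n_obs))    # 2. simulate data
        # 3. posterior draws (exact here; from the sampler under test in practice)
        draws = [rng.betavariate(1 + y, 1 + n_obs - y) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))            # 4. rank statistic
    return ranks


ranks = sbc_ranks()
# With 19 posterior draws the rank takes values 0..19; count each one.
counts = [ranks.count(r) for r in range(20)]
print(counts)
```

If the posterior computation is calibrated, the ranks should look uniform across the 20 possible values. Systematic shapes in the histogram (a slope, a U, a hump) suggest bias or mis-estimated posterior width.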


This is only a summary, but I’m wondering what others think. Care to weigh in?


In addition to @betanalpha’s workflow case study there’s a later paper on workflow by Gelman et al.:

I would not recommend inspecting trace plots by eye. In addition to high ESS, what you want to see is that doubling the length of Markov chains doubles the ESS.
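A minimal sketch of that doubling check, assuming a simple autocorrelation-based ESS estimator (truncated at the first non-positive autocorrelation) rather than the estimator Stan actually uses. The AR(1) chain is a stand-in for real sampler output, and `ess` and `ar1_chain` are names invented for this example.

```python
import random
import statistics


def ess(chain):
    """Crude ESS: n / (1 + 2 * sum of autocorrelations up to the first non-positive one)."""
    n = len(chain)
    mean = statistics.fmean(chain)
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    tau = 1.0
    for lag in range(1, n):
        rho = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau


def ar1_chain(n, phi, seed):
    """Simulated autocorrelated chain standing in for MCMC output."""
    rng = random.Random(seed)
    x = 0.0
    chain = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


chain = ar1_chain(4000, phi=0.5, seed=0)
ess_half = ess(chain[:2000])   # ESS after the first 2000 iterations
ess_full = ess(chain)          # ESS after doubling the chain length
print("ESS (2000 iterations):", round(ess_half))
print("ESS (4000 iterations):", round(ess_full))
print("ratio:", round(ess_full / ess_half, 2))
```

For a well-behaved chain the ratio should be roughly 2. A ratio well below that suggests the chain has not yet settled into stable behavior.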

I’d break the validation down into validating the algorithm’s calibration on simulated data (SBC), evaluating the prior (prior predictive checks), evaluating the fit to data (posterior predictive checks), and fit to new data (cross-validation). There’s a part of the User’s Guide that goes over how to code all of these in Stan.

I think the general approach around here to internal validation (e.g. some form of cross-validation) probably loosely follows the loo package (“Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models”) and its associated papers.

That old case study is long out of date! More importantly it considers only computational problems and not modeling problems. My up-to-date writing is at Writing - and for this topic I’d recommend taking a look at
GitHub - betanalpha/mcmc_diagnostics: Markov chain Monte Carlo general, and Hamiltonian Monte Carlo specific, diagnostics for Stan
Identity Crisis
Towards A Principled Bayesian Workflow
with emphasis on the last.

Recall that in Bayesian inference we use our domain expertise to motivate a full Bayesian model,

\pi(y, \theta) = \pi(y \mid \theta) \, \pi(\theta),

plug in observed data \tilde{y} to obtain a posterior distribution,

\pi(\theta \mid \tilde{y}) \propto \pi(\tilde{y}, \theta),

and then extract approximate insights from the posterior distribution through expectation value estimates,

\hat{f} \approx \int \mathrm{d} \theta \, \pi(\theta \mid \tilde{y}) \, f(\theta).
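For concreteness, in the Monte Carlo setting (a standard result, stated here under the assumption that a central limit theorem holds for the draws \theta_1, \ldots, \theta_N) the estimate is the empirical average

\hat{f} = \frac{1}{N} \sum_{n=1}^{N} f(\theta_n),

and its standard error is approximately

\text{MCSE}[\hat{f}] \approx \sqrt{ \frac{ \mathrm{Var}_{\pi}[f] }{ \text{ESS}[f] } },

where \mathrm{Var}_{\pi}[f] is the posterior variance of f and \text{ESS}[f] is the effective sample size for f, which reduces to N for independent draws.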

The immediate challenge in implementing Bayesian inference is computational – how well does the estimate \hat{f} approximate the true expectation value? If our estimates are too inaccurate then we will be effectively working with a skewed posterior distribution, and any problems with those inferences could be due to the skew rather than any inherent issues in our modeling assumptions.

Consequently the first step is to quantify the error in our posterior expectation value estimates. Exactly how we do this depends on the estimation method we employ – for example \hat{R} is one diagnostic that can identify pathological behavior in Markov chain Monte Carlo estimators. In general diagnosing problems in Markov chain Monte Carlo is much more subtle than just checking a few diagnostics – see for example the above link as well as Markov Chain Monte Carlo Basics.

Once we trust our posterior computation, we can tackle the adequacy of the modeling assumptions inherent to the choice of Bayesian model \pi(y, \theta). In particular we can compare how well the posterior distribution recovers features of the observed data through posterior retrodictive checks. Because we’re comparing to the data that we’ve already used, we’re retrodicting here, not predicting. Posterior predictive checks describe comparisons to held-out data not used to inform the posterior distribution.

The difficulty here is coming to terms with the fact that our model will never be perfect, but at the same time that our observations will only ever offer limited resolution of the system being observed. In other words we have to determine which features of the system are relevant and then design summaries that can focus posterior retrodictive comparisons on those behaviors. In my experience this means that the commonly recommended automated checks, which by construction cannot be tuned to the specifics of any particular analysis, have limited utility in practice.

Anyways this is all discussed in much more depth in Towards A Principled Bayesian Workflow so I’d recommend starting there. The workflow I suggest is also applied over and over again in Part III and the Case Studies on Writing - so you can also see its benefit in action.