I think that we can take the opportunity of these new diagnostics to better communicate the intent of the MCMC and differentiate that intent from the diagnostics used to check the validity of the MCMC towards that intent.
To avoid confusion and make sure we’re all on the same page, let’s be very explicit. We run Stan over a sample space X = \mathbb{R}^{N} defined by the given program, where the $n$th variable in the parameters block defines a one-dimensional projection operator which I’ll denote \varpi_{n} : X \rightarrow \mathbb{R}. The transformed parameters and generated quantities blocks define additional random variables, but for simplicity let’s ignore those for now.
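To make the projection-operator picture concrete, here is a minimal numpy sketch; the draws array and the `varpi` helper are my own illustrative constructions, not Stan internals:

```python
import numpy as np

# Hypothetical posterior sample: S = 1000 draws over X = R^N with N = 3.
rng = np.random.default_rng(0)
draws = rng.normal(size=(1000, 3))

def varpi(n, draws):
    """Projection operator varpi_n : X -> R, applied draw-by-draw."""
    return draws[:, n]

# Pushforward sample for the first parameter.
mu_draws = varpi(0, draws)
print(mu_draws.shape)  # (1000,)
```

Each summary quantity below is then a property of one of these one-dimensional pushforward samples.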
The default output considers each projection operator as a random variable and reports
- The MCMC estimator of its expectation value (equivalently, the mean of the corresponding pushforward distribution).
- The MCMC estimator for the corresponding standard deviation.
- The MCMC estimator for the effective sample size.
- The MCMC estimator for the MCMC standard error, assuming a central limit theorem.
as well as
- The potential scale reduction factor.
- MCMC estimators for various quantiles.
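Under the hood, the first set of reports amounts to something like the following single-chain sketch, using a simplified initial-positive-sequence ESS rather than Stan's actual multi-chain estimator:

```python
import numpy as np

def default_report(x):
    """Mean, standard deviation, a crude ESS, the CLT-based MCMC standard
    error, and quantiles for one chain of draws x (a simplified sketch)."""
    s = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    xc = x - mean
    # Sum lag autocorrelations until the first nonpositive one.
    rho_sum = 0.0
    for lag in range(1, s // 2):
        rho = np.dot(xc[:-lag], xc[lag:]) / ((s - lag) * x.var())
        if rho <= 0:
            break
        rho_sum += rho
    ess = s / (1 + 2 * rho_sum)
    mcse = sd / np.sqrt(ess)  # standard error of the mean, assuming a CLT
    quantiles = np.quantile(x, [0.05, 0.25, 0.5, 0.75, 0.95])
    return mean, sd, ess, mcse, quantiles

rng = np.random.default_rng(1)
mean, sd, ess, mcse, quantiles = default_report(rng.normal(size=4000))
```

Note how the MCSE line is the only place a CLT assumption enters; everything else is just a pushforward summary.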
The first set of reports are well-defined in the context of a CLT – they’re everything that’s needed to estimate the mean and the standard error of that estimator. Outside of the context of a CLT, however, these values become less relevant.
The second set of reports start adding on additional structure. The potential scale reduction factor estimates a necessary (but not sufficient) condition for a CLT to exist (and have kicked in). The quantiles provide additional information about pushforward distribution of the projection operator for each parameter, albeit without any quantification of standard error in the estimator.
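For concreteness, the potential scale reduction factor compares between-chain and within-chain variation; here is a minimal split-Rhat sketch (a simplified form of the standard formula, not Stan's implementation):

```python
import numpy as np

def split_rhat(chains):
    """Split each chain in half and compare between- and within-half
    variances (simplified sketch of the standard split-Rhat formula)."""
    m, s = chains.shape                          # m chains, s draws each
    n = s // 2
    halves = chains.reshape(2 * m, n)            # split each chain in half
    w = halves.var(axis=1, ddof=1).mean()        # within-chain variance W
    b = n * halves.mean(axis=1).var(ddof=1)      # between-chain variance B
    var_plus = ((n - 1) * w + b) / n             # marginal variance estimate
    return np.sqrt(var_plus / w)

rng = np.random.default_rng(2)
well_mixed = rng.normal(size=(4, 1000))
rhat = split_rhat(well_mixed)  # close to 1 for well-mixed chains
```

Chains that haven't found the same typical set inflate the between-chain term and push this ratio above 1.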
I think that one of the things that makes this discussion tricky is that the current reports convolve several different goals, so let’s step back and ask what information we actually want to report. I believe there are four things of interest:
1 - Does the Markov chain enjoy a central limit theorem, and hence will effective sample sizes and MCMC standard errors make sense in general?
2 - Does a specific random variable have a finite mean and variance, so that the CLT applies to its expectation value?
3 - What are the estimates of its expectation value and the corresponding standard error, if (1) and (2) look good?
4 - What other properties of the pushforward distribution might we want to look at other than the mean, especially if (2) doesn’t look good?
Divergences, energy fraction of missing information, and the ensemble of Rhats (new and old) and diagnostic effective sample sizes (as @avehtari has denoted them) all fall into (1). The bulk and tail diagnostic effective sample sizes could also fall into (2), as could the other experimental diagnostics like khat estimators. The standard effective sample size, the MCMC mean estimator, and its standard error all belong in (3), and quantile estimators along with any standard errors belong in (4).
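This grouping could be sketched as a simple lookup; every name below is illustrative, not an existing Stan API:

```python
# Which reported quantities answer which of the four questions above.
# (Illustrative names only; this is not an existing summary interface.)
diagnostic_groups = {
    "1_chain_has_clt": ["divergences", "e_fmi", "rhat", "diagnostic_ess"],
    "2_variable_has_clt": ["bulk_ess", "tail_ess", "khat"],
    "3_mean_report": ["ess", "mean_estimate", "mcse"],
    "4_other_pushforward": ["quantile_estimates", "quantile_standard_errors"],
}
```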
I think that it will help users better understand the structure of MCMC, and set us up for adding more diagnostics in the future, if we design the default summary to follow this sequence.
One possibility is to dump everything and let the user interpret.
Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
There were 40 divergences (1%).
Chain 1 E-FMI = 0.5338
Chain 2 E-FMI = 0.5338
Chain 3 E-FMI = 0.5338
Chain 4 E-FMI = 0.5338
mu
Rhat BulkESS TailESS
1.01 695 1124
Mean StdDev MCMCSE
4.50 3.26 0.12
Median MAD MCMCSE
4.53 2.20 0.15
5% 25% 50% 75% 95%
-1.96 2.28 4.53 6.66 10.80
<repeat for other parameters>
<explanation for how to interpret each parameter>
Another is to automate the checks.
Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
There were 40 divergences (1%)! Any divergences indicate biased exploration. Try increasing adapt_delta or reparameterizing.
<no message here if divergences = 0>
Chain 2 E-FMI = 0.02338
Low E-FMI indicates ineffective exploration of the joint distribution.
<no message here if none of the chains have E-FMI less than threshold>
<maybe stop here if the above diagnostics failed>
mu
Rhat for mu is above 1.3! You probably shouldn't trust the mean estimator.
BulkESS is below 10% of total iterations. You probably shouldn't trust the mean estimator.
Mean StdDev MCMCSE
4.50 3.26 0.12
<maybe don't even show this if the above diagnostics failed>
Median MAD MCMCSE
4.53 2.20 0.15
5% 25% 50% 75% 95%
-1.96 2.28 4.53 6.66 10.80
<repeat for other parameters>
Yes, this is much more verbose and doesn’t fit into a table, but it mirrors the sequential structure of what’s actually going on and of how users should be using their fit output. The API routes can be expanded to include something like
summary['global_diagnostics']
summary['mu']$diagnostics
summary['mu']$mean_est
summary['mu']$median_est
summary['mu']$quantiles
to allow for efficient programmatic access.
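In Python terms (the `$` accessors above are R), the proposed structure might look like the nested dictionary below; all names and numbers are illustrative, loosely echoing the mock output earlier in the post:

```python
# Hypothetical programmatic summary object; keys mirror the proposed routes.
summary = {
    "global_diagnostics": {
        "divergences": 40,
        "e_fmi": {1: 0.5338, 2: 0.5338, 3: 0.5338, 4: 0.5338},
    },
    "mu": {
        "diagnostics": {"rhat": 1.01, "bulk_ess": 695, "tail_ess": 1124},
        "mean_est": {"mean": 4.50, "stddev": 3.26, "mcmc_se": 0.12},
        "median_est": {"median": 4.53, "mad": 2.20, "mcmc_se": 0.15},
        "quantiles": {0.05: -1.96, 0.25: 2.28, 0.50: 4.53,
                      0.75: 6.66, 0.95: 10.80},
    },
}

# Programmatic access by name, e.g. for an automated convergence check:
print(summary["mu"]["mean_est"]["mcmc_se"])  # 0.12
```

Accessing fields by name like this, rather than by table column index, is exactly what makes the output format safe to evolve.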
Finally, we should probably leave the current summary functionality as it is. While we didn’t explicitly promise that the table output would be preserved (and in general it’s bad practice to use undocumented indices when named variables are available), we have not documented the summary API well enough to be able to say that we didn’t promise it. Moving forwards we should be explicit about what is officially part of the API and what is not, but the easiest thing to do until Stan3 is to keep summary and introduce a shiny_new_summary.