New R-hat and ESS

I think that we can use the opportunity afforded by these new diagnostics to better communicate the intent of MCMC and to differentiate that intent from the diagnostics used to check the validity of the MCMC toward that intent.

To avoid confusion and make sure we’re all on the same page, let’s be very explicit. We run Stan over some sample space X = \mathbb{R}^{N} defined by the given program, where the $n$th variable in the parameters block defines a one-dimensional projection operator, which I’ll denote \varpi_{n} : X \rightarrow \mathbb{R}. The transformed parameters and generated quantities blocks define additional random variables, but for simplicity let’s ignore those for now.
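
For concreteness, if \pi denotes the posterior and x^{(1)}, \ldots, x^{(S)} the post-warmup draws, then the pushforward mean of the $n$th parameter and its MCMC estimator are

\mathbb{E}[\varpi_{n}] = \int_{X} \varpi_{n}(x) \, \pi(\mathrm{d}x), \qquad \hat{\mu}_{n} = \frac{1}{S} \sum_{s=1}^{S} \varpi_{n}(x^{(s)}).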

The default output considers each projection operator as a random variable and reports

  • The MCMC estimator of its expectation value (equivalently, the mean of the corresponding pushforward distribution).
  • The MCMC estimator for the corresponding standard deviation.
  • The MCMC estimator for the effective sample size.
  • The MCMC estimator for the MCMC standard error, assuming a central limit theorem.

as well as

  • The potential scale reduction factor.
  • MCMC estimators for various quantiles.

The first set of reports is well-defined in the context of a CLT – they are everything needed to estimate the mean and the standard error of that estimator. Outside the context of a CLT, however, these values become less relevant.
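
As a minimal sketch of how these quantities relate in R (assuming draws is a numeric vector of pooled post-warmup draws for one parameter and ess is an effective sample size estimate obtained elsewhere):

est_mean <- mean(draws)         # MCMC estimator of the expectation value
est_sd   <- sd(draws)           # estimator of the pushforward standard deviation
mcse     <- est_sd / sqrt(ess)  # MCMC standard error, meaningful only under a CLT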

The second set of reports starts adding additional structure. The potential scale reduction factor estimates a necessary (but not sufficient) condition for a CLT to exist (and to have kicked in). The quantiles provide additional information about the pushforward distribution of the projection operator for each parameter, albeit without any quantification of the standard error of the estimator.
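
As a reminder of the standard construction, for chains of length N with mean within-chain variance W and between-chain variance B, the potential scale reduction factor is

\hat{R} = \sqrt{\frac{\frac{N-1}{N} W + \frac{1}{N} B}{W}},

which decreases toward 1 as the chains mix.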

I think that one of the things that makes this discussion tricky is the current conflation of these reports, which are trying to do different things, so let’s step back and ask what information we want to report. I believe there are four things of interest:

1 - Does the Markov chain enjoy a central limit theorem, and hence will effective sample sizes and MCMC standard errors make sense in general?

2 - Does a specific random variable have the finite mean and variance needed for the CLT to apply to its expectation value?

3 - What is the estimator for its expectation value and standard error if (1) and (2) look good?

4 - What other properties of the pushforward distribution might we want to look at other than the mean, especially if (2) doesn’t look good?

Divergences, the energy fraction of missing information, and the ensemble of Rhats (new and old) and diagnostic effective sample sizes (as @avehtari has denoted them) all fall into (1). The bulk and tail diagnostic effective sample sizes could also fall into (2), as could other experimental diagnostics like the khat estimators. The standard effective sample size, MCMC estimator, and standard error all belong in (3), and quantile estimators along with any standard errors belong in (4).

I think that it will help users better understand the structure of MCMC, and set us up for adding more diagnostics in the future, if we design the default summary to follow this sequence.

One possibility is to dump everything and let the user interpret.

Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

There were 40 divergences (1%).

Chain 1 E-FMI = 0.5338
Chain 2 E-FMI = 0.5338
Chain 3 E-FMI = 0.5338
Chain 4 E-FMI = 0.5338

mu
  Rhat  BulkESS  TailESS
  1.01      695     1124

   Mean  StdDev  MCMCSE
   4.50    3.26    0.12

 Median    MAD  MCMCSE
   0.12    0.1    0.01

     5%    25%    50%    75%    95%
  -1.96   2.28   4.53   6.66  10.80

<repeat for other parameters>

<explanation for how to interpret each parameter>

Another is to automate the checks.

Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

There were 40 divergences (1%)!  Any divergences indicate biased exploration.  Try increasing adapt_delta or reparameterizing.  
<no message here if divergences = 0>

Chain 2 E-FMI = 0.02338
Low E-FMI indicates ineffective exploration of the joint distribution.
<no message here if none of the chains have E-FMI less than threshold>

<maybe stop here if the above diagnostics failed>

mu 
Rhat for mu is above 1.3!  You probably shouldn't trust the mean estimator.
BulkESS is below 10% of total iterations.  You probably shouldn't trust the mean estimator.

   Mean  StdDev  MCMCSE
   4.50    3.26    0.12
<maybe don't even show this if the above diagnostics failed>

 Median    MAD  MCMCSE
   0.12    0.1    0.01

     5%    25%    50%    75%    95%
  -1.96   2.28   4.53   6.66  10.80

<repeat for other parameters>

Yes, this is much more verbose and doesn’t fit into a table, but it mirrors the sequential structure of what’s actually going on and of how users should be using their fit output. The API routes can be expanded to include something like

summary['global_diagnostics']
summary['mu']$diagnostics
summary['mu']$mean_est
summary['mu']$median_est
summary['mu']$quantiles

to allow for efficient programmatic access.
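
A downstream check might then look like this (everything below the summary call is hypothetical and just mirrors the sketch above):

s <- shiny_new_summary(fit)
if (s[['mu']]$diagnostics$rhat < 1.01) {   # hypothetical field names
  mu_hat  <- s[['mu']]$mean_est
  mu_tail <- s[['mu']]$quantiles[c('5%', '95%')]
}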

Finally, we should probably leave the current summary functionality as it is. While we didn’t explicitly promise that the table output would be preserved (and in general it’s bad practice to use undocumented indices when named variables are available), we have not really documented the summary API well enough to be able to say that we didn’t promise it. Moving forward we should be explicit about what is officially part of the API and what is not, but the easiest thing to do until Stan3 is to keep summary and introduce a shiny_new_summary.


Thanks for the correction (I went further and removed the “an”).

Before worrying too much about the detailed message, I want to think about the content. Is the purpose that we want ESS 100 per chain on average? If so, we should say that, because not everyone is going to be running exactly 4 chains.

Why do we want ESS 100 per chain on average? Does it have to do with instability in the ESS estimator and our wanting to be conservative? I’m just worried that if we only need ESS 50 for inference and we wait for ESS 500, we’re going to be doing ten times more work than we have to.

Yes. I was the one objecting to removing means altogether, because we need them to compute expectations for event probabilities, which are expectations of indicator functions (with values 0 or 1).

I was about to rant about geometric ergodicity here, but Michael made that point later about MCMC CLT.

We could use a diagnostic for improper posteriors, too. As is, we rely on convergence diagnostics and the estimates of means blowing up, just as they will for a Cauchy distribution.
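
As a quick illustration of that failure mode in plain R (nothing Stan-specific):

# The running mean of Cauchy draws never stabilizes: the mean doesn't
# exist, so the estimate keeps jumping no matter how many draws we take.
set.seed(1)
x <- rcauchy(1e5)
plot(cumsum(x) / seq_along(x), type = "l",
     xlab = "number of draws", ylab = "running mean")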

This is very problem dependent. How much will these tails being wrong affect our expectation (or median) calculations?

This does bring up an issue of what we try to “guarantee” for backward compatibility. We decided a while back that the exact wording of error messages wasn’t a backward compatibility issue. But this is a big change to the default R output.

One approach is to not change the output at all in RStan 2.X and instead just add new functions to report the new outputs. Then in RStan 3, we can change things and break backward compatibility without feeling guilty about it.

What will quantiles report for binary variables?

They’re the default for RStan! McElreath devotes a lot of text to trying to use oddball intervals like 89% to get people away from using 95% and interpreting it as a conventional confidence interval.


Thanks for all for the comments.

My current conclusions are

  • We are not going to change the default output soon.
  • We keep the current summary function as is, since several people use it as a primary API for diagnostics (and possibly assume a fixed order).

In order to have some progress, I propose

  • We add new functions to compute the old and new quantities, which would provide the recommended API for diagnostics.
  • Some of the computations could be put in Stan services, but what is needed there depends on what we want the CmdStan summary executable to be able to compute, which in turn depends on what the summary should output, and that is still an open question.
  • In the interfaces, we make a more configurable summary function, so that the user can choose the output (which quantiles, which other values and diagnostics, and in which order). We can have some pre-made themes (similarly to how ggplot has themes) like classic, gelman, goodrich, betancourt, etc. It is also possible that this kind of functionality could live in separate diagnostic toolboxes like arviz in Python and some not-yet-existing R package. A sketch follows this list.
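
A hypothetical sketch of such a configurable summary call (every argument name here is illustrative, not an existing API):

summary(fit,
        probs    = c(0.05, 0.25, 0.5, 0.75, 0.95),   # which quantiles to report
        measures = c("mean", "mcse", "rhat",         # which values and diagnostics
                     "bulk_ess", "tail_ess"),
        theme    = "classic")                        # pre-made selection and ordering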

Yes.

Good point. We say that in the paper, but not in that printout.

Yes, it’s due to instability of the Rhat and ESS estimators. If we don’t know whether the chains are mixing, we need more iterations. If we get independent draws, or an oracle tells us the ESS, then we need fewer iterations.

That is the price we pay for the inference diagnostics, and that is the work we have to do to have a bit more trust in the inference. In another thread someone asked whether they can use fewer iterations when doing replicated simulations. In that case it could be possible to first check that Stan is able to sample efficiently (and how efficiently), and then run fewer iterations with the ESS estimated from the long run. Even then it’s possible, if the replications use different data, that for some data the sampling is less efficient…

What I meant is not that much problem-dependent, but more just about the ratio of draws below and above a certain quantile. But how much sampling problems in the tails affect posterior mean or median calculations is problem-dependent. For example, in symmetric distributions, if we miss the same amount of posterior mass in both tails, then we get only a small error in the mean and the median. If we miss some posterior mass in just one tail, then the error in the mean and the median depends on the integral over that missing mass. It would be great to be able to diagnose that perfectly, but with a finite number of iterations we lack diagnostic accuracy far in the tails. The local and quantile efficiency plots are useful for investigating tail problems, and with more iterations we can see more. What we show by default is always a compromise, and a more cautious user would look at more details.

With high probability, 0 or 1 for any quantile (0 for lower quantiles and 1 for higher quantiles).
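
A quick illustration with a binary indicator whose posterior probability is around 0.3:

# For 0/1 draws the reported quantiles collapse to 0 or 1;
# only the mean carries the event probability.
draws <- rbinom(4000, 1, 0.3)
quantile(draws, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
#  5% 25% 50% 75% 95%   (typical output)
#   0   0   0   1   1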


Wouldn’t we want at least 100 ESS per chain (not on average)? And if we don’t have enough ESS, would it be wise to report \hat R marked with ?? to signal that the estimate is unstable?

It’s easier to estimate Rhat than ESS. Rhat is used to compute ESS, but ESS is not needed for Rhat, so ESS shouldn’t determine whether Rhat is shown. The proposed approach does not compute ESS for single chains or split chains, although it would be possible to compute those, too. What I tried to say is that it’s important to look at the total combined ESS; the “ESS 50 per split chain, or ESS 100 per chain” was just to justify where the good total combined ESS value comes from. It’s not completely clear whether we should require that ESS per chain no matter how many chains there are, or whether a smaller ESS per chain is fine when we have more chains, since then the higher variance in ESS is averaged over more chains and the total combined ESS would again be accurate enough. Two special cases:

  • multimodality: It’s possible that each chain alone explores a local mode efficiently, with a high ESS estimate for each single chain, while Rhat is high and the combined ESS is very small (sketched in code after this list).
  • high autocorrelation: Rhat can be small, but due to high autocorrelation the ESS is small. ESS estimates for single chains have high variance, so it doesn’t matter if some chains have an individual ESS estimate of 90 and some of 110. If you have estimated that you need a specific ESS for your application, then instead of stopping as soon as you cross that limit, it’s safer to go well over it to be certain about the accuracy.
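
A minimal sketch of the multimodality case using rstan::monitor (which takes a 3-D iterations x chains x parameters array of draws); two “chains” stuck in different modes each look healthy alone but fail jointly:

# Per-chain draws look fine, but Rhat is large and the combined ESS collapses.
sims <- array(c(rnorm(1000, -5), rnorm(1000, 5)), dim = c(1000, 2, 1))
rstan::monitor(sims, warmup = 0)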

I wouldn’t worry too much about people relying on a fixed order. I don’t think we want to guarantee the order of outputs.

This would allow users of the old things to keep going with minimal fuss.

That’s fair.

Yes, please! Especially if they are isolated functions for each diagnostic/summary, in which case we don’t have to worry about ordering or naming conventions of composite arrays.

I think it’s fair to implement any diagnostic/summary to be exposed in RStan in the C++. At the very least this would ensure that all interfaces have access to the same functionality, and ideally it would facilitate all of the interfaces actually exposing the same functionality.

Pushing diagnostic/summary functionality into a separate package gets tricky because it becomes harder for us to curate a small set of recommended diagnostics/summaries. With too many options users become overwhelmed and prone to ignoring good diagnostics and relying on bad diagnostics. I don’t think there’s disagreement on what to recommend here, just on how to present it. Although maybe I’m wrong.

Anyway, I think fragmenting the user experience into “themes” is a dangerous road to go down, and it should be considered with great care.

Any idea when we can expect these new diagnostic measures to be available in RStan?

They are in the develop branch but not yet reported by print. You would have to call monitor interactively or in a package: tail_ess, bulk_ess, rhat, etc. It throws warnings if any of these things are suspect, but there is now a concern that it will be too confusing for users to throw warnings about things that are not output by print, in addition to the concern that changing print would break code that calls print programmatically.
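
For example, a sketch (the exact column names of the monitor output here are an assumption based on the develop branch):

# as.array() returns only post-warmup draws, hence warmup = 0.
mon <- rstan::monitor(as.array(fit), warmup = 0)
mon[, "Rhat"]      # rank-normalized Rhat        (column name assumed)
mon[, "Bulk_ESS"]  # bulk effective sample size  (column name assumed)
mon[, "Tail_ESS"]  # tail effective sample size  (column name assumed)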


Hm, regarding throw_sampler_warnings in 8b3624746f4a017aa81, it would be nice if the code were reorganized a bit so that an API like check_sampler_warnings would check everything and return a vector of strings (if any), and then turning that string vector into warnings would be done by a wrapper. This would make it easier when I’m fitting hundreds of similar models in a simulation and want to summarize whether the sampler converged. That is, I could do something like

isBad <- check_sampler_warnings(fit)  # proposed API: character vector of problems, empty if none
if (!length(isBad)) {
  # all good here
} else {
  # result is the per-fit record in the simulation loop
  result$error <- paste(isBad, collapse = "\n")
}

I guess this is the same idea as https://github.com/stan-dev/rstan/issues/594

Hi,

I am not really sure this is the right place for this (I can move it to an issue in the GitHub repo, for example, if it is better there).

There is a typo in the online appendix to the article, in Appendix G: it says "\bar{I}_\alpha has a Beta(\bar{I}_\alpha + 1, S - \bar{I}_\alpha + 1) distribution", but the \bar{I}_\alpha inside the Beta distribution should not be \bar{I}_\alpha but S \bar{I}_\alpha = \sum_{s=1}^{S} I(\theta^{(s)} \leq \theta_\alpha).
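
That is, with the correction applied the statement would read (as I read it; the appendix's exact phrasing may differ):

\bar{I}_\alpha \text{ has a } \mathrm{Beta}\left(S \bar{I}_\alpha + 1,\; S - S \bar{I}_\alpha + 1\right) \text{ distribution, where } S \bar{I}_\alpha = \sum_{s=1}^{S} I\left(\theta^{(s)} \leq \theta_\alpha\right).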


I’d suggest opening an issue on GitHub if you know the repo so that @avehtari will see it.

Thanks @OriolAbril. I’ll fix this. Repo is at https://github.com/avehtari/rhat_ess

Yes, I know the repo. I sent one PR some days ago.

Thanks! I’ll probably see the PR in a few days, after getting my backlog cleared after my vacation.