New R-hat and ESS

New \widehat{R} and ESS

There’s a new paper discussing improvement for \widehat{R} and effective sample size diagnostics for MCMC

  • Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, Paul-Christian Bürkner (2019): Rank-normalization, folding, and localization: An improved \widehat{R} for assessing convergence of MCMC. arXiv preprint arXiv:1903.08008.

You may also read @anon75146577’s blog post

I think the most important points are

  • The new value of R-hat is robust against heavy tails and is sensitive to changes in scale between the chains.
  • The new value of R-hat is parameterization invariant, which is to say that the R-hat value for \theta and \log(\theta) will be the same. This was not a property of the original formulation.
  • New diagnostic effective sample size (ESS) which has the same properties as the new R-hat
  • Recommendation to look at diagnostic ESS seprately for bulk and tails
  • Separating diagnostic ESS, and ESS needed to compute MCSE for a quantity of interest (now assuming that quantity has finite variance)
  • MCSE estimates for Median, MAD and quantiles
  • Many nice plots for further diagnostics

When considering using these with Stan there are two issues (plots are easy as they go to bayesplot and arviz packages)

  1. what is shown in the default output
  2. where the computation happens

Proposal for the default output

The current default output looks like this:

Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd   2.5%    25%    50%    75% 97.5% n_eff Rhat
mu         4.50    0.12 3.26  -1.96   2.28   4.53   6.66 10.80   695 1.01
tau        3.86    0.16 3.12   0.55   1.66   3.02   5.22 11.87   385 1.01
theta[1]   6.41    0.17 5.52  -2.88   2.93   5.88   9.19 19.09  1043 1.00
theta[2]   5.05    0.13 4.73  -4.17   2.26   4.99   7.74 14.96  1316 1.00
theta[3]   4.06    0.14 5.36  -7.77   1.09   4.32   7.23 14.52  1418 1.00
theta[4]   4.89    0.14 4.77  -4.77   1.94   4.88   7.76 14.66  1118 1.00
theta[5]   3.75    0.15 4.59  -6.10   0.96   3.95   6.71 12.33  1002 1.00
theta[6]   4.16    0.13 4.71  -6.42   1.51   4.38   7.00 13.00  1280 1.01
theta[7]   6.44    0.16 4.99  -2.13   3.19   6.04   9.24 18.37   947 1.00
theta[8]   4.99    0.14 5.20  -5.41   1.93   4.86   7.85 16.44  1357 1.00
lp__     -15.03    0.43 6.07 -26.35 -19.40 -15.31 -10.74 -2.91   201 1.02

Samples were drawn using NUTS(diag_e) at Thu Mar 21 16:51:05 2019.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

A new proposal is (there are more examples in the online appendix)

Inference for Stan model: eight_schools_cp.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

             Q5    Q50   Q95   Mean   SD  Rhat Bulk_ESS Tail_ESS
mu        -0.93   4.53  9.85   4.50 3.26  1.01      695     1124
tau        0.71   3.02  9.79   3.86 3.12  1.02      218      229
theta[1]  -1.24   5.88 16.01   6.41 5.52  1.01      991     1590
theta[2]  -2.44   4.99 12.84   5.05 4.73  1.00     1247     2033
theta[3]  -4.99   4.32 12.13   4.06 5.36  1.00     1168     1787
theta[4]  -2.91   4.88 12.73   4.89 4.77  1.00     1024     1905
theta[5]  -3.99   3.95 10.80   3.75 4.59  1.00      951     2013
theta[6]  -3.97   4.38 11.45   4.16 4.71  1.01     1196     1701
theta[7]  -0.91   6.04 15.06   6.44 4.99  1.00      936     1583
theta[8]  -3.08   4.86 13.78   4.99 5.20  1.00     1287     1492
lp__     -24.50 -15.31 -4.61 -15.03 6.07  1.02      210      240

For each parameter, Bulk_ESS and Tail_ESS are crude measures of 
effective sample size for bulk and tail quantities respectively (good values is 
ESS > 400), and Rhat is the potential scale reduction factor on rank normalized
split chains (at convergence, Rhat = 1).

In addition, although not shown by default method the returned object would include values for

Comments on the default output or what is computed?

What is the process to decide how and when the default output changes?

Where the computation happens

Currently CmdStan, RStan and PyStan compute the quantities shown in the default output in C++, but they use three separate C++ files for that so it’s not happening in just one place. To make the default output same the computation needs to be implemented in C++ at least for CmdStan. The same C++ code can be copied to C++ files used by RStan and PyStan. I would keep it as a separate discussion how to make the computation to happen only in one place, as there seems to performance issues in which data structure types are used in Stan, RStan and PyStan, and at least I don’t know how to fix this.

Anyway, I can add computations to C++ in the current way or after someone has made refactoring of the code. Although we can use one computation code for different Stan interfaces, it’s likely that the there will be R and Python code as well as bayesplot and arviz would have new plots needing these computations and both of these packages are generic beyond Stan and probably these packages don’t want include dependency on Stan C++.

Comments on where the computation happens?


I am not as big a fan of having Q5 be immediately next to the margin name. For a long time, users have gotten accustomed to the first column to the right of the margin name being “the important one” and the 5% quantile is perhaps the least important of the various columns. Also, I don’t know if we should be defaulting to 5%.

I might order them as Q50, Mean, Bulk_ESS, Rhat, SD, Q_small, Q_large, Tail_ESS.


I agree the computation needs to happen in one place. PyStan 3 is
planning on using the “Analysis API” to calculate effective sample size,
see . I gather there is no
equivalent function available for rhat yet – or am I wrong about this?

Ideally, there should only be one implementation of split R-hat that can be called like a service. In practice, it has been a bit difficult because CmdStan, RStan, and PyStan do the IO pretty differently. Ideally, RStan would want to pass by reference, but that is not super important for split R-hat if it is being calculated one parameter at a time.

But the split R-hat should calculate on whatever is passed to it. I think it is fine if the interface is responsible for calculating median absolute deviation from the median or whatever and passing that to the split R-hat entry point. For functions that are already readily available in the interface, very little work needs to be done in order to call them. But some of those would need to be implemented in Stan / CmdStan.

Just testing some ideas… Maybe 2.?

  1. Q5 Q50 Q95
  2. Q05 Q50 Q95
  3. Q005 Q050 Q095
  4. Q050 Q500 Q950
  5. Q_5 Q_50 Q_95
  6. Q_05 Q_50 Q_95
  7. Q_005 Q_050 Q_095
  8. Q_050 Q_500 Q_950
  9. q5 q50 q95
  10. q05 q50 q95

That effective sample size computation is computing the components of rhat, but just using them only for ess computation. I was aware of this pull request, but delayed commenting as I assumed we would have finished the paper much earlier. Based on a quick look that pull request moved the computation from chains.hpp to compute_effective_sample_size.hpp and to analyze namespace. chains.hpp has also function split_potential_scale_reduction, which could also be moved to analyze namespace. So there is equivalent function available, but not yet in analyze namespace. Also note that in the current code rhat is based on split chains, but ess is not.

Great work on the improvements to R-hat and ESS.

I’m in favor of changing the defaults, and I vote for Mean, SD, Rhat, Bulk_ESS, Q_small, Q_50, Q_large, Tail_ESS.

Justification: I agree with bgoodri above that the first column to the right of the margin name should be the “important one”. This suggestion is really just starting with Aki et al’s ordering but moving the quantiles to in between Bulk_ESS and Tail_ESS.

@avehtari, ping me later if you would like some help updating the analyze namespace to adopt these new functions.

1 Like

Here’s my minor contribution. I noticed one minor grammatical change. In good values is ESS > 400 the phrase can be made better with something like:

  1. good values for ESS are ESS > 400
  2. an ESS > 400 is considered good

Thanks for all great updates!

Because I and others have code that assumes means are the first column (or more generally, we have code that assumes a certain column order of summary(fit)$summary; but I especially have relied on summary(fit)$summary[,1]) I’d prefer it if the general ordering remained the same: mean/sd/quantiles/n_eff/Rhat.

1 Like

Why aren’t you doing summary(fit)$summary[,"mean"]?

1 Like

No good reason; probably because it’s shorter and it’s been a stable structure for such a long time.

Generally though, I don’t see a good reason to move the mean to a different location at all. With how much emphasis is placed on posterior expectations, why would the mean be 3rd/4th/5th (depending on which quantiles are computed)?

The posterior median is the right quantity to be looking at if your loss function is absolute error, whereas the posterior mean is the right quantity to be looking at if your loss function is squared error. But those loss functions generally do not involve most of the margins in the output, which are only important for judging whether the chains have converged and mixed well enough.

1 Like

Gah, I think this is a terrible reason to preserve column ordering. Order the columns the way that makes the most sense and let this code break.

The bigger question, to me, is why change the column ordering? It seems arbitrary, but people have grown to expect certain orders. It’s not a huge hassle to change the code to specify the column, the broader point is that 1 is expected to be the mean. Visually, it’s what I and others look for first, and if there’s not a great reason to change the column order, I don’t see why the order should be changed. It also matches up better with other R packages’ outputs (est., se., intervals, diagnostics).

1 Like

I think the general principle here is that more important things should be more to the left. The mean of an important quantity is important (conditional on the chains having converged and mixed well), although not necessarily the most important, but in the current output the things that follow the mean column are not as important as the ESS (which is second to the right) and does not include things that we now realize are important like Bulk-ESS, Tail-ESS, etc.

1 Like

Thanks @spinkney for pointing out the error

For others, it seems I should have written more about the justification of the proposed default output.
And please keep separate the default output (including the order of columns in the default output) and the underlying data structure (including the order of columns). Let’s focus now on the default output, and what may be computed for the underlying structure but don’t mix programming issues here (especially bad programming habits).

Also when we discuss about the default output, it’s not meant to be the only thing you can use from MCMC sample and the only thing to report in the paper. It’s meant for quick diagnostic and quick checking whether the posterior seems reasonable. For proper reporting you should consider what is appropriate for the specific application and we can’t have all the possible choices in the default output.

If I remember correctly there were Andrew, Dan, Bob, Paul, Ben, Jonah and some other present when we discussed the design issues for what to show and in which order by default (and again remembering that you can modify what you see with options and other functions etc)

  1. The order of importance: 1) diagnostics, 2) if diagnostics are ok, then they are boring.
  2. The diagnostics have to be together as they need to be checked before anything else. These diagnostics are generic for MCMC and they are not specific for any quantile, mean or sd, so there is no need to try to put them next to some of those. Then we need to choose whether to put them first or last. The decision was to put them last as they are boring if everything is ok. So it’s first quick to find them, but if they are boring they are easy to ignore.
  3. Quantiles, mean and sd have associated MCSEs (remember that Rhat, bulk-ESS and tail-ESS are generic). We considered showing MCSEs, too, but that took way too much space and are often quite small. MCSE are available in the columns of the returned object. Furthermore the user should focus on MCSE for the quantities of interest, and there are functions to compute them for any quantity of interest. Of course MCSE for mean and sd make sense only if mean and sd exists.
  4. I was ready to drop mean and sd as they do not always exist, but others thought that it would be a too big change to drop them. As a compromise they were moved from the first columns to after quantiles which always exist. We considered diagnostics for estimating when mean and sd are well defined, but that is very difficult problem and too noisy to be shown as default for many parameters. If the user wants to report mean and sd, they should check that they are well defined for the specific quantities and it’s easier if focusing only on a small number of means. If someone proposes mean to be in the first column, I ask them to also propose a diagnostic for estimating whether the mean exists.
  5. Quantiles are special as they exist for proper distributions. Q5 and Q95 are not there to hint that you should report 90% central interval. They are there for quick diagnostic whether the sample is sensible. It’s possible that MCMC is working well, but you have error in the model or data, and Q5, Q50 and Q95 give a quick impression of the posterior. Thus Q5, Q50, and Q95 are equally important. For reporting your results, you are free to choose any other quantiles (or mean and sd if they exists)
  6. Quantiles are not always useful. A most common specific case is a binary variable in generated quantities which expectation is used to estimate a probability. For those mean exists and is much more useful than any quantile. This was the deciding factor to keep the mean in the default output. For other discrete distributions with a small number of states, you probably need something more specialized.
  7. Q50 is median it’s udeful to show as it exists even if mean doesn’t exist. But why then Q5 and Q95? I repeat that it’s not to imply that you should report these specific quantiles in the paper. One good reason is that tail-ESS is minimum of Q5-ESS and Q95-ESS. So why Tail-ESS is using these two specific quantiles? It’s because the further in the tail quantile-ESS is the more sensitive it is a diagnostic for sampling problems in tails. If we go too far in the tails we don’t have enough draws in the tail and accuracy of the quantile-ESS diagnostic drops. Q5 and Q95 are empiricially found to be a good compromise considering the default number of iterations used in Stan. Thus Q5 and Q95 connect also to the diagnostic and what is feasible to compute with the current recommendations for the number of iterations to run. If you need to report more extreme quantiles, then it’s likely that you need to run MCMC longer and you can use quantile-ESS plots to analyse in more detail the sampling efficiency in tails. If you prefer to report other than Q5 and Q95, there is a function which computes all the quantiles you want and corresponding ESSs and MCSEs for those.
  8. There is no fixed order for importance of quantities and that’s why the ordering includes also conceptual grouping of items.
1 Like

Ben – I am doing as you suggest (referring to columns by name), but then there are also proposals to change the quantile labeling…

It’s a usability concern for me. To strike a balance between old and new output/diagnostics (and to minimize code breakage, e.g., if people assume 2.5% is a column name, which it has been by default; or mean vs Mean, or sd vs SD): Why not:

mean|sd|[Selected Quantiles, by default, 2.5%, 50%, 97.5%] | Bulk_ESS | Tail_ESS | Rhat

Then you still have posterior descriptions in the left, diagnostics to the right, in the order (and names/casing) people have been accustomed to.
This seems like the least disruptive way of introducing new Rhat and the expanded ESS quantities.
One may not like the mean/sd because it is not always defined, but I am guessing that most models fit by most users have a defined posterior mean and variance, although I could be wrong.

As for the quantiles, it would be somewhat aggravating, to me as a user, to manually specify that I want .025 and .975 every time, or run a function afterward and create my own summary table each time. I know .05 and .95 are used for the tail_ESS quantity, but from a user perspective, the summary table very conveniently offered the most used interval of interest. In the very least, please avoid changing the naming convention of the quantiles and columns: 5% is much clearer than Q5; ‘mean’ is no different than ‘Mean’, but preexisting code assumes ‘mean’.

Obviously, you all will do what you think is right. But these are thoughts from one user’s perspective. Changing the output to the proposed format will be unnecessarily frustrating, if only because it loses default columns, the order is changed, and the names differ.

I repeat: keep the default output and column names of underlying structure as two separate issues. The underlying structure can have more columns than what is in the default output. I hope no-one is parsing the values from the string output. 2.5% should not be cast in stone anywhere and summary and monitor functions allow also other probabilities. Most of the non-default quantities should be computed with functions and not pre-computed and stored in the structures.

I object 2.5% as a default diagnostic value as it would require more draws than the current default recommendation. I don’t mind people reporting 2.5% after they have checked that they can estimate it well. Don’t try to force the default output to be what you want to report as the end of your analysis.

Do you feel lucky? I don’t like that guessing part, I would prefer diagnostics even if they are not perfect. We have plenty of models in wild using, e.g., Cauchy and half-Cauchy priors. If the data happens to be non-informative then the posterior has infinite mean and sd. And there are other examples. The whole point of the new Rhat and ESS’s is that there are cases where mean and sd do not exist, or sd is very large, and thus mean and sd are difficult to estimate while quantiles are easier.

I would be more comfortable if had reliable diagnostic for checking finiteness of mean and sd. I mentioned that quantiles are not good for binary variable, but in that case it’s very obvious and that self-diagnosing.

We can easily change the code so that you can set in .Rprofile your preferred quantiles. We can also add options for the ordering of the columns in the output, etc. RStan is already using internally default_summary_probs <- function() c(0.025, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.975), but it seems it’s not possible to override that yet from .Rprofile.

Most used by who? I don’t see 2.5% and 97.5% that often.

The change suggested by @paul.buerkner was to make what is shown to match the underlying column name which would work without escapes in tidyverse. I don’t have strong feelings on this. For me 5% and Q5 are equally clear. Good point about Mean vs mean in the underlying structure. We can show different strings in the output and use different strings as column names in the underlying structure. The tradeoff is between what looks prettier and what makes it easier to remember the column names in the underlying structure.

There is no one right here. I appreciate that you have spent your time to comment here, it’s helpful. We are going to make some compromise, but hopefully we can have some progress and not get stalled just because any change is going to cause some extra work for someone.

Would it make it easier for you if the change would be only for the default output, but you could still configure your output to look as you wish? It would be quite easy to add configuration options for the default output. The underlying structure should have some backward compatibility given good programming practices have been used in the packages using it, but we are also introducing new quantities and to benefit from those, some code changes may be necessary.