ESS/Rhat behavior for chains numerical consistent with constants

Right now the ESS and Rhat codes treat chains where a parameter value is approximately constant slightly differently. The ESS code computes autocorrelation estimators and computes an asymptotic variance/ESS estimator naively, while the Rhat code will reduce to 1 before any other calculation.

We should make this behavior consistent, but the question is what is the ideal behavior?

If the function whose expectation is being taken is truly constant than \mathbb{E}_{\pi}[f] = f exactly and the MCMC standard error should be zero. In that case we could report \hat{R} = 1 and \mathrm{ESS} = \infty.

On the other hand the trace might be numerically consistent with a constant because the chain was just dreadfully slow, in which case the autocorrelation/asymptotic variance/ESS estimators, as well as \hat{R} shouldn’t be trusted.

Without external information there’s no way to distinguish between these two possibilities. We know that we have actual constants in practice, coming from constrained types or generated quantities, and dynamic HMC should be fast enough that we shouldn’t see a trace with zero variation when it should be varying, which would suggest going with the former behavior. That said my preference is to be conservative and not make a claim between either behavior, instead throwing a NaN and letting the user apply their domain expertise as to which possibility it might be.

Any thoughts on this matter? @avehtari? Daniel_Simpson?

When considering different options, we thought some users might think reported ESS = Inf worrysome, and thought that ESS = S (where S is the sample size) would be better indication that there is no error. With ESS = S and constant function, variance is 0 and thus MCSE is also 0. I think this solution is good if we assume that we see only true constants, and if we see fully stuck chains then other diagnostics such as divergences or other parameters having bad behavior would indicate sticking problems.

1 Like

I found the related pull request discussion, and we specifically didn’t want to use NaN for constant case. It is also possible that Stan returns NaNs or Infs, and we thought it’s useful if the diagnostics show NaN/Inf if the values include NaN/Inf so that we can see something bad is happening, Thus NaN/Inf would not be good diagnostic values for constant finite values.

Who is we?

The problem is that ESS = S is not particularly diagnostic in the era of antithetic chains. There is nothing suggesting that the user take a closer look. NaN/Inf values only in the Rhat and ESS also would be ambiguous to chains with NaN/Inf values because the latter would also have NaN/Inf means that would clearly separate them out.

Sorry, I wrote in a hurry. Vehtari, Gelman, Simpson, and Bürkner as in https://arxiv.org/abs/1902.06281

I don’t see the problem considering what I wrote above. I’ll get back to this next week when I’m back in Aalto.

Let me be more explicit.

First let’s consider the asymptotic variance. Naively one might say that for a constant function the samples are exact so that the summed autocorrelations equal 1 identically. But more formally for a constant function each sample is perfectly correlated with every other sample so the sum of autocorrelations equals infinity, and the effective sample size always equals 0.

When computing the MCMC standard error there is another ambiguity as any finite value for the effective sample size will yield the same standard error of zero when the variance is zero. Even taking the vanishing effective sample size argument from above one would get 0/0 and hence an ambiguous value. This “there is no unique answer” circumstance is technically what NaN corresponds to, which is why NaN is a natural choice for the effective sample size.

At the same time if one uses NaN then the simple MCMC standard error calculation won’t give the desired value. This is easy enough to check for, NaN effective sample size but finite variance which is different from having NaN states in the chains which would give both NaN effective sample size and NaN variance, but does require additional checks that ensure these special cases are treated consistently through the functions.

To simplify things I’m fine with returning some default, finite value of effective sample size with the chains appear constant within floating point precision. But I’m not comfortable with returning ESS = N. I would much rather return ESS = 1 or something closer to the zero value that is closer to what would happen in an exact calculation.

I’m back from StanCon and weekend. I’ve thought a bit more about constant cases and I’m willing to reconsider my previous suggestion

  1. True constant as part of the parameterization, for example, as in cov_matrix or user defined in transformed parameters.
  2. Not constant but with finite number of draws likely to have only one unique value, for example, if p(theta<0) is very small then in generated quantities I = theta < 0 ? 1 : 0 can with finite draws have all draws equal to 0.
  3. Not constant, but the sampler is stuck. If there are several chains initialized with different initial values, it is unlikely that after warm-up all chains would have same value (especially when having continuous parameters), but I think we should consider that someone is using our diagnostic functions only with one chain.

When were were implementing the code for new Rhat and ESS, if I remember correctly we only considered case 1 as that and previous output had caused confusion among users. For 1. we thought we should not give values which would make the users to worry and for case 1. we then thought ESS=S would be good. Now after realizing that we can’t know from the posterior sample object whether the parameters having just one unique value belong to 1. or 2., I think that it would be better to let the users to worry a bit and check whether the constant is due by definition or due to having a finite sample size. ESS=1 could be a good choice for this.

A problem with ESS=1 is that the case 1. is probably the most common of these, and then the default diagnostic warnings would warn users that the smallest ESS is too low. NaN would be better as we can then check with min(..., na.rm=TRUE), and give more detailed warnings. If there are NaN’s in ESS, then it would be up to user to check whether these are due to NaN’s in values, or one of the cases 1., 2. or 3. It’s possible that introducing NaN’s will break some code.

I’m tagging @paul.buerkner, @Daniel_Simpson, @jonah, @bgoodrig

I support NaN in principle but, as you said, this may break some code. For instance, the ess handling functions in loo/bayesplot have a tendency to break for NA values. Still I think this is the most consistent option going forward.

As discussed at stancon, what we need is a package that handles all the convergence and related functionality in a consistent manner. If all other of our packages then rely on this functionality we can easily enforce the desired behavior and related handling of exceptions (be it NaN or something else).

btw. when we use csv for intermediate strorage as in CmdStan and interfaces using CmdStan, we (currently) have less precision than floating point.

The only upstream functions that consume effective samples sizes are the MCMC standard errors and diagnostic checks, both of which can be made robust to returning a NaN. The standard error code can compare the variance and effective sample size, returning 0 only if the variance is zero and returning NaN if both are zero. Diagnostic functions can return warnings noting the ambiguity and asking the user to verify what is expected to be constant. Really the only cost is repeating the constant array checks, but splitting things up like this will inevitably require redundant calculation (and everything is swamped b more executive redundancies like computing the variances many times).

Sure, but we’re limited by that regardless. Losing precision just means that more things look like constants. Returning ‘NaN’ consistently puts the onus on the user to verify what is supposed to be constant and what isn’t.

The problem there is that it’s what you could get from a true random generation from the process — all zeroes and thus an effective sample size of N by definition that our algorithm just can’t sort.

Will the sampler ever get stuck and not diverge?

Maybe rather than NaN we can print ?? or something? NaN makes it look like an algorithm failed somewhere.

Just to clarify – this behavior will be for how ESS and Rhat are calculated. Diagnostic utilities like individual checks and summary utilities like print can and should report more context to the user so that they know the possible interpretations.