Suggestions on reporting HMC diagnostics for scientific publication

Hi there,

I am currently writing a manuscript in which I used NUTS to sample the model. I wonder if there are any recipes for reporting MCMC diagnostics for scientific publication? I have read https://cran.r-project.org/web/packages/bayesplot/vignettes/visual-mcmc-diagnostics.html on visual MCMC diagnostics using the bayesplot package and https://mc-stan.org/docs/2_25/cmdstan-guide/diagnose.html on cmdstan_diagnose(). However, I am lost on the selection of diagnostics because there are so many of them, computed in different ways and over different parameters, and I would not like to distract readers with tons of diagnostic plots. Besides, Bayesian inference, especially with NUTS, is not yet mainstream in my field (neuroimaging).

Do you have any suggestions? What I could think of so far is one figure (2x2) combining the following (a rough sketch of how it could be assembled is given after the list):

  1. a plot of divergences using mcmc_nuts_divergence(np_cp, lp_cp) or stan_diag(object, "divergence");
  2. a plot of Rhat using mcmc_rhat() or stan_rhat();
  3. a plot from stan_diag(object, "sample") to show the log posterior and the acceptance statistic;
  4. a plot from mcmc_neff() to show ESS.
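
Roughly, I imagine something like this (a sketch assuming my stanfit object is called fit; the extractor and plotting calls follow the bayesplot vignette linked above, and gridExtra is just one convenient way to arrange the panels):

library(rstan)
library(bayesplot)

np <- nuts_params(fit)    # sampler diagnostics (divergences, energy, ...)
lp <- log_posterior(fit)  # lp__ per post-warmup iteration

p1 <- mcmc_nuts_divergence(np, lp)  # 1. divergences against lp__
p2 <- mcmc_rhat(rhat(fit))          # 2. split-Rhat for all parameters
p3 <- mcmc_nuts_acceptance(np, lp)  # 3. acceptance statistic and lp__
p4 <- mcmc_neff(neff_ratio(fit))    # 4. n_eff / N ratios

# arrange the four panels into a single 2x2 figure
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2)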

Another question: do you think it is legitimate to show Rhat and neff/N for only the key parameters? I have more than 100 parameters, and I am planning to show the key ones in the main text and a version with Rhat and neff/N for all parameters in the supplementary material.

Thanks a lot,
ZC

Firstly, Stan does not use NUTS but rather a more advanced version of dynamic Hamiltonian Monte Carlo.

I personally recommend checking the diagnostics included under check_all_diagnostics in https://github.com/betanalpha/knitr_case_studies/blob/master/stan_intro/stan_utility.R. Visualizations are much better for exploring problems than trying to communicate that no problems could be found; for more see https://betanalpha.github.io/assets/case_studies/markov_chain_monte_carlo.html#44_diagnosing_equilibrium and https://betanalpha.github.io/assets/case_studies/identifiability.html.
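
For example, a minimal sketch of running those checks (this assumes you have downloaded the linked stan_utility.R into the working directory and have a stanfit object named fit):

util <- new.env()
source("stan_utility.R", local = util)  # helpers from the file linked above
util$check_all_diagnostics(fit)         # divergences, treedepth, E-FMI, n_eff, Rhat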

Thanks for the reply. Could you please explain the first comment a bit more? I am confused that it is not using NUTS; of course, I am relatively new to all of these techniques. The following is the summary of my model fit and it says the samples were drawn using NUTS… I used rstan, by the way.

Inference for Stan model: HMAnalysis_noncentered.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
…
…
Samples were drawn using NUTS(diag_e) at Sun Dec 13 14:23:46 2020.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

This is an unfortunate vestigial warning from the early days of Stan, when we were using the NUTS algorithm as specified in https://jmlr.org/papers/v15/hoffman14a.html, that has not yet been changed as Stan has continuously improved. A more recent version of the algorithm used in Stan is discussed in Appendix A.5 of https://arxiv.org/abs/1701.02434, although there have been some additional improvements since then. Because of this constant evolution I haven’t bothered trying to push new names, instead referring to the more general “dynamic Hamiltonian Monte Carlo” along with the particular Stan version so that the exact algorithm implementation can be identified.

Thanks. It’s really good to know. Honestly, I did not know this and always thought that I was using NUTS… I will read https://arxiv.org/abs/1701.02434 and change my description of the sampling accordingly. Maybe it would be nice to clarify this somewhere (I understand it is tricky since Stan is always improving); at least one application paper published in 2020 that I am reading these days also mentions using NUTS for sampling.

One thing to note is that if you report lp__, it might not be the actual posterior log probability but a quantity that is off by a constant, because terms that don’t depend on parameters are dropped when you use ~ sampling statements.

1 Like

It is understandable that people think it is NUTS because for example the documentation of rstan (2.21.2) says

RStan is the R interface to the Stan C++ package. The RStan interface (rstan R package) provides:

  • Full Bayesian inference using the No-U-Turn sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC)

and the algorithm argument of rstan::sampling is documented as

One of sampling algorithms that are implemented in Stan. Current options are "NUTS" (No-U-Turn sampler, Hoffman and Gelman 2011, Betancourt 2017), "HMC" (static HMC), or "Fixed_param" . The default and preferred algorithm is "NUTS" .

I don’t know what the algorithm is called in other interfaces, but it would be good to change the terminology if dynamic HMC is a better term. How about changing the default argument to something like "DHMC" and maintaining compatibility, so that algorithm = "NUTS" is interpreted as DHMC?

4 Likes

I am not sure which way is best for the development team, and it is indeed tricky to keep everything updated. Besides, from what I understand from Appendix A.5 of [1701.02434] A Conceptual Introduction to Hamiltonian Monte Carlo by @betanalpha, it is not so wrong to use the term NUTS, but it is more precise, or safer, to say “dynamic Hamiltonian Monte Carlo”, since the current version is sort of an improvement on top of NUTS:

Although Stan (Stan Development Team, 2017) was first developed around this original No-U-Turn sampler, more recent releases have adopted some modifications. In addition to the generalized No-U-Turn termination criterion, the Hamiltonian Monte Carlo implementation in Stan uses multinomial sampling from each trajectory instead of slice sampling, which provides a significant improvement in overall performance.

1 Like

Yes, the current documentation is problematic.

Historically the core Stan library and its interaction with the interfaces was built around the initial No-U-Turn implementation, and so “nuts” was hardcoded all over, including many of the warnings and function names. Very quickly the dynamic Hamiltonian Monte Carlo implementation evolved past that initial No-U-Turn implementation but none of the names changed because it avoided having to propagate changes to the surrounding code and interfaces.

I was never picky about the name until people started referring to the No-U-Turn paper as the particular algorithm driving Stan in papers. Even then I didn’t really consider changes until papers started showing up that compared new algorithms to hand written implementations of the No-U-Turn sampler from the original paper but claimed that the comparison demonstrated better performance than Stan.

The problem is that, regardless of the appropriateness of the terminology, the “No-U-Turn” sampler has become synonymous with the paper and hence is not appropriate when discussing Stan. Not only has Stan evolved, that evolution is not minor. The current Stan implementation uses a richer theoretical understanding of the method to generalize the important components of the “No-U-Turn” method, so it’s not really just an upgraded version. Even the “No-U-Turn” condition itself has been modified to be more compatible with the geometry!

Because the naming is so integrated throughout the code base it’s a nontrivial change, and there were plans to update it to something more generic when incorporating breaking changes to the language, interfaces, and core library in a “Stan3” update. Unfortunately various problems arose that have kept that upgrade at bay for the past few years.

Anyways, for the time being I strongly recommend not referring to the “No-U-Turn” sampler unless you’re referring specifically to the algorithm derived in the original paper. When referring to Stan it’s much less prone to misunderstanding if you instead refer to “an implementation of dynamic Hamiltonian Monte Carlo” along with a Stan version, so that the exact algorithm used can be identified unambiguously.

1 Like

Discussion like this happens every once in a while, like here, here, and here. I myself get frustrated when I have to dig into PRs and code to figure out what has been done, and the lean comments don’t always help to understand why. I just wish there were a doc somewhere summarizing how we have moved past the original NUTS, because I can see this question will be asked again and again.

2 Likes

That was very much the intention of the conceptual introduction paper, https://arxiv.org/abs/1701.02434, and you can see how long it took even while skipping many of the critical details.

One of the key challenges is that the full sampler comprises many interlocking parts, the design of which is motivated by different theoretical analyses and the implementation of which is a bit scattered due to the modular design of the code (this is also why there are so many API routes). The modularity was much more important early on in the development of Stan, when it wasn’t clear what the optimal implementation of Hamiltonian Monte Carlo would be.

I can try to put together a short description and comparison to the No-U-Turn® sampler but it will necessarily be superficial given all of the details that go into the construction of the full algorithm (choice of kinetic energy, numerical integration, tuning of numerical integration, tuning of numerical trajectory lengths, correcting for numerical error, etc).

5 Likes

May I ask another question regarding the criterion for Neff? In Visual MCMC diagnostics using the bayesplot package, it is mentioned that the ratio of neff over N should be larger than 0.1:

“These particular values are arbitrary in that they have no particular theoretical meaning, but a useful heuristic is to worry about any neff/N less than 0.1”

Whereas, in the case study Robust Statistical Workflow with RStan, it says that 0.001 is the threshold

When we generate less than 0.001 effective samples per transition of the Markov chain the estimators that we use are typically biased and can significantly overestimate the true effective sample size.

This threshold is also used under check_all_diagnostics in https://github.com/betanalpha/knitr_case_studies/blob/master/stan_intro/stan_utility.R

# Checks the effective sample size per iteration
check_n_eff <- function(fit, quiet=FALSE) {
  fit_summary <- summary(fit, probs = c(0.5))$summary
  N <- dim(fit_summary)[[1]]

  iter <- dim(rstan:::extract(fit)[[1]])[[1]]

  no_warning <- TRUE
  for (n in 1:N) {
    if (is.nan(fit_summary[,'n_eff'][n])) {
      if (!quiet) print(sprintf('n_eff for parameter %s is NaN!',
                  rownames(fit_summary)[n]))
      no_warning <- FALSE
    } else {
      ratio <- fit_summary[,'n_eff'][n] / iter
      if (ratio < 0.001) {
        if (!quiet) print(sprintf('n_eff / iter for parameter %s is %s!',
                          rownames(fit_summary)[n], ratio))
        no_warning <- FALSE
      }
    }
  }
  if (no_warning) {
    if (!quiet) print('n_eff / iter looks reasonable for all parameters')
    if (quiet) return(TRUE)
  }
  else {
    if (!quiet) print('  n_eff / iter below 0.001 indicates that the effective sample size has likely been overestimated')
    if (quiet) return(FALSE)
  }
}

If I use neff_ratio() to diagnose my fit, it gives me a few parameters with neff/N < 0.1, but the report from check_all_diagnostics says

n_eff / iter looks reasonable for all parameters

I understand that the definition of a good number of effective samples is somewhat arbitrary, but these two thresholds are very different. Could it be that neff_ratio() is using a percentage, so that 0.001 means 0.1%?

If I understand correctly, neff is an estimate of the equivalent number of independent samples, so would it be reasonable to use the definition of Monte Carlo error given in BDA3 (p. 267) to choose the neff threshold according to how much Monte Carlo error is acceptable for a given problem? For instance, if we have 100 effectively independent draws, Monte Carlo error inflates the uncertainty coming from the actual posterior variance by a factor of only about 1.005, and this might be acceptable for a particular research question.
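
As a rough illustration of the BDA3 argument (just back-of-the-envelope numbers, not from my model): with S effectively independent draws the posterior standard deviation is inflated by a factor of sqrt(1 + 1/S).

S <- c(10, 100, 1000, 4000)
round(sqrt(1 + 1 / S), 4)   # Monte Carlo inflation of the posterior sd
#> [1] 1.0488 1.0050 1.0005 1.0001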

Thanks,
ZC

I dug into the code in mcmc-diagnostics.R, used in Visual MCMC diagnostics using the bayesplot package, from bayesplot/mcmc-diagnostics.R at master · stan-dev/bayesplot · GitHub. According to lines 461 to 472, it is not taking a percentage of Neff/N:

diagnostic_color_labels <- list(
  rhat = c(
    low  = expression(hat(R) <= 1.05),
    ok   = expression(hat(R) <= 1.10),
    high = expression(hat(R) > 1.10)
  ),
  neff_ratio = c(
    low  = expression(N[eff] / N <= 0.1),
    ok   = expression(N[eff] / N <= 0.5),
    high = expression(N[eff] / N > 0.5)
  )
)

So there is a factor of 100 between the Neff criteria in check_all_diagnostics in knitr_case_studies/stan_utility.R at master · betanalpha/knitr_case_studies · GitHub and in mcmc-diagnostics.R. Which one is correct? Or has the understanding of this criterion changed over time?
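
For what it is worth, a quick way to compare the two heuristics directly (a sketch assuming my stanfit object fit; neff_ratio() returns the plain proportion n_eff / N, so no percentage conversion is involved):

ratios <- bayesplot::neff_ratio(fit)
sum(ratios < 0.1)    # parameters flagged by the bayesplot coloring threshold
sum(ratios < 0.001)  # parameters flagged by the stan_utility.R check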

Thanks.

neff/N < 0.1 is an extremely aggressive threshold that even Hamiltonian Monte Carlo can fail when it is working well.

You are correct that the effective sample size determines the Markov chain Monte Carlo estimator standard error, which, when compared to a Monte Carlo error, quantifies how large an independent ensemble would be needed to produce effectively the same error.

That said, there are a few complications. The first is that the Markov chain Monte Carlo estimator standard error is only well-defined when a Markov chain Monte Carlo central limit theorem exists, which we absolutely cannot take for granted. The second is that even when a central limit theorem exists we can’t evaluate the exact effective sample size and instead have to estimate it using the empirical autocorrelations in the realized Markov chains. For much more on both topics see Markov Chain Monte Carlo in Practice.

Estimation of the effective sample size is tricky, and it can be especially problematic when the Markov chains exhibit extremely strong autocorrelations. In this case the effective sample size estimators become extremely noisy and are prone to drastically overestimating the exact effective sample size. My recommended check for neff / N < 0.001 is a heuristic based on historical empirical evidence meant to identify cases where the effective sample size estimator is likely to be untrustworthy. See Markov Chain Monte Carlo in Practice for more.

On the other hand, the neff/N < 0.1 threshold in bayesplot largely checks whether the effective sample size is large enough to produce relatively precise Markov chain Monte Carlo estimators for the default Markov chain lengths in Stan. I strongly advise against checks like these – if you’re interested in whether your estimators are accurate enough for a given analysis then just look at the Markov chain Monte Carlo standard errors directly. After checking for any signs (large Rhat, divergent transitions, low E-FMI) that a Markov chain Monte Carlo central limit theorem might not exist, of course!
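
As a minimal sketch of that last point (assuming a stanfit object fit): rstan’s summary already reports the Markov chain Monte Carlo standard error as se_mean, which is approximately sd / sqrt(n_eff), so it can be compared directly against the precision a given analysis requires.

s <- summary(fit)$summary
mcse <- s[, "se_mean"]                     # MCMC standard error reported by rstan
by_hand <- s[, "sd"] / sqrt(s[, "n_eff"])  # roughly the same quantity by hand
head(cbind(mcse, by_hand))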

1 Like

Thanks a lot, Michael. This is very helpful.

1 Like