Ensuring convergence without any graphical diagnostics

Hi everyone,

I am trying to use Stan and Torsten (a collection of Stan functions that facilitates the analysis of pharmacometric data) to individualize the parameters of a pharmacokinetic NLME model once measurements become available. Ultimately, this should be turned into a proof-of-concept R Shiny app. This is how it currently looks:

You can see a predicted concentration-time profile; these predictions are based on parameter draws from the posterior distribution. One key consideration is that users who are not at all familiar with Bayesian statistics (e.g., clinicians) need to be able to use this app. It shall remain a proof of concept (not a medicinal product), but I still want to think about how to ensure that the predictions we obtain are reliable (i.e., that the MCMC sampler has converged). Since we cannot rely on visual diagnostics at all (that would mean educating users on how to spot non-convergence), the convergence diagnostics have to be fully numerical and pre-defined. Ideally, we can run these checks automatically before showing any output and display an error in case of convergence issues.

Here are my questions:

  1. Is it possible to reliably check convergence to the actual posterior distribution using numerical checks only?
  2. Which metrics and which thresholds should be used for this purpose if we want to be conservative?

I had a look at the Bayesian Workflow paper and found the following under Section 3.2, "How long to run an iterative algorithm":

Recommended standard practice is to run at least until \hat{R}, the measure of mixing of chains, is less than 1.01 for all parameters and quantities of interest (Vehtari et al., 2020), and to also monitor the multivariate mixing statistic R^* (Lambert and Vehtari, 2020).

(…) it might seem like a safe and conservative choice to run MCMC until the effective sample size is in the thousands or Monte Carlo standard error is tiny in comparison to the required precision for parameter interpretation (…)
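For reference, all of these quantities can be computed purely numerically; here is a minimal sketch using the posterior package, assuming the model was fit with cmdstanr (so that `fit$draws()` returns a draws object). The 1.01 threshold is the one from the quote above, and 400 is the rough ESS minimum suggested in Vehtari et al., 2020 for four chains:

```r
# Sketch only: `fit` is assumed to be a cmdstanr fit object.
library(posterior)

draws <- fit$draws()
diag  <- summarise_draws(draws, "rhat", "ess_bulk", "ess_tail")

all(diag$rhat     < 1.01, na.rm = TRUE)  # mixing of all chains
all(diag$ess_bulk > 400,  na.rm = TRUE)  # stable central estimates
all(diag$ess_tail > 400,  na.rm = TRUE)  # stable interval endpoints
```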

From @avehtari (Vehtari et al., 2020):

A small value of \hat{R} is not enough to ensure that an MCMC sample is useful in practice (Vats and Knudson, 2018). The effective sample size must also be large enough to get stable inferences for quantities of interest.

When coupled with an ESS estimate, \hat{R} is the most common way to assess the convergence of a set of simulated chains.

And from Lambert and Vehtari, 2020:

Here, we introduce a new method for diagnosing convergence based on how well a machine learning classifier model can successfully discriminate the individual chains. We call this convergence measure R^*. In contrast to the predominant \hat{R}, R^* is a single statistic across all parameters that indicates lack of mixing, although individual variables’ importance for this metric can also be determined.

It seems that \hat{R}, effective sample size, and R^* are commonly applied numerical diagnostics. But I would be interested in your opinion: what would you check in one single function that has to return either "convergence" or "non convergence"?
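To make the question concrete, something like the following is roughly what I have in mind (just a sketch; the thresholds are placeholders rather than validated recommendations, and `draws` is a posterior draws object as above):

```r
# Sketch of a single pass/fail convergence check; thresholds are placeholders.
check_convergence <- function(draws, rhat_max = 1.01, ess_min = 400) {
  diag <- posterior::summarise_draws(draws, "rhat", "ess_bulk", "ess_tail")
  ok <- all(diag$rhat     < rhat_max, na.rm = TRUE) &&
        all(diag$ess_bulk > ess_min,  na.rm = TRUE) &&
        all(diag$ess_tail > ess_min,  na.rm = TRUE)
  if (ok) "convergence" else "non convergence"
}
```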

Thanks a lot for your help!

P.S.: The exact details of the model do not really matter, but you can find the code in the thread Simultaneously estimating IIV and IOV estimate for first cycle in pharmacokinetic NLME model using Torsten if you are interested.


Numerical checks are more reliable than visual checks.

There is no easy answer to that. The first step is to diagnose mixing, and then to check that you have enough iterations to report the desired number of digits. There is no magical threshold that would guarantee good mixing, so it's always some compromise. See How many iterations to run and how many digits to report for how to check whether the number of posterior draws is big enough for the desired number of digits.

In an automated approach, \hat{R} has the downside that you are looking at many quantities, each of which has some variability given the finite sample size; the more quantities you have, the more likely it is that you see false positives, and then you would need to adjust either the sample size or the threshold.
R^* has the benefit of providing just one value while examining the joint posterior. If R^* looks good, then focus on MCSE as discussed in How many iterations to run and how many digits to report.
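As a sketch of that order of checks, assuming a posterior draws object as before: rstar() from the posterior package (it additionally requires caret) gives the single joint-posterior statistic, and the MCSE functions quantify whether the draws support the precision you want to report. The R^* threshold and the variable name `conc_pred` are placeholders:

```r
# Sketch only: `conc_pred` stands in for whatever quantity you report,
# and the R* threshold is a placeholder (R* close to 1 indicates good mixing).
library(posterior)

rs <- rstar(draws)  # one statistic for the whole joint posterior
if (rs < 1.05) {
  x <- extract_variable_matrix(draws, "conc_pred")
  print(mcse_mean(x))                             # MCSE of the posterior mean
  print(mcse_quantile(x, probs = c(0.05, 0.95)))  # MCSE of interval endpoints
}
```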
