A new convergence diagnostic with Ben Lambert: “R*: A robust MCMC convergence diagnostic with uncertainty using gradient-boosted machines” https://arxiv.org/abs/2003.07900. The idea is to use a machine-learning classifier for multivariate convergence diagnostics (whereas, e.g., Rhat is usually computed per parameter).

A machine-learning classifier (here gradient-boosted regression trees) is trained to predict which draws come from which chain. If the chains are mixing well, the classifier cannot beat random guessing; if mixing is bad, draws from different chains can be separated.

Ben had the idea and wrote the first version and then contacted me for comments. I was very skeptical as machine learning classifiers can be sensitive to algorithm parameters. I proposed additional experiments and after several iterations and more experiments I was convinced.

The benefit is that this can be done once for all parameters, and it can detect mixing problems that may not appear in the marginals. The classifier is non-parametric and doesn’t assume finite variance (which the old Rhat does). We believe it’s going to be a useful complementary approach.

@avehtari, could you say a bit more about why you thinned post-warm-up iterations? Sometimes there was no thinning, sometimes thinning by a factor of 3, and other times by a factor of 5. It seems that in general this community suggests thinning is not necessary when using Stan.

Was it an issue with autocorrelation? Or more to do with the computational complexity of boosted regression trees, maybe computational time, memory constraints, or both? All of the above?

Thinning reduces information, so if thinning is not needed, then thinning is not recommended.

In many cases the dynamic HMC in Stan is so efficient that the default number of iterations provides sufficient accuracy and there are no memory issues.

In some cases dynamic HMC in Stan can have such high autocorrelation that, if there are many parameters, it may be beneficial to thin to save disk space, memory, and computation time for derived quantities.

In some cases we really need almost independent draws (e.g. SBC), and then we need to thin even antithetic chains (which have better efficiency than independent draws for certain expectations).

So the generally seen recommendation not to thin holds often, but it is not the recommendation for every case.
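To illustrate the autocorrelation point above, here is a minimal sketch (assumed example, not from the paper): draws from an AR(1) process with lag-1 correlation 0.8 stand in for an autocorrelated MCMC chain, and keeping every 5th draw drops the lag-1 autocorrelation to roughly 0.8^5 ≈ 0.33. The helper names (`ar1`, `lag1_autocorr`) are made up for this sketch.

```python
# Thinning an autocorrelated chain: keeping every k-th draw reduces
# lag-1 autocorrelation from rho to roughly rho**k, at the cost of
# discarding draws (and hence information).
import random

def ar1(n, rho, seed=0):
    """Simulate n draws from an AR(1) process, a stand-in for an
    autocorrelated MCMC chain."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0, 1)
        out.append(x)
    return out

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

chain = ar1(20000, rho=0.8, seed=2)
thinned = chain[::5]  # keep every 5th post-warm-up draw
print(lag1_autocorr(chain))    # close to 0.8
print(lag1_autocorr(thinned))  # much lower, roughly 0.8**5
```

The thinned chain carries less information than the full chain, which is exactly why thinning is only worthwhile when storage, memory, or downstream computation costs dominate.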

Thanks, @avehtari. If you will, a few follow-up questions. If you had infinite computing resources, would you have thinned? For a fixed model, did R*'s answer to the question “have the chains converged?” vary depending on the amount of thinning? How did you choose the factor x to thin by?

With infinite computing resources we don’t need convergence diagnostics. We can choose trivially slow algorithms that in infinite time produce the exact expectation.

R* is not sensitive to thinning, that is, it’s not sensitive to the autocorrelation. Ben did test the effect of the autocorrelation.

Ask Ben for details, but the choice is mostly arbitrary and for computational convenience. It would be different if approximately independent draws were needed, but that is not the case for R*.