After running 4 chains on a model (1000 warmup iterations and 1000 sampling iterations each), I ran summary diagnostics on each chain separately, as well as on all 4 chains combined.
I found that for some estimands, Rhat was large (>1.1) when computed on a single chain, but satisfactory (<1.05) when computed on all chains together. I’m wondering what to make of this. Should I be happy with the results, since the Rhat estimate is noisier when it is based on a single chain? Or should I be sad (!), since the definition of Rhat involves averaging across chains, which could wash out “bad” effects that are only apparent in a single chain?
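To make the "washing out" worry concrete, here is a minimal sketch of the comparison I have in mind. It assumes the diagnostic is the classic (non-rank-normalized) split-Rhat, where each chain is split into halves, which may not be exactly what my summary tool computes. The function name `split_rhat` and the simulated draws are hypothetical and only for illustration; they are not my actual model output.

```python
import numpy as np

def split_rhat(draws):
    """Classic split-Rhat for one estimand (an assumption: not rank-normalized).

    draws: array of shape (n_chains, n_iterations), or 1-D for a single chain.
    """
    draws = np.asarray(draws, dtype=float)
    if draws.ndim == 1:
        draws = draws[np.newaxis, :]
    n_iter = draws.shape[1]
    half = n_iter // 2
    # Split every chain into its first and second half.
    halves = np.concatenate([draws[:, :half], draws[:, n_iter - half:]], axis=0)
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()      # W: mean within-sequence variance
    between = n * halves.mean(axis=1).var(ddof=1)   # B: between-sequence variance
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

# Simulated draws (illustrative only): 4 chains, 1000 sampling iterations each.
rng = np.random.default_rng(0)
draws = rng.normal(size=(4, 1000))
# Hypothetical problem: chain 0 drifts slowly, the other three are fine.
draws[0] += np.linspace(-1.0, 1.0, 1000)

rhat_per_chain = [split_rhat(chain) for chain in draws]
rhat_all = split_rhat(draws)

print("per-chain Rhat:", np.round(rhat_per_chain, 3))
print("all-chain Rhat:", round(float(rhat_all), 3))
```

In this toy setup the drifting chain is flagged on its own (Rhat above 1.1), while the all-chain value stays below 1.05, because the mean within-sequence variance is dominated by the three well-behaved chains. Whether that is what is happening in my model is exactly what I am unsure about.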
I should also note that this is a model with many estimands of interest (many tens of thousands), so perhaps I should apply some correction to the Rhat threshold, as suggested here, or forget about Rhat completely and use MCSE (Monte Carlo standard error) instead.
Any help would be greatly appreciated!