The future you’re talking about is already here for a lot of us with access to good cluster computing with decent nodes. If you want to see some research along these lines, see the recent samplers from Matt Hoffman (ChEES-HMC and MEADS, in particular).
While @betanalpha is right that this is hard to do correctly in theory, I don’t think there’s much danger in practice. Just run, evaluate, and see if it works. I’d be very surprised if you could see any miscalibration from this that would be noticeable among the rest of the noise. The reason it’s more promising than general early stopping in, say, clinical trials and p-value computation is that ESS tends to increase monotonically with more draws, and the estimates we’re making tend to have residual uncertainty an order of magnitude or more larger than the MCMC standard errors we’re trying to control (at least once we’ve hit ESS = 100, at which point the MCMC standard error is already only about a tenth of the posterior standard deviation).
[Edit: I also meant to add that we do this all the time informally. That is, we run for N iterations and, if ESS is too low, we try running for 2 * N iterations.]
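Here’s a minimal sketch of that doubling loop, just to make it concrete. The AR(1) “sampler” is a stand-in for a real MCMC run (e.g., a Stan fit), arviz is assumed for the ESS estimate, and the ESS target of 100 and the doubling rule are the ones mentioned above rather than recommendations.

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)

def run_chains(n_chains, n_iter, rho=0.9):
    """Stand-in for a real MCMC run: AR(1) chains with lag-1 correlation rho."""
    draws = np.zeros((n_chains, n_iter))
    for c in range(n_chains):
        for t in range(1, n_iter):
            draws[c, t] = rho * draws[c, t - 1] + rng.normal()
    return draws

target_ess = 100
n_iter = 250
while True:
    draws = run_chains(n_chains=4, n_iter=n_iter)    # rerun with the current iteration budget
    ess = az.ess(draws).to_array().item()            # bulk ESS of the scalar quantity of interest
    mcse = draws.std() / np.sqrt(ess)                # MCMC standard error is roughly sd / sqrt(ESS)
    print(f"n_iter = {n_iter:5d}  ESS = {ess:7.1f}  MCSE = {mcse:.3f}")
    if ess >= target_ess:
        break
    n_iter *= 2                                      # ESS too low: double the iterations and rerun
```

The sketch reruns from scratch at each step, which matches the “try 2 * N” habit above.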
I don’t think that continuing this conversation in depth will be particularly constructive at this point, but I do want to add one last comment about the relevance of “theory”.
At this point we have seen multiple decades of methods built on the idea that Bayesian computation just has to be parallelizable. Over and over again these methods come out overhyped and are quickly adopted by practitioners who take their outputs for granted. And over and over again theoretical and empirical testing shows that the methods work well only in exceedingly ideal conditions (and, not coincidentally, exactly where many easier-to-implement approximations are also viable).
The problem is that in most cases we cannot just “see if it works” because we don’t have an exact posterior distribution against which we can compare the computational output. Instead we have to rely on subtle consistency conditions to construct diagnostics, and the robust application of those diagnostics in practice requires that annoying theory. For example, heuristics based on effective sample size are meaningless if a Markov chain Monte Carlo central limit theorem doesn’t hold, and wow does that central limit theorem shy away from complex posterior distributions.
The Markov chain Monte Carlo folklore across applied fields is littered with misleading and fragile heuristics because the theory is taken for granted. For those interested in learning more, you may find my writing at betanalpha.github.io of interest, as well as the many references therein.
Michael:
We have a bunch of examples in posteriordb (https://github.com/stan-dev/posteriordb) that people can try if they want to compare against a known posterior distribution. The reference posteriors there are themselves computed by simulation, so they’re not exact, but that should be fine in practice. I agree that there will always be challenging cases where we’re not sure what the computation is doing, which is one reason our workflow emphasizes fitting multiple models and understanding how inferences for quantities of interest change across models.
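For concreteness, here is a rough sketch of what such a comparison might look like with the posteriordb Python client. The method names (posterior(), reference_draws()), the layout of the reference draws (one parameter-to-draws dict per chain), and the posterior name are assumptions from my reading of that package; the “your sampler” draws are just a placeholder.

```python
import numpy as np
from posteriordb import PosteriorDatabase  # assumes the posteriordb Python package and a local clone

pdb = PosteriorDatabase("path/to/posteriordb/posterior_database")
post = pdb.posterior("eight_schools-eight_schools_noncentered")  # example posterior name

# Reference draws: assumed to be a list of per-chain dicts mapping parameter names to draws.
ref = post.reference_draws()
ref_mu = np.concatenate([np.asarray(chain["mu"], dtype=float) for chain in ref])

# Stand-in for draws of mu from your own sampler run on the same model and data.
my_mu = np.random.default_rng(1).normal(4.0, 3.3, size=4000)

# Compare posterior means on the scale of the reference Monte Carlo error.
ref_mcse = ref_mu.std() / np.sqrt(len(ref_mu))  # crude MCSE that ignores autocorrelation
print(f"reference mean(mu) = {ref_mu.mean():.3f} +/- {ref_mcse:.3f}")
print(f"your mean(mu)      = {my_mu.mean():.3f}")
```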