In order to understand an (approximately) optimal configuration for Markov chain Monte Carlo we really have to understand what Markov chain Monte Carlo is actually doing. Markov chain Monte Carlo uses the sequential states
\{\theta_{0}, \ldots, \theta_{n}, \ldots, \theta_{N} \}
within a Markov chain to construct Markov chain Monte Carlo estimators of target expectation values,
E_{\pi}[f] = \int \mathrm{d} \theta \, \pi(\theta) \, f(\theta) \approx \frac{1}{N + 1} \sum_{n = 0}^{N} f(\theta_n) = \hat{f}.
In order to quantify the practical utility of a Markov chain Monte Carlo estimator we need to quantify how well \hat{f} approximates E_{\pi}[f], although this is much easier said than done. Without additional assumptions there’s not much we can do to quantify this error, but under nice conditions where a central limit theorem holds we can quantify the error using the empirical autocorrelations between the evaluations f(\theta_n) and the total length of the Markov chain. In these circumstances the longer the Markov chain the smaller the estimator error will be – doubling the length of a Markov chain will decrease the error by a factor of \sqrt{2}.
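To make this error quantification concrete, here is a minimal sketch of an autocorrelation-based standard error for \hat{f}, assuming a one-dimensional array of evaluations f(\theta_n); the function name and the crude truncation rule are illustrative assumptions, not any library’s actual implementation.

```python
import numpy as np

def mcmc_standard_error(fs):
    """Estimate the standard error of f_hat = mean(fs) using the
    empirical autocorrelations of the evaluations f(theta_n)."""
    fs = np.asarray(fs, dtype=float)
    N = len(fs)
    centered = fs - fs.mean()
    # Empirical autocovariances at lags 0, ..., N - 1.
    acov = np.correlate(centered, centered, mode='full')[N - 1:] / N
    rhos = acov / acov[0]
    # Truncate at the first non-positive autocorrelation; a crude,
    # illustrative rule, not a production truncation scheme.
    L = 1
    while L < N and rhos[L] > 0:
        L += 1
    # Integrated autocorrelation time and effective sample size.
    tau = 1 + 2 * np.sum(rhos[1:L])
    ess = N / tau
    return np.sqrt(acov[0] / ess)
```

Because the integrated autocorrelation time stabilizes as the chain grows, the returned error scales as 1 / \sqrt{N}, recovering the \sqrt{2} improvement from doubling the chain length.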
At the same time the error quantification provided by a central limit theorem allows us to combine Markov chain Monte Carlo estimators derived from separate Markov chains. Provided that each Markov chain is sufficiently well-behaved, combining the estimators from C chains will decrease the estimator error by a factor of \sqrt{C}.
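As a sketch of this combination, assuming per-chain estimators that each satisfy a central limit theorem, one natural choice is an inverse-variance weighted average; the weighting scheme here is an illustrative assumption, and a simple average gives the same result when the chains have equal lengths.

```python
import numpy as np

def combine_chain_estimators(f_hats, errors):
    """Combine per-chain estimators via an inverse-variance weighted
    average of the per-chain means."""
    weights = 1.0 / np.asarray(errors, dtype=float) ** 2
    f_hat = np.sum(weights * np.asarray(f_hats, dtype=float)) / np.sum(weights)
    error = np.sqrt(1.0 / np.sum(weights))
    return f_hat, error
```

When the C chains have comparable standard errors \sigma the combined error reduces to \sigma / \sqrt{C}, matching the scaling above.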
To review – under ideal conditions we have two ways to decrease the Markov chain Monte Carlo estimator error. We can run one long Markov chain or multiple short Markov chains. Which one is better? There is a long history of people arguing both sides; unfortunately those arguments are largely tainted by people not being clear about their assumptions and hence talking past each other.
Under ideal conditions and a fixed computational budget there shouldn’t be any strong difference between the estimator error from one long Markov chain of length N and that from C shorter Markov chains of length N / C. In practice, however, there are some complications.
For example if the Markov chains are too short then we also have to account for an initialization bias. This bias can be avoided by discarding the initial states from each Markov chain, at the cost of some computational overhead. That overhead, however, will be much smaller in the one long Markov chain scenario than in the many shorter Markov chains scenario. If we have to discard the first W states from each chain then the overhead relative to the N retained states will be W / N for the one long chain but C \cdot W / N for the ensemble of shorter chains. In other words the one long Markov chain will be more performant. The overhead becomes even more considerable when we also have to account for adaptation of the Markov transition.
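This overhead accounting is easiest to see with concrete numbers; the values of W, N, and C below are arbitrary and purely illustrative.

```python
# Warm-up overhead when the first W states are discarded from each chain,
# measured relative to a total of N retained states; W, N, and C are
# arbitrary, illustrative values.
W, N, C = 1000, 10000, 8

overhead_one_chain = W / N       # one chain of length N + W
overhead_ensemble = C * W / N    # C chains of length N / C + W each

print(f"One long chain:    {overhead_one_chain:.0%} overhead")   # 10%
print(f"{C}-chain ensemble: {overhead_ensemble:.0%} overhead")   # 80%
```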
On the other hand, because the shorter Markov chains are independent they can be run in parallel, taking advantage of parallel computing resources that one long Markov chain cannot. Even if the overhead is larger the speedup from parallelization can easily make the ensemble approach more performant. That said, we can take parallelization only so far – each of the individual Markov chains still has to be long enough to avoid an initialization bias, which limits just how much we can distribute the computation.
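A similarly crude wall-clock comparison, assuming that every state costs one unit of time and that the C shorter chains run fully in parallel on C idle cores, shows how parallelization can dominate the extra warm-up overhead; the numbers are again purely illustrative.

```python
# Wall-clock cost of generating N retained states, assuming each state
# costs one unit of time and the C shorter chains run fully in parallel;
# W, N, and C are arbitrary, illustrative values.
W, N, C = 1000, 10000, 8

time_one_chain = W + N        # serial: warm up once, then N states
time_ensemble = W + N / C     # parallel: each core warms up, then N / C states

print(f"One long chain:      {time_one_chain:.0f} time units")   # 11000
print(f"{C} parallel chains:  {time_ensemble:.0f} time units")   # 2250
```

As C grows the fixed warm-up cost W dominates each chain’s runtime, which is exactly the limit on distributing the computation noted above.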
Of course all of these considerations hold only under ideal circumstances, which can’t be taken for granted in practice. One huge benefit of running multiple Markov chains is that the ensemble is sensitive to violations of those ideal circumstances. If everything is nice enough then the terminal states (i.e. everything but the initial states that we remove per the discussion above) of all of the Markov chains should behave the same, and any inconsistencies both identify failures of those ideal circumstances and cast doubt on any empirical error quantification.
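One common way to formalize this consistency check is a potential scale reduction statistic; below is a minimal sketch of the basic Gelman-Rubin version, assuming C chains of equal length with the initial states already removed (Stan itself uses a more refined split-\hat{R}).

```python
import numpy as np

def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for a (C, N) array of evaluations
    f(theta_n) with the initial states already removed."""
    chains = np.asarray(chains, dtype=float)
    C, N = chains.shape
    chain_means = chains.mean(axis=1)
    # Between-chain variance of the per-chain means.
    B = N * np.var(chain_means, ddof=1)
    # Average within-chain variance.
    W = np.mean(np.var(chains, axis=1, ddof=1))
    # Pooled variance estimate that overestimates the target variance
    # when the chains have not yet mixed.
    var_plus = (N - 1) / N * W + B / N
    return np.sqrt(var_plus / W)
```

Values much larger than 1 indicate that the chains are not behaving the same, flagging exactly the kind of failure described above.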
So when the ideal circumstances can’t be taken for granted, and we have access to parallelization resources, running at least a few Markov chains will tend to lead to more robust, if not more performant, Markov chain Monte Carlo estimation. With Stan’s default Markov chain configuration running more than ten Markov chains is unlikely to offer much benefit unless the target distribution is particularly nasty; for example if there are more than ten modes then more than ten Markov chains with sufficiently diffuse initializations may be needed to even find all of the modes.
At the beginning of Stan’s development multiple cores were becoming common in commodity hardware, and most newer computers were shipping with at least four physical cores. This is what motivated the four Markov chain default that persists to this day.