Multi-chain vs single-chain

The first link you provide is mostly about multithreading within chains, i.e. using several threads to evaluate a single chain. This is a way to speed up individual chains, but it is separate from running multiple chains. In particular, the graph in the post shows performance against the number of threads per chain, with a plateau once the author reaches 4 threads per chain for 4 chains. He rightly explains the plateau by the fact that his computer has 16 cores.

There are a number of benefits to running multiple chains, but I’ll just share a primary one:

Multiple chains enable diagnostics such as R-hat to allow us to probe the validity of the sampler. The goal of any MCMC method is to generate samples from the target distribution \pi(\theta); however, we can’t ever really know in general whether our Markov chain has reached stationarity, i.e. the samples are from \pi(\theta). We have some theorems in our back pocket that tell us the samples will be asymptotically valid, but we are never in that asymptotic regime so we have to use heuristics and diagnostics to justify whether our samples are from \pi(\theta).

One method to do this is to run multiple chains and observe whether or not they have mixed. We initialize multiple chains from different points and if after some time it looks like all of the chains are generating samples from the same distribution, we can take this as a signal that all the chains have reached the same stationary distribution. R-hat measures this mixing behavior, where \hat{R} \approx 1 provides evidence in favor of mixing. Granted, in the presence of multimodal distributions even this is not necessarily a guarantee, but it’s a good start.
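
To make the idea concrete, here is a minimal sketch of the basic (non-split) Gelman-Rubin R-hat for a single scalar parameter. It is only an illustration: Stan reports a more refined split, rank-normalized variant, and the `r_hat` and `toy_chain` functions below are hypothetical helpers of mine, with independent normal draws standing in for real sampler output.

```python
import random
import statistics


def r_hat(chains):
    """Basic (non-split) Gelman-Rubin R-hat for equal-length chains
    of draws of one scalar parameter. Not Stan's exact estimator."""
    n = len(chains[0])  # draws per chain
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances
    w = statistics.fmean(statistics.variance(c) for c in chains)
    # B: variance of the chain means, scaled by the chain length
    b = n * statistics.variance(chain_means)
    # Pooled estimate of the target variance
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5


def toy_chain(center, n, rng):
    # Stand-in for a sampler: independent normal draws around `center`.
    return [rng.gauss(center, 1.0) for _ in range(n)]


rng = random.Random(0)
# Four chains that all explore the same region.
mixed = [toy_chain(0.0, 1000, rng) for _ in range(4)]
# Four chains where two are stuck around a different value.
stuck = [toy_chain(c, 1000, rng) for c in (0.0, 0.0, 3.0, 3.0)]

r_hat_mixed = r_hat(mixed)
r_hat_stuck = r_hat(stuck)
print(f"chains agree:    R-hat = {r_hat_mixed:.3f}")
print(f"chains disagree: R-hat = {r_hat_stuck:.3f}")
```

When the chains agree, the between-chain variance is small relative to the within-chain variance and the ratio sits close to 1. When some chains are stuck elsewhere, the pooled variance is inflated and R-hat rises well above 1. With only one chain there is no between-chain term to compare against, which is why this diagnostic needs multiple chains.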

So by using multiple chains, Stan gives us additional information that lets us make a better-informed decision about whether to trust the samples.

As to why the default is 4 chains, I imagine there’s some history there I’m unaware of. At the very least, 4-core CPUs are fairly standard, so it seems like a reasonable baseline.