The problem Dan brings up is that one or two chains might consume all of your memory. If you have a small computer, or if you have a lot of data, or if your model builds a very large expression graph during autodiff, you can run out of memory before firing up four parallel chains. In these cases, running multi-threaded or even multi-process can be a win, because threads can share memory and the spawned processes may only have to deal with a single shard of the data (be it a block of a matrix in a multiplication or a fold of data in a distributed likelihood calculation).
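To make the sharding idea concrete, here's a minimal sketch (not Stan's actual implementation) of splitting a likelihood computation across processes, assuming a simple iid normal model with known unit variance; the function names are illustrative. Because the log likelihood factors over independent observations, each process can sum its own shard and the partial sums just add.

```python
# Sketch: sharding an iid log-likelihood computation across processes.
# Assumes a toy Normal(mu, 1) model; names here are illustrative.
import math
from multiprocessing import Pool


def shard_log_lik(args):
    """Log likelihood of one shard of data under Normal(mu, 1)."""
    shard, mu = args
    return sum(-0.5 * (y - mu) ** 2 - 0.5 * math.log(2 * math.pi)
               for y in shard)


def log_lik(data, mu, n_procs=4):
    """Split data into n_procs shards; each process handles one shard."""
    shards = [data[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        partials = pool.map(shard_log_lik, [(s, mu) for s in shards])
    # The log likelihood factors over iid observations, so partial
    # sums from the shards simply add.
    return sum(partials)


if __name__ == "__main__":
    data = [0.1, -0.3, 0.7, 1.2, -0.5, 0.0, 0.9, -1.1]
    print(log_lik(data, mu=0.0))
```

Each worker only needs its shard in memory, which is the point: the full data set never has to sit in any single process.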
This isn’t usually a concern for someone with 8GB or 16GB of memory and four cores fitting models that consume all of their local machine’s power. I’m probably not going to try to fit a model with 2GB of data on my notebook. But I do realize other people have smaller notebooks. Dan had a MacBook Air when I saw him last week; that might not even be powerful enough to compile Stan without swapping.
On the other end, we also run into scalability issues. If you have sixteen cores on your local machine, running sixteen chains won’t get you much benefit in wall time to a usable answer. The problem there is that we want to run four chains for diagnostic purposes, but sixteen just increases our marginal effective sample size per unit time.
If the goal is a large effective sample size, MCMC is embarrassingly parallel. If the goal is to find a single effective draw from the posterior (an effective sample size of one), running in parallel doesn’t help (other than the critical role of diagnosing cases where you don’t have a single effective draw).
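Here's a toy sketch of the embarrassingly-parallel point, using independent random-walk Metropolis chains targeting a standard normal as a stand-in for a posterior (this is illustrative, not Stan's sampler). Pooling draws from K chains gives you roughly K times the effective sample size, but every chain still has to pay the same warmup cost before producing its first effective draw, which is why parallelism doesn't help the ESS-of-one goal.

```python
# Sketch: pooling draws from independent chains is embarrassingly
# parallel. Each chain is a random-walk Metropolis sampler targeting
# Normal(0, 1), a toy stand-in for a posterior.
import math
import random


def metropolis_chain(n_draws, seed, scale=2.4):
    """One random-walk Metropolis chain targeting Normal(0, 1)."""
    rng = random.Random(seed)
    x, draws = 0.0, []
    for _ in range(n_draws):
        prop = x + rng.gauss(0.0, scale)
        # Accept with probability min(1, p(prop)/p(x)) for p = N(0, 1),
        # i.e. log ratio 0.5 * (x^2 - prop^2).
        if math.log(rng.random()) < 0.5 * (x * x - prop * prop):
            x = prop
        draws.append(x)
    return draws


if __name__ == "__main__":
    # Four chains with different seeds; the chains never communicate,
    # so they could run on four cores (or four machines). Pooling is
    # just concatenation.
    chains = [metropolis_chain(5000, seed) for seed in range(4)]
    pooled = [d for chain in chains for d in chain]
    mean = sum(pooled) / len(pooled)
    print(f"pooled draws: {len(pooled)}, posterior mean estimate: {mean:.3f}")
```

Since the chains share nothing, doubling the number of chains doubles the pooled effective sample size for the same wall time, but it does nothing to shorten any single chain's path to its first effective draw.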
P.S. Dan’s not disagreeing about what’s built into Stan. The capability to run parallel chains is built into RStan (and presumably PyStan; it’s easy to run in CmdStan).