As I understand it, parallelism has taken great strides in the last year or so on three fronts: GPU, MPI and threading.
The GPU stuff requires at least one GPU per chain, so you either sample chains serially or shell out for multiple GPUs (which may be tenable for some, especially if renting from a cloud compute service). And thanks to data-transfer overheads, the GPU accelerations are only worth it for certain model/data combos, i.e. those dominated by large matrix operations. But for models like Gaussian processes, it's really exciting.
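For anyone who wants to try the GPU route, here's a rough sketch of how I understand the CmdStan setup works via OpenCL; the model path is hypothetical and the exact flag names may differ by CmdStan version, so check the docs for your release:

```shell
# Enable OpenCL (GPU) support in CmdStan's make/local before compiling.
# Platform/device IDs select which GPU to use; 0/0 is just an assumption
# for a single-GPU machine.
echo "STAN_OPENCL=true" >> make/local
echo "OPENCL_PLATFORM_ID=0" >> make/local
echo "OPENCL_DEVICE_ID=0" >> make/local

# Recompile the model (hypothetical path) so the GPU kernels get built in.
make path/to/my_gp_model
```

Note that with one GPU you'd still run chains one after another against it, which is part of the tradeoff mentioned above.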
I haven’t looked into the threading or MPI stuff much, but I gather that threading is for accelerating on a single local machine where you have at least two physical cores per chain available (though has anyone checked whether hyper-threading helps at all in the one-physical-core-per-chain case?), while MPI is for spreading work across a cluster.
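For the threading case, my understanding of the CmdStan workflow is roughly the following; the model path is hypothetical and you should double-check the flag and environment-variable names against your CmdStan version:

```shell
# Enable threading support at compile time in CmdStan's make/local.
echo "STAN_THREADS=true" >> make/local
make path/to/my_model   # hypothetical model; recompile after changing flags

# Tell the runtime how many threads each chain may use
# (4 is just an assumption; match it to your free physical cores).
export STAN_NUM_THREADS=4
./path/to/my_model sample data file=my_data.json
```

As I understand it, the threading only helps if the model itself exposes parallelizable work (e.g. via map_rect over shards of the data), so it's not a free speedup for arbitrary models.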
There are also some pretty important speedups coming in the form of parallel warmup, where separate chains share information about the geometry of the posterior, allowing all chains to get to sampling sooner.
If you don’t bother with any of the above and just throw lots of chains on lots of cloud cores, you definitely don’t want to chase lots of effective samples per chain. Some might naively think that more samples is always better, but past a few thousand effective samples you’re really not gaining much inferentially and are just giving yourself post-processing headaches. Sometimes folks have models that sample inefficiently and try to overcome this by grabbing huge numbers of draws, but an inefficiently sampling model is usually a sign that something’s wrong and the model’s structure/parameterization needs to be reconsidered. With an efficiently sampling model, you could throw it on lots of independent cores and grab only a few post-warmup samples per chain, but each chain still has to go through the entire warmup, which is pretty wasteful.
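To make the diminishing-returns point concrete: the Monte Carlo standard error of a posterior-mean estimate shrinks like one over the square root of the effective sample size, so a 10x increase in ESS only buys you about a 3x tighter estimate. A minimal sketch (assuming a unit-scale posterior standard deviation for illustration):

```python
import math

def mcse(posterior_sd: float, ess: int) -> float:
    """Monte Carlo standard error of a posterior-mean estimate:
    posterior sd divided by sqrt(effective sample size)."""
    return posterior_sd / math.sqrt(ess)

sd = 1.0  # assumed unit-scale posterior, purely for illustration
for ess in (100, 1_000, 4_000, 40_000):
    print(f"ESS={ess:>6}: MCSE = {mcse(sd, ess):.4f}")
```

Going from 4,000 to 40,000 effective samples shrinks the MCSE from about 0.016 to 0.005, which is rarely worth a 10x compute bill.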