Plans for parallelization

If you want a large effective sample size per unit of wall time, more chains will do that for you in an embarrassingly parallel fashion.
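As a toy illustration of that embarrassing parallelism (this is not Stan; the random-walk "chain" and the chain/draw counts here are invented for the example), independent chains can be farmed out to separate processes and their draws pooled afterward:

```python
import multiprocessing as mp
import random

def run_chain(args):
    """Toy stand-in for one MCMC chain: a serial random walk
    producing n draws.  Each chain is independent of the others."""
    seed, n = args
    rng = random.Random(seed)
    x, draws = 0.0, []
    for _ in range(n):
        x += rng.gauss(0, 1)
        draws.append(x)
    return draws

if __name__ == "__main__":
    n_chains, n_draws = 4, 1000
    # Chains share nothing, so they map cleanly onto a process pool.
    with mp.Pool(n_chains) as pool:
        chains = pool.map(run_chain, [(s, n_draws) for s in range(n_chains)])
    # Total draws (and hence effective sample size per wall time)
    # scale linearly with the number of chains.
    print(len(chains), len(chains[0]))
```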

If you want the fastest wall time to an effective sample size of 100, that can’t be done by parallelizing with more chains. The real bottleneck there is warmup, which every chain has to run on its own.
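To see why, here’s a back-of-the-envelope model (all the numbers are invented): if each chain needs W warmup iterations and hitting the target effective sample size needs D post-warmup draws in total, then with C chains the wall time in iterations is roughly W + D / C, which is floored at W no matter how large C gets.

```python
def wall_time_iters(warmup, total_draws_needed, n_chains):
    """Wall time in iterations: every chain pays warmup in full,
    but the post-warmup draws are split across chains."""
    return warmup + total_draws_needed / n_chains

# More chains shrink only the second term; warmup is the floor.
for c in (1, 4, 16, 64):
    print(c, wall_time_iters(1000, 1000, c))
```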

The chains can’t be spread over more CPUs because each chain is serial (hence the name).

But as the other responders point out, what can be parallelized is the implementation of the log density and gradient.
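In spirit, that’s a parallel sum over shards of the data: the log density is a sum of per-observation terms, so the partial sums (and their gradients) can be computed concurrently and then combined. A minimal sketch of the shape of that computation, with a made-up normal log likelihood and invented data (Stan does the real work internally, with autodiff for the gradients):

```python
import math
from concurrent.futures import ProcessPoolExecutor

def shard_loglik(args):
    """Log likelihood of one shard of the data under Normal(mu, 1).
    Each shard is summed independently, then the shards are combined."""
    ys, mu = args
    return sum(-0.5 * (y - mu) ** 2 - 0.5 * math.log(2 * math.pi)
               for y in ys)

def parallel_loglik(ys, mu, n_shards=4):
    """Split the data into shards, evaluate them in worker
    processes, and add up the partial sums."""
    shards = [ys[i::n_shards] for i in range(n_shards)]
    with ProcessPoolExecutor(n_shards) as ex:
        return sum(ex.map(shard_loglik, [(s, mu) for s in shards]))

if __name__ == "__main__":
    data = [0.1 * i for i in range(1000)]
    print(parallel_loglik(data, mu=0.5))
```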

That’s RStan, and it actually requires more than that in terms of dynamic memory overhead for copies.

CmdStan can stream its output to disk, so there’s no memory overhead beyond the data itself, which is rarely a bottleneck. RStan can stream data out, but I’m still not sure whether it can turn off accumulating the draws in memory. I don’t know about PyStan; if it can’t, feel free to open a feature request.
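The streaming behavior is easy to picture: write each draw to disk the moment it’s produced and keep nothing but the sampler’s current state in memory. A sketch with a toy sampler and an invented file name (CmdStan’s actual CSV output has more columns and header metadata):

```python
import csv
import random

def stream_draws(path, n_draws, seed=0):
    """Write one draw per row as it is produced.  Memory stays
    O(1) in the number of draws, unlike accumulating everything
    in RAM and writing at the end."""
    rng = random.Random(seed)
    x = 0.0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x"])          # header row
        for _ in range(n_draws):
            x += rng.gauss(0, 1)        # only the current state is held
            writer.writerow([x])

if __name__ == "__main__":
    stream_draws("draws.csv", 1000)
```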