I’m a huge fan of Stan, but (what appears to be) a limit to use 4 cores only is pretty restrictive in the world of parallel cloud computing. I saw a few posts discussing increases to this from Feb 2017, but I wondered if there was an update on progress/plans in this area.
Also, it seems to me that if the dataset is “too big” then only one core will be utilised. Has anyone else encountered this? I’ve tested and if I reduce the dataset size I get to a point where all cores will be utilised.
MPI is coming and will hopefully land in the next release. This will give you a
map_rect function for which @Bob_Carpenter has posted on discourse as to how it will look exactly. If you search this forum a bit you will find posts from me showing what speedups you can get. It won’t be nice to program it, but for compute intensive models you will greatly benefit.
Once it’s in Stan and your problem amends to MPI, then it is a matter of buying good hardware with a fast Infiniband network if you can.
There’s no limit to the number of cores. 4 cores is just a default.
As I understand it, when you do N parallel chains and your data takes up X memory, stan needs to allocate N*X memory; if your system doesn’t have that much, it will start paging to disk and one or more chains will consequently be slower. If you look at a process monitor like
top in unix, you should still see all the chains churning though.
Thanks for the response.
I’m using the PyStan interface, and was attempting to use the n_jobs=-1 input to use all CPUs, but it doesn’t use more than 4…>?
Are you talking about using more than 4 chains? I want to stick with 4 chains but spread over more CPUs.
Stan currently only allows for each chain to run in parallel; parallelization within a chain will come with the MPI stuff to which wds15 refers.
Ok great, thanks for the info.
If you want a large effective sample size per wall time, more chains will do that for you in an embarassingly parallel fashion.
If you want the fastest wall time to an effective sample size of 100, that can’t be done by parallelizing with more chains. The real bottleneck there is warmup.
The chains can’t be spread over more CPUs because each chain is serial (hence the name).
But as the other responders point out, what can be parallelized is the implementation of the log density and gradient.
That’s RStan and it actually requires more than that in terms of dynamic overhead for copies.
CmdStan can stream data out so there’s no memory overhead other than the data, which is rarely a bottleneck. RStan can stream out data, but I’m still not sure if it can turn off accumulating it in memory. I don’t know about PyStan, but if it doesn’t, feel free to open a feature request.
Thanks for info @Bob_Carpenter.
My general understanding on what’s going on when training these models is limited, so I’ll so a little extra reading before getting back.
Thanks to all for the advice!