Plans for parallelization

Mike_Terrell · March 23, 2018, 10:46am

Hi,

I’m a huge fan of Stan, but (what appears to be) a limit to use 4 cores only is pretty restrictive in the world of parallel cloud computing. I saw a few posts discussing increases to this from Feb 2017, but I wondered if there was an update on progress/plans in this area.

Also, it seems to me that if the dataset is “too big” then only one core will be utilised. Has anyone else encountered this? I’ve tested and if I reduce the dataset size I get to a point where all cores will be utilised.

Thanks!
Mike

wds15 · March 23, 2018, 12:25pm

MPI is coming and will hopefully land in the next release. This will give you a map_rect function for which @Bob_Carpenter has posted on discourse as to how it will look exactly. If you search this forum a bit you will find posts from me showing what speedups you can get. It won’t be nice to program it, but for compute intensive models you will greatly benefit.

Once it’s in Stan and your problem amends to MPI, then it is a matter of buying good hardware with a fast Infiniband network if you can.

jjramsey · March 23, 2018, 12:34pm

There’s no limit to the number of cores. 4 cores is just a default.

mike-lawrence · March 23, 2018, 1:05pm

As I understand it, when you do N parallel chains and your data takes up X memory, stan needs to allocate N*X memory; if your system doesn’t have that much, it will start paging to disk and one or more chains will consequently be slower. If you look at a process monitor like top in unix, you should still see all the chains churning though.

Mike_Terrell · March 23, 2018, 1:55pm

Thanks for the response.

I’m using the PyStan interface, and was attempting to use the n_jobs=-1 input to use all CPUs, but it doesn’t use more than 4…>?

Are you talking about using more than 4 chains? I want to stick with 4 chains but spread over more CPUs.

Mike_Terrell · March 23, 2018, 1:56pm

Thanks Mike!

mike-lawrence · March 23, 2018, 2:14pm

Stan currently only allows for each chain to run in parallel; parallelization within a chain will come with the MPI stuff to which wds15 refers.

Mike_Terrell · March 23, 2018, 2:28pm

Ok great, thanks for the info.

Bob_Carpenter · March 25, 2018, 9:03pm

If you want a large effective sample size per wall time, more chains will do that for you in an embarassingly parallel fashion.

If you want the fastest wall time to an effective sample size of 100, that can’t be done by parallelizing with more chains. The real bottleneck there is warmup.

The chains can’t be spread over more CPUs because each chain is serial (hence the name).

But as the other responders point out, what can be parallelized is the implementation of the log density and gradient.

That’s RStan and it actually requires more than that in terms of dynamic overhead for copies.

CmdStan can stream data out so there’s no memory overhead other than the data, which is rarely a bottleneck. RStan can stream out data, but I’m still not sure if it can turn off accumulating it in memory. I don’t know about PyStan, but if it doesn’t, feel free to open a feature request.

Mike_Terrell · March 26, 2018, 11:22am

Thanks for info @Bob_Carpenter.

My general understanding on what’s going on when training these models is limited, so I’ll so a little extra reading before getting back.

Thanks to all for the advice!

Topic		Replies	Views
Does Stan use multiple cores for multiple chains simultaneously even if we don't specify cores = 4 for chains =4? General	3	481	March 26, 2020
What is the main bottleneck in parallelizing Stan? Developers	6	909	September 27, 2019
Running cmdstanr in parallel on computing cluster General	6	985	December 9, 2022
Extending reduce_sum to use MPI Developers	5	483	June 1, 2021
Running MPI General performance	9	1949	August 24, 2018

Plans for parallelization

Related topics