Map_rect tutorial performance comparison - small vs. big machine

Hello,

I have been running the tutorial https://github.com/rmcelreath/cmdstan_map_rect_tutorial successfully, and it does use more cores. However, the execution time remains the same.

With NO MPI

~/cmdstan-2.18.1/redcards$ time ./logistic0 sample data file=redcard_input.R

Gradient evaluation took 0.030967 seconds
1000 transitions using 10 leapfrog steps per transition would take 309.67 seconds.

[...]

real    5m28.342s
user    5m28.069s
sys     0m0.096s

With MPI (I gave it 3 cores, but 4 cores are active?)

~/cmdstan-2.18.1/redcards$ time ./logistic1 sample data file=redcard_input.R

Gradient evaluation is 3 times slower (is that because I gave it 3 cores and it is summing the times? If so, that is confusing).

Gradient evaluation took 0.090448 seconds
1000 transitions using 10 leapfrog steps per transition would take 904.48 seconds.

real    6m34.745s
user    18m1.636s
sys     1m2.689s

System info

[screenshot of system info]


The makefiles for MPI are messed up in 2.18.1. I forget where the patch for that is… an option is to use develop, if that is an option for you… while develop is supposed to work just fine, it is unreleased software…

Sorry for that.

Thanks.

Is this likely to be the case for rstan too? Meaning that I should wait until the next release of Stan Math (or cmdstan/rstan) is out?

@wds15 What about CmdStan 2.18.0? I'm seeing a similar dramatic slowdown when STAN_NUM_THREADS > 1.

I've tried both macOS 10.14.3 and Ubuntu 18.04.2.

I'm adding CXXFLAGS += -DSTAN_THREADS and CXXFLAGS += -pthread to the make/local file of CmdStan before I call make build on CmdStan as well as make on the .stan file.

Is there a step I'm missing?

Edit: I also tried the CmdStan develop branch. Same thing: a dramatic slowdown when more than one thread is used. I'm running the reference model, by the way (https://mc-stan.org/docs/2_18/stan-users-guide/example-mapping-logistic-regression.html).

Please use cmdstan 2.18.1 to get the fix for map_rect for the threading case.

Bear in mind that map_rect was written for models which use ODEs or similarly super-expensive things to calculate. Where it doesn't perform well is in cases where you have to give up vectorization and replace it with a parallel map_rect call.

If you want to have a chance at getting some speedup, you have to write your code such that vectorization is still used inside map_rect. In that case you may see speedups for complicated likelihoods and large data.
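Very roughly, a sketch of what I mean (the binomial-logit likelihood, the packing of the data into xi/xr, and all names here are only an illustration in the spirit of the tutorial, not a drop-in replacement for it):

functions {
  // Log-likelihood of one shard. The single binomial_logit_lpmf call is
  // vectorized over all observations of the shard, so every core still
  // does vectorized work on its chunk of the data.
  vector lp_reduce(vector beta, vector theta, real[] xr, int[] xi) {
    int n = size(xi) / 2;          // xi packs successes first, then trials
    int y[n];
    int m[n];
    real lp;
    y = xi[1:n];
    m = xi[(n + 1):(2 * n)];
    lp = binomial_logit_lpmf(y | m, beta[1] + to_vector(xr) * beta[2]);
    return [lp]';
  }
}
data {
  int<lower=1> K;                  // number of shards
  int<lower=1> n_per_shard;        // observations per shard (equal-sized shards)
  int xi[K, 2 * n_per_shard];      // per shard: successes, then trials
  real xr[K, n_per_shard];         // per shard: one predictor
}
transformed data {
  vector[0] theta[K];              // no shard-specific parameters in this sketch
}
parameters {
  vector[2] beta;                  // shared intercept and slope
}
model {
  beta ~ normal(0, 2);
  target += sum(map_rect(lp_reduce, beta, theta, xr, xi));
}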

We are working on hopefully also speeding up the vectorization case with threading. That will take a while.

RStan does not support MPI, but you can use threading with it.

Thanks @wds15. Not what I wanted to hear, but that makes sense.

Sorry to ask for more clarification, but, surely due to my lack of knowledge, I cannot reconcile these two statements:

The problem is that in the tutorial there is a speedup:

[screenshots of the tutorial's timings, before and after]

while I cannot see any improvement on my machine. I just wanted to know whether I have to sit tight until the next release (of math, cmdstan, rstan)?

Thanks a lot!

P.S. I often use hierarchical models where I vectorise one dimension of the data (replicates) and loop over the other (genes), so for me this could be a game changer :)

Hi!

Your case sounds like it could be amenable to a speedup. The situation is:

  • 2.18.0 has a stupid bug in the threading code of map_rect which can hit you in rare cases. That is fixed in 2.18.1 (which is on CRAN).
  • 2.18.0 & 2.18.1 have somewhat broken makefiles for MPI; and MPI map_rect only works with cmdstan, to my knowledge.

…but you could try to get MPI working with 2.18.1 by putting the following into make/local:

CXX=mpicxx
CC=mpicxx
STAN_MPI=true
CXXFLAGS+=-DSTAN_MPI

I haven't tested that, but it could work… let me know if that does it, should you try.
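If it compiles, the MPI-enabled binary would then typically be launched through your MPI runtime rather than directly, along the lines of (the launcher name and process count depend on your MPI installation):

mpirun -np 4 ./logistic1 sample data file=redcard_input.R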

Best,
Sebastian

I tried the exact tutorial on a machine with 8+ cores (there are 7 shards in the data), and I was able to see an increase in performance.

NO MPI

real    5m14.365s
user    5m14.121s
sys     0m0.145s

YES MPI

real    2m21.317s
user    12m30.617s
sys     2m30.149s

For a total of 7x the cores and a 2.2x efficiency gain.

The load was rarely at 700%, and mostly at 200-300%, as you can imagine.

System:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             40
NUMA node(s):          3
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-4650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2199.009
BogoMIPS:              4399.99
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13
NUMA node1 CPU(s):     14-27
NUMA node2 CPU(s):     28-39

Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.1/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.1


One question remains though: is having more shards than cores detrimental to performance? I will try to test it.


Sounds like my makefile hack from above works?

The question of shards is difficult. The current implementation just splits the shards into equally sized blocks, and the number of blocks depends on the number of MPI processes. The shard order determines which MPI process will handle a given shard.
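So, for example, with 8 shards and 4 MPI processes each process would be assigned a contiguous block of 2 shards, in the order the shards appear in the data.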

We will hopefully integrate the Intel TBB at some point. Then the threading case will have some clever queuing implemented and users won't need to worry about the shard number anymore… but that will take a while (this is not all settled yet).

Actually, I kept the original make/local suggested in the tutorial:

CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

Appending your flags produced an error (I haven't tried your flags on their own).

That would make the whole thing convenient to implement. It would be amazing to be able to do:

for(x in 1:X) data ~ distr_parallel(.., ..)
... (and in the same code)
for(y in 1:Y) target += distr_lpdf_parallel(data | .., ..) 

One year?

@wds15 A test on another server:

NO MPI

Gradient evaluation took 0.05 seconds

real    4m21.414s
user    4m20.811s
sys     0m0.042s

YES MPI

Gradient evaluation took 0.34 seconds (!!)

real    6m55.186s
user    34m22.194s
sys     2m53.557s


Again, I have a loss in performance. Small or big, different machines give quite heterogeneous results. Can you spot some critical differences between the two systems that could explain the failure to gain efficiency? (e.g., cores per socket, CPU family, …)

System

unix309 501 %  lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                46
On-line CPU(s) list:   0-45
Thread(s) per core:    1
Core(s) per socket:    23
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2593.993
BogoMIPS:              5187.98
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-45

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

Matrix products: default
BLAS: /stornext/System/data/bioinf-centos6/bioinf-software/bioinfsoftware/R/R-3.5.1/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/bioinf-centos6/bioinf-software/bioinfsoftware/R/R-3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

I also tried to use just your make/local

CXX=mpicxx
CC=mpicxx
STAN_MPI=true
CXXFLAGS+=-DSTAN_MPI

but the compilation fails

A wild guess here is that the MPI system is badly configured on some of these machines? Do you get more reliable results with threading?

Possibly my makefile suggestion also needs the pthreads thing (which is very system specific).

I am hesitant to give predictions on time-estimates of an enhanced map_rect.

I was wondering what threading is. It is obviously not the usual chains-in-parallel (?), and I suppose it is a kind of MPI alternative for within-chain parallelisation? Is there a tutorial on how to use threading instead of MPI?

A decently big wall for users adopting map_rect is the lack of knowledge about which system goes with which makefile settings. I will try to share my experience to help build some sort of "startup guideline".

Within-chain parallelisation works with MPI or threading. These are different techniques for parallelism. MPI will take precedence over threading, and it is not recommended to switch on both at the same time. So if you use MPI, then you should make sure that no "-DSTAN_THREADS" is popping up in your makefiles.

Sorry about that… this is why I keep saying that threading is simpler to get going. Writing a "startup guideline" would be great, and I am happy to comment. You are very much welcome to suggest additions to our documentation to keep future users from running into walls.
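For the threading route with cmdstan, a minimal sketch (using the flags and file names already mentioned in this thread; the thread count is just an example) would be:

# make/local (threading only, no MPI flags)
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

# rebuild cmdstan and the model, then run with, e.g., 4 threads
export STAN_NUM_THREADS=4
./logistic1 sample data file=redcard_input.R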

Sorry to butt in here, but I have a question about this bug and versions. I'm on macOS Mojave and I've installed rstan 2.18.2. However, when I check the version of my underlying Stan installation, I get 2.18.0, as follows:

packageVersion("rstan")
[1] ‘2.18.2’
rstan::stan_version()
[1] "2.18.0"

So… I'm confused here as to whether I'm susceptible to the map_rect bug or not. Do I need to update the underlying Stan version, and if so, how? (I've searched and cannot find such instructions beyond updating the rstan version.)

Please check the version of the StanHeaders package… that has to be at least 2.18.1 to be safe.

OK, I updated StanHeaders to 2.18.1, then reinstalled rstan just to be sure. Now it looks like this:

packageVersion("rstan")
[1] ‘2.18.2’
packageVersion("StanHeaders")
[1] ‘2.18.1’
rstan::stan_version()
[1] "2.18.0"

Am I good?

I think yes… though I don't quite know what rstan::stan_version() is for and how it gets the 2.18.0 shown. The StanHeaders version looks fine to me.

stan_version() is a function in the rstan package to determine your Stan version. I had a peek at the code and it's this:

> rstan::stan_version
function () 
{
    .Call(CPP_stan_version)
}

So I guess it's telling me the C++ version of Stan installed??