Map_rect tutorial performance comparison - small vs. big machine

Hello,

I have been running the tutorial https://github.com/rmcelreath/cmdstan_map_rect_tutorial successfully, and it does use more cores. However, the execution time remains the same.

With NO MPI

~/cmdstan-2.18.1/redcards$ time ./logistic0 sample data file=redcard_input.R

Gradient evaluation took 0.030967 seconds
1000 transitions using 10 leapfrog steps per transition would take 309.67 seconds.

[...]

real    5m28.342s
user    5m28.069s
sys     0m0.096s

With MPI (I gave it 3 cores, but 4 cores are active?)

~/cmdstan-2.18.1/redcards$ time ./logistic1 sample data file=redcard_input.R

Gradient evaluation is 3 times slower (is that because I gave it 3 cores and it is summing the times? If so, that is confusing).

Gradient evaluation took 0.090448 seconds
1000 transitions using 10 leapfrog steps per transition would take 904.48 seconds.

real    6m34.745s
user    18m1.636s
sys     1m2.689s

System info

[screenshot of system info]


The makefiles for MPI are messed up in 2.18.1. I forget where the patch for that is… an option is to use develop, if that is an option for you… while develop is supposed to work just fine, it is unreleased software…

Sorry for that.

Thanks.

Is this likely to be the case for rstan too? Meaning that I should wait until the next release of Stan Math (or cmdstan/rstan) is out?

@wds15 What about CmdStan 2.18.0? I'm seeing a similar dramatic slowdown when STAN_NUM_THREADS > 1.

I've tried both macOS 10.14.3 and Ubuntu 18.04.2.

I'm adding CXXFLAGS += -DSTAN_THREADS and CXXFLAGS += -pthread to the make/local file of CmdStan before I call make build on CmdStan as well as make on the .stan file.

Is there a step I'm missing?

Edit: I also tried the CmdStan develop branch. Same thing: a dramatic slowdown when more than one thread is used. I'm running the reference model, by the way (https://mc-stan.org/docs/2_18/stan-users-guide/example-mapping-logistic-regression.html).

Please use cmdstan 2.18.1 to get the fix for map_rect for the threading case.

Bear in mind that map_rect was written for models which use ODEs or similarly super-expensive things to calculate. Where it doesn't perform well is in cases where you have to give up vectorization and replace it with a parallel map_rect call.

If you want to have a chance at getting some speedup, you have to write your code such that vectorization is still used inside map_rect. In that case you may see speedups for complicated likelihoods and large data.
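Very roughly, a sketch of what I mean (the binomial-logit likelihood, the packing of the data into xi/xr, and all names here are only an illustration in the spirit of the tutorial, not a drop-in replacement for it):

functions {
  // Log-likelihood of one shard. The single binomial_logit_lpmf call is
  // vectorized over all observations of the shard, so every core still
  // does vectorized work on its chunk of the data.
  vector lp_reduce(vector beta, vector theta, real[] xr, int[] xi) {
    int n = size(xi) / 2;          // xi packs successes first, then trials
    int y[n];
    int m[n];
    real lp;
    y = xi[1:n];
    m = xi[(n + 1):(2 * n)];
    lp = binomial_logit_lpmf(y | m, beta[1] + to_vector(xr) * beta[2]);
    return [lp]';
  }
}
data {
  int<lower=1> K;                  // number of shards
  int<lower=1> n_per_shard;        // observations per shard (equal-sized shards)
  int xi[K, 2 * n_per_shard];      // per shard: successes, then trials
  real xr[K, n_per_shard];         // per shard: one predictor
}
transformed data {
  vector[0] theta[K];              // no shard-specific parameters in this sketch
}
parameters {
  vector[2] beta;                  // shared intercept and slope
}
model {
  beta ~ normal(0, 2);
  target += sum(map_rect(lp_reduce, beta, theta, xr, xi));
}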

We are working on hopefully also speeding up the vectorization case with threading. That will take a while.

RStan does not support MPI, but you can use threading with it.

Thanks @wds15. Not what I wanted to hear, but that makes sense.

Sorry to ask for more clarification, but, surely due to my lack of knowledge, I cannot reconcile these two statements:

The problem is that in the tutorial there is a speedup:

[screenshots of the tutorial's timings, before and after]

while I cannot see any improvement on my machine. I just wanted to know whether I have to sit tight until the next release (of math, cmdstan, rstan)?

Thanks a lot!

P.S. I often use hierarchical models where I vectorise one dimension of the data (replicates) and loop over the other (genes), so for me this could be a game changer :)

Hi!

Your case sounds like it could be amenable to a speedup. The situation is:

  • 2.18.0 has a stupid bug in the threading code of map_rect which can hit you in rare cases. That is fixed in 2.18.1 (which is on CRAN).
  • 2.18.0 & 2.18.1 have somewhat broken makefiles for MPI; and MPI map_rect only works with cmdstan, to my knowledge.

…but you could try to get MPI working with 2.18.1 by putting the following into make/local:

CXX=mpicxx
CC=mpicxx
STAN_MPI=true
CXXFLAGS+=-DSTAN_MPI

I haven't tested that, but it could work… let me know if that does it, should you try.
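If it compiles, the MPI-enabled binary would then typically be launched through your MPI runtime rather than directly, along the lines of (the launcher name and process count depend on your MPI installation):

mpirun -np 4 ./logistic1 sample data file=redcard_input.R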

Best,
Sebastian

I tried the exact tutorial on a machine with 8+ cores (there are 7 shards in the data), and I was able to see an increase in performance.

NO MPI

real    5m14.365s
user    5m14.121s
sys     0m0.145s

YES MPI

real    2m21.317s
user    12m30.617s
sys     2m30.149s

For a total of 7x the cores and a 2.2x efficiency gain.

The load was rarely at 700%, and mostly at 200-300%, as you can imagine.

System:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             40
NUMA node(s):          3
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-4650 v4 @ 2.20GHz
Stepping:              1
CPU MHz:               2199.009
BogoMIPS:              4399.99
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13
NUMA node1 CPU(s):     14-27
NUMA node2 CPU(s):     28-39

Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.1/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.1


One question remains though: is having more shards than cores detrimental to performance? I will try to test it.


Sounds like my makefile hack from above works?

The question of shards is difficult. The current implementation just splits the shards into equally sized blocks, and the number of blocks depends on the number of MPI processes. The shard order determines which MPI process will handle a given shard.
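So, for example, with 8 shards and 4 MPI processes each process would be assigned a contiguous block of 2 shards, in the order the shards appear in the data.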

We will hopefully integrate the Intel TBB at some point. Then the threading case will have some clever queuing implemented and users won't need to worry about the shard number anymore… but that will take a while (this is not all settled yet).

Actually, I kept the original make/local suggested in the tutorial:

CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

Appending your flags produced an error (I haven't tried your flags on their own).

That would make the whole thing convenient to implement. It would be amazing to be able to do:

for(x in 1:X) data ~ distr_parallel(.., ..)
... (and in the same code)
for(y in 1:Y) target += distr_lpdf_parallel(data | .., ..) 

One year?

@wds15 A test on another server:

NO MPI

Gradient evaluation took 0.05 seconds

real    4m21.414s
user    4m20.811s
sys     0m0.042s

YES MPI

Gradient evaluation took 0.34 seconds (!!)

real    6m55.186s
user    34m22.194s
sys     2m53.557s


Again, I have a loss in performance. Small or big, different machines give quite heterogeneous results. Can you spot some critical differences between the two systems that could explain the failure to gain efficiency? (e.g., cores per socket, CPU family, …)

System

unix309 501 %  lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                46
On-line CPU(s) list:   0-45
Thread(s) per core:    1
Core(s) per socket:    23
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2593.993
BogoMIPS:              5187.98
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-45

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)

Matrix products: default
BLAS: /stornext/System/data/bioinf-centos6/bioinf-software/bioinfsoftware/R/R-3.5.1/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/bioinf-centos6/bioinf-software/bioinfsoftware/R/R-3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

I also tried to use just your make/local

CXX=mpicxx
CC=mpicxx
STAN_MPI=true
CXXFLAGS+=-DSTAN_MPI

but the compilation fails

A wild guess here is that the MPI system is badly configured on some of these machines? Do you get more reliable results with threading?

Possibly my makefile suggestion also needs the pthreads thing (which is very system specific).

I am hesitant to give predictions on time-estimates of an enhanced map_rect.

I was wondering what threading is. It is obviously not the usual chains-in-parallel (?), and I suppose it is a kind of MPI alternative for within-chain parallelisation? Is there a tutorial on how to use threading instead of MPI?

A decently big wall for users adopting map_rect is the lack of knowledge about which system goes with which makefile settings. I will try to share my experience to help build some sort of "startup guideline".

Within-chain parallelisation works with MPI or threading. These are different techniques for parallelism. MPI will take precedence over threading, and it is not recommended to switch on both at the same time. So if you use MPI, then you should make sure that no "-DSTAN_THREADS" is popping up in your makefiles.

Sorry about that… this is why I keep saying that threading is simpler to get going. Writing a "startup guideline" would be great, and I am happy to comment. You are very much welcome to suggest additions to our documentation to keep future users from running into walls.
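For the threading route with cmdstan, a minimal sketch (using the flags and file names already mentioned in this thread; the thread count is just an example) would be:

# make/local (threading only, no MPI flags)
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

# rebuild cmdstan and the model, then run with, e.g., 4 threads
export STAN_NUM_THREADS=4
./logistic1 sample data file=redcard_input.R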

Sorry to butt in here, but I have a question about this bug and versions. I'm on macOS Mojave and I've installed rstan 2.18.2. However, when I check the version of my underlying Stan installation, I get 2.18.0, as follows:

packageVersion("rstan")
[1] ‘2.18.2’
rstan::stan_version()
[1] "2.18.0"

So… I'm confused here as to whether I'm susceptible to the map_rect bug or not. Do I need to update the underlying Stan version, and if so, how? (I've searched and cannot find such instructions beyond updating the rstan version.)

Please check the version of the StanHeaders package… that has to be at least 2.18.1 to be safe.

OK, I updated StanHeaders to 2.18.1, then reinstalled rstan just to be sure. Now it looks like this:

packageVersion("rstan")
[1] ‘2.18.2’
packageVersion("StanHeaders")
[1] ‘2.18.1’
rstan::stan_version()
[1] "2.18.0"

Am I good?

I think yes… though I don't quite know what rstan::stan_version() is for and how it gets the 2.18.0 shown. The StanHeaders version looks fine to me.

stan_version() is a function in the rstan package to determine your Stan version. I had a peek at the code and it's this:

> rstan::stan_version
function () 
{
    .Call(CPP_stan_version)
}

So I guess it's telling me the C++ version of Stan installed??