Gradient evaluation took 0.030967 seconds
1000 transitions using 10 leapfrog steps per transition would take 309.67 seconds.
[...]
real 5m28.342s
user 5m28.069s
sys 0m0.096s
With MPI (3 cores given, but 4 cores active?)
~/cmdstan-2.18.1/redcards$ time ./logistic1 sample data file=redcard_input.R
Gradient evaluation is 3 times slower (is it because I gave it 3 cores and it is summing the times? If so, that is confusing):
Gradient evaluation took 0.090448 seconds
1000 transitions using 10 leapfrog steps per transition would take 904.48 seconds.
real 6m34.745s
user 18m1.636s
sys 1m2.689s
The makefiles for MPI are messed up in 2.18.1. I forget where the patch is for that… an option is to use develop, if that is an option for you… while develop is supposed to work just fine, it is unreleased software…
@wds15 What about CmdStan 2.18.0? I'm seeing a similar dramatic slowdown when STAN_NUM_THREADS > 1.
I've tried both macOS 10.14.3 and Ubuntu 18.04.2.
I'm adding CXXFLAGS += -DSTAN_THREADS and CXXFLAGS += -pthread to the make/local file of CmdStan before I call make build on CmdStan, as well as make on the .stan file.
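In other words, the make/local file in the CmdStan directory contains:

```make
# make/local in the CmdStan directory (threading build)
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread
```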
Please use CmdStan 2.18.1 to get the fix for map_rect for the threading case.
Bear in mind that map_rect was written for models which use ODEs or similarly super-expensive things to calculate. Where it doesn't perform well are cases where you have to give up vectorization and replace it with a parallel map_rect call.
If you want to have a chance at some speedup, you have to write your code such that vectorization is still used inside map_rect. In that case you may see speedups for complicated likelihoods and large data.
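To make that concrete, here is a minimal sketch (with my own illustrative names and data layout, not from the model discussed above) of a logistic likelihood that stays vectorized inside each shard:

```stan
functions {
  // log-likelihood of one shard; the lpmf call stays vectorized over the shard
  vector shard_ll(vector beta, vector theta, real[] x_r, int[] x_i) {
    real lp;
    lp = bernoulli_logit_lpmf(x_i | beta[1] + beta[2] * to_vector(x_r));
    return rep_vector(lp, 1);
  }
}
data {
  int<lower=1> S;                 // number of shards
  int<lower=1> n;                 // observations per shard
  int<lower=0, upper=1> y[S, n];  // outcomes, one row per shard
  real x[S, n];                   // covariate, one row per shard
}
parameters {
  vector[2] beta;                 // shared intercept and slope
}
model {
  vector[0] theta[S];             // no shard-specific parameters in this sketch
  beta ~ normal(0, 1);
  target += sum(map_rect(shard_ll, beta, theta, x, y));
}
```

Each shard then makes one vectorized bernoulli_logit_lpmf call over its n observations rather than n scalar calls, which is the situation where map_rect has a chance to pay off.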
We are working on hopefully also speeding up the vectorization case with threading. That will take a while.
RStan does not support MPI, but you can use threading with it.
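A rough sketch of that threading route (assuming rstan >= 2.18; the model file and data names here are hypothetical):

```r
# Sketch: within-chain threading with rstan and a map_rect model.
# First enable threading at compile time by adding to ~/.R/Makevars:
#   CXX14FLAGS += -DSTAN_THREADS -pthread
Sys.setenv(STAN_NUM_THREADS = 4)  # threads each chain may use
library(rstan)
fit <- stan("model_with_map_rect.stan", data = stan_data)  # hypothetical names
```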
Sorry to ask for more clarification, but, surely due to my lack of knowledge, I cannot reconcile these two statements:
The problem is that the tutorial shows a speedup from one timing to the other, while I cannot see any improvement on my machine. I just wanted to know whether I have to sit tight until the next release (of math, CmdStan, RStan)?
Thanks a lot!
P.S. I often use hierarchical models where I vectorise one dimension of the data (replicates) and loop over the other (genes), so for me this could be a game changer :)
The question of shards is difficult. The current implementation just splits the shards into equally sized blocks, and the number of blocks depends on the number of MPI processes. The shard order determines which MPI process will handle a given shard. For example, with 12 shards and 3 MPI processes, shards 1-4 go to the first process, 5-8 to the second, and 9-12 to the third.
We will hopefully integrate the Intel TBB. Then the threading case will have some clever queuing implemented and users won't need to worry about the shard number anymore… but that will take a while (this is not all settled yet).
NO MPI
Gradient evaluation took 0.05 seconds
real 4m21.414s
user 4m20.811s
sys 0m0.042s
YES MPI
Gradient evaluation took 0.34 seconds (!!)
real 6m55.186s
user 34m22.194s
sys 2m53.557s
Again, I see a loss in performance. Whether the model is small or big, different machines give very heterogeneous results. Can you spot some critical differences between the two systems that could explain the failure to gain efficiency? (e.g., cores per socket, CPU family, …)
System
unix309 501 % lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 46
On-line CPU(s) list: 0-45
Thread(s) per core: 1
Core(s) per socket: 23
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Stepping: 2
CPU MHz: 2593.993
BogoMIPS: 5187.98
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-45
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.4 (Final)
Matrix products: default
BLAS: /stornext/System/data/bioinf-centos6/bioinf-software/bioinfsoftware/R/R-3.5.1/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/bioinf-centos6/bioinf-software/bioinfsoftware/R/R-3.5.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.1
I was wondering what threading is. It is obviously not the normal running of chains in parallel (?), and I suppose it is a kind of MPI alternative for within-chain parallelisation? Is there a tutorial on how to use threading instead of MPI?
A decently big wall for users adopting map_rect is the lack of knowledge about which system goes with which makefile settings. I will try to share my experience to help towards some sort of "startup guideline".
Within-chain parallelisation works with MPI or threading. These are different techniques for parallelism. MPI takes precedence over threading, and it is not recommended to switch on both at the same time. So if you use MPI, you should make sure that no "-DSTAN_THREADS" is popping up in your makefiles.
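Roughly, the two make/local setups look like this (a sketch; the STAN_MPI flag and the mpicxx wrapper follow the Stan Math MPI instructions, and the 2.18.1 makefile issue mentioned above may still bite):

```make
# make/local for threading:
CXXFLAGS += -DSTAN_THREADS
CXXFLAGS += -pthread

# make/local for MPI instead (note: no -DSTAN_THREADS anywhere):
STAN_MPI = true
CC = mpicxx
```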
Sorry for that… this is why I keep saying that threading is simpler to get going. Writing a "startup guideline" would be great and I am happy to comment. You are very much welcome to make suggestions to our documentation to keep future users from running into walls.
Sorry to butt in here, but I have a question about this bug and versions. I'm on macOS Mojave and I've installed rstan 2.18.2. However, when I check the version of my underlying Stan installation I get 2.18.0, as follows:
So… I'm confused here as to whether I'm susceptible to the map_rect bug or not. Do I need to update the underlying stan_version and, if so, how? (I've searched and cannot find such instructions beyond updating the rstan version.)