Benefits of parallelization with a thread pool from the Intel TBB

These are the results on Windows for the Intel Core i7-5600U CPU at 3.60GHz (4 cores, 8 hyper-threads).

bench-2019-07-27_21-52-method-relative-scale-poisson-static

bench-2019-07-27_21-52-method-serial-scale-poisson-static

I resorted to building the exe file by hand with mingw32-make. It built everything as planned. However, on execution it complained that it can't find tbb.dll and tbbmalloc.dll. I played around a bit and got it working by placing both DLL files next to the exe file. I didn't dig deeper for now - probably a goof on my part.

I also had to tweak the script a bit to get it running; mainly, system2's env argument doesn't work on Windows (at least that's what I read when I quickly googled my issues). So I used system() together with Sys.setenv() and Sys.unsetenv() instead.

If you envision we might do something like this more often, I can try to put together a complete (or at least converge towards a) working script for Windows.

Thanks for running these. That’s with g++ 4.9.3 from RTools?

No, you did not goof up… Windows cannot handle rpath hard-coding of dynamic library paths. You either copy the DLLs into the same directory as the exe, or you add the directory containing the DLLs to the PATH variable - then it also works just fine.

I don’t think we need to do that more often… sorry for the glitches… for me it was valuable to see this running on AMD, and having it now on Windows is also nice. It is interesting to see that map_rect does not do as badly as in the other cases, but the TBB is still the clear winner.

Yes, the g++ supplied with Rtools 3.5.

What syntax did you use instead of system2? I'm just running into the same problem on Windows now.

I'm not at a computer at the moment to paste my code, but I ended up using system(), which doesn't set environment variables itself. I set those manually before the system() call with Sys.setenv() and unset them after the script ends.

Results here from a MacBook Pro with a 6-core i7-8750H CPU @ 2.20GHz and 16 GB RAM.

bench-2019-07-30_17-03-method-relative-scale-poisson-static

bench-2019-07-30_17-03-method-serial-scale-poisson-static

My machine clearly doesn’t like the original map_rect! (something about chunk sizes maybe?)

About hyper-threading: I don't know a fraction of what you do about all this, but my understanding was that hyper-threading is a firmware-level setting - to turn it off you have to reboot the machine and disable it in the BIOS (or whatever the Mac equivalent is). I know this is the case on my desktop machine. So does that mean hyper-threading wasn't actually turned off for all these tests above? If so, the leveling off seen at higher core counts in my test here might mean more performance could be squeezed out of it.

Another thing, about the AMD/Threadripper results above: I have a Threadripper machine as well - 12 cores/24 threads. I'll try to run this on it later. But AFAIK the AMD chips generally have more cache than equivalent Intel chips - could that be why the AMD machines do well above? Edit: On this cache point @wds15 - one difference between the chips in our laptops is that mine has 9MB cache whereas yours has 12MB: https://cpu.userbenchmark.com/Compare/Intel-Core-i9-8950HK-vs-Intel-Core-i7-8750H/m486215vsm470418 - what happens on yours if you run it up to 12 threads?

Edit: I also attach the benchmark summary file for your information: bench-2019-07-30_17-03.csv (2.5 KB)

Any idea why speed goes down with map_rect? How big is this problem?

Thanks for running these.

In my experience Stan runs faster with more cache. So if AMD has more cache per core, that is a good reason for this CPU doing better. Maybe hyper-threading is beneficial on my CPU with more cache… I can try, but I vaguely recall that it doesn't help.

@Bob_Carpenter map_rect has terrible parallelization overhead. For every iteration it creates num_threads - 1 fresh threads (we reuse the main thread) and fresh AD tapes. This cost grows with more cores while, at the same time, the work per core shrinks. map_rect should only bow out at a later point if we increase the problem size. This is why the TBB is so much better… the threads stick around in the pool and the TBB work balancing should optimize the work-to-core allocation. It would be interesting to see whether the TBB also degrades in some circumstances… it probably will if used wrongly.
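
To make the thread-pool point concrete, here is a minimal, hypothetical sketch of a TBB reduce over a vector of likelihood terms (assumed names, not the actual Stan Math implementation). The TBB scheduler creates its worker threads once and keeps them in the pool, so calling the reduce every iteration does not pay the thread-creation cost again:

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>
#include <iostream>
#include <vector>

// Sum a vector of per-term log-likelihood contributions in parallel.
// The lambdas run as TBB tasks scheduled onto pooled worker threads.
double sum_terms(const std::vector<double>& terms) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, terms.size()),  // work is split into chunks
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += terms[i];
        return partial;
      },
      std::plus<double>());
}

int main() {
  std::vector<double> terms(1000000, 1.0);
  double total = 0.0;
  // Calling the reduce repeatedly (as a sampler does every iteration) reuses
  // the same thread pool instead of spawning num_threads - 1 fresh threads.
  for (int iter = 0; iter < 100; ++iter)
    total = sum_terms(terms);
  std::cout << total << "\n";
  return 0;
}

The per-thread AD tape handling is of course the hard part in Stan and is not shown here.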

Btw… many thanks to everyone here!!! The goal was to show that the TBB does well, and it does well on Intel and AMD. We have established this (for this relatively generic example).

So any further effort should go into the RFC on the parallel design, in order to align us on the ideas brought forward so that we can move forward with this.

OK, so this is from my desktop - AMD Threadripper 1920X - 12 cores/24 threads with 32 GB RAM (I turned off the y-limits to see just how bad map_rect was after I noticed it was really slow compared to the TBB).

Here is the summary data: bench-2019-07-30_18-55.csv (2.5 KB)

And more on the system:

lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              24
On-line CPU(s) list: 0-23
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD Ryzen Threadripper 1920X 12-Core Processor
Stepping:            1
CPU MHz:             3219.820
CPU max MHz:         3500.0000
CPU min MHz:         2200.0000
BogoMIPS:            6985.79
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-23

@wds15 - the 2019 MBPs with 8 cores have 16MB cache, so it could be interesting for someone to try that - but I've not got one! I am curious to see if yours performs better than mine at > 6 threads.
FYI I’m happy to try other models some time if you wish.

I let this run on my computer for a few different group and term sizes.

It seems like map_rect does better as the number of terms in each group increases.

Maybe I’m doing something goofy because these results are pretty wild!

Data from the tests are here

bench-2019-07-30_16-36.csv (92.8 KB)

I made the graphs with the code below:

library(data.table)
library(ggplot2)

perf_dt = fread("./bench-2019-07-30_16-36.csv")
# ordered facet labels for the number of groups and the number of terms per group
perf_dt[, group_label := factor(paste0("Groups: ", groups), ordered = TRUE, levels = paste0("Groups: ", as.character(unique(sort(groups)))))]
perf_dt[, terms_label := factor(paste0("Terms: ", terms), ordered = TRUE, levels = paste0("Terms: ", as.character(unique(sort(terms)))))]

# speedup of each method relative to its own 1-core run
ggplot(perf_dt, aes(threads, method_speedup, colour = method_label, shape = method_label)) +
  geom_point() + geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +  # ideal linear scaling
  ggtitle("Speedup vs 1 core of hierarchical Poisson likelihood reduce", "Each curve shows method specific speedup relative to 1 core of respective method") +
  ylab("Speedup vs 1 core") +
  xlab("Threads") +
  scale_x_log10(breaks = c(1, 1:8 * 4)) +
  scale_y_log10(breaks = c(1, 4, 6, 10, 15, 20, 30)) +
  facet_wrap(group_label ~ terms_label, ncol = 3) +
  theme_bw() +  # apply the complete theme first so it does not reset the legend position
  theme(legend.position = "bottom")

# speedup of each method relative to the serial 1-core reference
ggplot(perf_dt, aes(threads, serial_speedup, colour = method_label, shape = method_label)) +
  geom_point() + geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  ggtitle("Speedup vs 1 core of hierarchical Poisson likelihood reduce", "Each curve shows method specific speedup relative to 1 core of respective method") +
  ylab("Speedup vs 1 core") +
  xlab("Threads") +
  scale_x_log10(breaks = c(1, 1:8 * 4)) +
  scale_y_log10(breaks = c(1, 2, 5, 10, 20, 40, 80, 160, 320)) +
  facet_wrap(group_label ~ terms_label, ncol = 3) +
  theme_bw() +
  theme(legend.position = "bottom")

Thanks. The fact that map_rect gets better as you go to larger problems makes a lot of sense - the additional parallelization cost introduced by the thread creation becomes smaller relative to the total workload. With these huge workloads the extra parallelization cost becomes vanishingly small compared to the work to do per CPU… you are crunching really large likelihoods here - the largest case has 2.5 * 10^6 terms and we go from 2.3h down to 0.16h on 16 cores!
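
As a quick back-of-the-envelope check on those numbers: 2.3 h / 0.16 h ≈ 14.4x speedup on 16 cores, i.e. roughly 90% parallel efficiency.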

Given that the problems are so large, the parallelization overhead of map_rect becomes irrelevant - so the remaining difference suggests that the way the new approach patches the AD tree back together is a lot more efficient than the map_rect way of doing it.

Another interesting point is that the TBB reduce is somewhat variable. It is still better than map_rect in most cases, but I see room for improvement there.

Your plots also show that hyper-threading is useless; at least from my point of view.

Here are results on my laptop with hyper-threading. These show that hyper-threading on this machine does not buy me anything.


Hold on before you throw out hyper-threading altogether. You can see on my Threadripper machine that hyper-threading does improve things (albeit at a slower rate). I think it is more likely that the cache is saturated on the Intel machines. You can also see for Stevo's machine above that hyper-threading helps at least with the TBB reduce method, but not the TBB map. Note that my machine and Stevo's actually have the same cache despite his having more cores. It seems to me that the cache needed per core/thread may be the limiting factor - not hyper-threading per se, which is on by default anyhow.

You can say "only use 6 cores", but the OS may well spread that work over 12 physical or virtual cores - you can see this happening on Linux if you run the top command and then press 1: it uses all the cores at low load rather than just 6 at max load. If the program needs more cache per core than the total cache allows, stuff slows down (or at least stops speeding up). No?

The reason I was curious to see your MBP results is that your MBP and my MBP are very close in spec, with CPU speed and cache size being some of the few differences. On rough inspection it looks like performance plateaus at 6 cores on your machine - on mine it may even dip. You could use the csv file from my first post above to plot yours and mine on the same graph.

True. AMD seems to do a lot better with hyperthreading.

I forgot to upload the csv file:

bench-2019-07-31_12-59.csv (2.1 KB)

EDIT: BTW, I updated the GitHub repo to include nicer messages for Windows users when they use the wrong make, and I also tell them what to do about their PATH variable. I hope this helps future users avoid these traps.

What I’m kind of getting at here is this - could you improve the performance at higher core counts by factoring the machine's cache size into the chunking algorithm somehow?

Maybe… I don’t know. The thing is that we want the TBB to take over all the magic as much as possible. The grainsize is only needed when that automatism fails for whatever reason, or when you as the user know more than the TBB.

With grainsize you can limit the parallelism used in case you have more cores than work. So this could come in handy if you have multiple parallel regions in your code and one region does not need that many cores while another benefits from a high core count.

The Intel folks recommend ignoring grainsize to start with. If you think the performance is not good, you set grainsize to a rather large value and then lower it, either until you are happy or until the performance degrades again.
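
For reference, here is a small hypothetical sketch (assumed function name, not the actual Stan code) of where grainsize enters a TBB call - it is the third argument of the blocked_range, and pairing it with a simple_partitioner makes the TBB split the work down to chunks of roughly that size instead of deciding entirely on its own:

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/partitioner.h>
#include <functional>
#include <vector>

// Reduce with an explicit grainsize: the chunks handed to worker threads
// contain on the order of `grainsize` terms instead of whatever the TBB picks.
double reduce_with_grainsize(const std::vector<double>& terms,
                             std::size_t grainsize) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, terms.size(), grainsize),
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += terms[i];
        return partial;
      },
      std::plus<double>(),
      // simple_partitioner splits the range down to chunks of roughly grainsize;
      // the default auto_partitioner treats grainsize only as a lower bound and
      // otherwise chooses chunk sizes itself.
      tbb::simple_partitioner());
}

Tuning as described above then just means re-running with different grainsize values.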

Hi, just wanted to note that Cpp-Taskflow exists, and maybe provides a small speedup even compared to TBB if you believe their paper. More importantly, it officially supports a CMake build, and would be straightforward to integrate as a dependency.

Thanks for that reference… the MIT license is great… being header-only is also great… but does Cpp-Taskflow really need a C++17 compiler? That is what the requirements say - and that would not work with Stan for the foreseeable future. C++11 and a few bits of C++14 are all we can afford given the compilers we support.

Apart from that - can you say a bit more about it? I mean, how long has it been around, how broadly is it used, etc.? The project looks somewhat young at first glance (compared to the TBB).

Requiring C++17 is unfortunate, but RTools 4.0 will use GCC 7, if that's the blocker. As for who uses it, you are correct - it is not as widely adopted as the TBB, given that it was published only in the last few years.

In any case if C++11 is a hard requirement I’m not sure it’s worth the trouble to even benchmark it.

EDIT: I looked through the readme, and the threadpool apparently only requires C++14; it might be possible to edit the sources, but I have no idea how much work would be involved.

What definitely makes it interesting is the license and the fact that it is header-only.

So, yes, RTools is the main constraint here. The jump to 4.0 is yet to come, and we should not move too quickly to these very new compilers… RHEL still uses g++ 4.9, and that is a very widely used server system.

Also, the TBB brings a scalable memory allocator, which has turned out to be important for good multi-threaded performance in Stan.
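
For context, here is a tiny hypothetical example (not Stan code) of what using the scalable allocator looks like - it serves allocations from per-thread memory pools, which avoids lock contention on the global heap when many threads allocate at once:

#include <tbb/scalable_allocator.h>
#include <vector>

// A vector whose storage is served by tbbmalloc (thread-local memory pools)
// instead of the default system heap.
using scalable_vector = std::vector<double, tbb::scalable_allocator<double>>;

int main() {
  scalable_vector workspace;
  workspace.reserve(1 << 20);
  for (int i = 0; i < (1 << 20); ++i)
    workspace.push_back(static_cast<double>(i));
  return 0;
}

Alternatively, linking against the tbbmalloc_proxy library replaces malloc/new globally without touching the code.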

We will add Cpp-Taskflow to the list of alternatives in the RFC which we have open right now.

If you would like, I could try a really large problem on a node with 28 cores on Linux. Just let me know how to compile CmdStan and which switches/environment variables to use when running the sampler. Currently my make/local has
CXXFLAGS += -DSTAN_THREADS
I am using map_rect.

lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              28
On-line CPU(s) list: 0-27
Thread(s) per core:  1
Core(s) per socket:  14
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel® Xeon® CPU E5-2695 v3 @ 2.30GHz
Stepping:            2
CPU MHz:             2300.000
CPU max MHz:         2300.0000
CPU min MHz:         1200.0000
BogoMIPS:            4594.50
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            35840K
NUMA node0 CPU(s):   0-6,14-20
NUMA node1 CPU(s):   7-13,21-27