Benchmarking Multithreading for stan::math::exp using TBB, Proof of Concept (POC)

I think I made this faster, so I'm scaling the number of threads to see how it behaves.

I did a quick grep through stan/math, and it doesn't look like anyone's actually using TBB for multithreading yet.

Here are some results from this test case:

TEST(MathFunctions, expVecBench) {
  // std timing includes
  using std::chrono::duration;
  using std::chrono::duration_cast;
  using std::chrono::high_resolution_clock;
  using std::chrono::milliseconds;

  // stan math includes
  using stan::math::exp;
  using stan::math::init_threadpool_tbb;

  size_t N = 10000;  // compute exp N times while scaling the number of threads
  // Nthreads scales in powers of two: 2, 4, 8, ..., 512
  std::cout << "N,nThreads,msInt,msDouble\n";
  for (int i = 1; i < 10; ++i) {
    size_t Nthreads = std::pow(2, i);
    stan::math::init_threadpool_tbb(Nthreads);
    std::vector<double> vec(N);
    for (size_t j = 0; j < N; ++j) {
      vec[j] = j + 1;
    }
    std::vector<double> vec_test;

    auto t1 = high_resolution_clock::now();
    EXPECT_NO_THROW(vec_test = stan::math::exp_test(vec));
    auto t2 = high_resolution_clock::now();

    // elapsed milliseconds as an integer
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    // elapsed milliseconds as a double
    duration<double, std::milli> ms_double = t2 - t1;

    std::cout << N << ",";
    std::cout << Nthreads << ",";
    std::cout << ms_int.count() << "ms,";
    std::cout << ms_double.count() << "ms\n";
  }
}
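For context, exp_test itself isn't shown here. The general pattern it relies on — split the vector into chunks and apply exp to each chunk on its own thread — can be sketched without any Stan or TBB dependencies. This is a hypothetical stand-in using std::thread (parallel_exp is my name, not Stan's); TBB's parallel_for does the same thing with dynamic chunking and a reusable thread pool:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: chunked, elementwise parallel exp. Each worker thread
// handles one contiguous chunk of the input vector.
std::vector<double> parallel_exp(const std::vector<double>& in,
                                 std::size_t n_threads) {
  std::vector<double> out(in.size());
  std::vector<std::thread> workers;
  const std::size_t chunk = (in.size() + n_threads - 1) / n_threads;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = t * chunk;
    const std::size_t end = std::min(in.size(), begin + chunk);
    if (begin >= end) break;  // more threads than chunks
    workers.emplace_back([&, begin, end] {
      for (std::size_t i = begin; i < end; ++i)
        out[i] = std::exp(in[i]);
    });
  }
  for (auto& w : workers) w.join();
  return out;
}
```

The key difference from real TBB is that this spawns fresh threads per call, while TBB schedules tasks onto an already-running pool, so its per-call overhead is lower.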

results:

N,nThreads,msInt,msDouble
10000,2,0ms,0.161958ms
10000,4,0ms,0.11016ms
10000,8,0ms,0.069456ms
10000,16,0ms,0.145974ms
10000,32,0ms,0.06747ms
10000,64,0ms,0.074748ms
10000,128,0ms,0.06451ms
10000,256,0ms,0.089873ms
10000,512,0ms,0.065416ms

(You could repeat the experiment many times and take the average for a Monte Carlo estimate, but I haven't done that here.)
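On that Monte Carlo point, the repeated-trials version is a small helper. A minimal sketch (plain C++, names hypothetical) that times a callable over several trials and reports mean and sample standard deviation in milliseconds:

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <functional>
#include <utility>
#include <vector>

// Time `f` over `trials` runs; return {mean_ms, stddev_ms}.
std::pair<double, double> time_trials(const std::function<void()>& f,
                                      int trials) {
  std::vector<double> ms(trials);
  for (int t = 0; t < trials; ++t) {
    auto t1 = std::chrono::high_resolution_clock::now();
    f();
    auto t2 = std::chrono::high_resolution_clock::now();
    ms[t] = std::chrono::duration<double, std::milli>(t2 - t1).count();
  }
  double mean = 0;
  for (double m : ms) mean += m;
  mean /= trials;
  double var = 0;
  for (double m : ms) var += (m - mean) * (m - mean);
  double sd = trials > 1 ? std::sqrt(var / (trials - 1)) : 0.0;
  return {mean, sd};
}
```

Wrapping the exp_test call in a lambda and passing it here would give per-configuration error bars instead of single-shot timings.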

And then without threading (after removing STAN_THREADS=true from make/local):

NO THREADING
N,noThreads,msInt,msDouble
10000,NA,0ms,0.104313ms

My guess is that the advantage of threading plateaus because the cost of dispatching and synchronizing threads eventually outweighs the benefit of splitting up the computation: past a certain point, coordinating the threads costs more than running the actual computation.
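One way to make that plateau argument concrete is Amdahl's law: if a fraction s of the work is serial (thread dispatch, synchronization, setup), the best possible speedup on p threads is 1/(s + (1 - s)/p), which flattens quickly no matter how many threads you add. A small sketch (the s values are illustrative, not measured from these benchmarks):

```cpp
#include <cassert>

// Amdahl's law: maximum speedup on p threads when a fraction s
// of the work (dispatch, synchronization, serial setup) cannot
// be parallelized.
double amdahl_speedup(double s, int p) {
  return 1.0 / (s + (1.0 - s) / p);
}
```

With s = 0.1, for example, even a million threads can't beat a 10x speedup, so the curve flattens early regardless of thread count.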

I think this looks good. Any other tests people want to see? This is a first run; I should re-run the tests with larger N, but this could add speed at a low level, math theory aside.

The branch is here, I’m about to push: https://github.com/drezap/math/tree/feature/issue-3311-test-thread-tbb-exp

Pushed here:

To github.com:drezap/math.git
   7a955d9a52..0d66c6f4ba  feature/issue-3311-test-thread-tbb-exp -> feature/issue-3311-test-thread-tbb-exp

I need to be more rigorous, but this is cool. If I’ve messed up, please let me know.

The files to check out are test/unit/math/prim/fun/exp_test.cpp and stan/math/prim/fun/exp.hpp. I still need to rename the function, but that's an easy fix.

Also, looking at my max amount of threads, locally:

andre@compy:~/stan-dev/math$ cat /proc/sys/kernel/threads-max
255162

~ Regards

And then pushing the thread count toward that limit. Taking the base-2 log of the max above (in R):

> log2(250000)
[1] 17.93157

Let's try 10,000,000 exp() evaluations in parallel with up to 2^16 threads, close to the limits of my local machine but with some headroom so it doesn't crash:

[       OK ] MathFunctions.expVecBench (2 ms)
[ RUN      ] MathFunctions.expVecBench_10000000_17
N,nThreads,msInt,msDouble
10000000,2,92ms,92.3552ms
10000000,4,70ms,70.391ms
10000000,8,69ms,69.2049ms
10000000,16,67ms,67.7656ms
10000000,32,66ms,66.2494ms
10000000,64,66ms,66.1261ms
10000000,128,66ms,66.5695ms
10000000,256,66ms,66.1206ms
10000000,512,66ms,66.6818ms
10000000,1024,66ms,66.1333ms
10000000,2048,66ms,66.6996ms
10000000,4096,67ms,67.3784ms
10000000,8192,65ms,65.8648ms
10000000,16384,65ms,65.1453ms
10000000,32768,67ms,67.1079ms
10000000,65536,66ms,66.3751ms

It looks like it plateaus again, and I'm not sure what TBB's default is. I believe this is SMP (symmetric multiprocessing, i.e., multiple threads sharing one machine's processors), so maybe there are more gains to be had on MPP systems? Not sure, anyone?
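On that question, one thing worth checking is the machine's logical thread count, since oversubscribing past it generally can't help for a compute-bound loop like this (and, as an assumption on my part, TBB's default concurrency roughly tracks this value):

```cpp
#include <cassert>
#include <thread>

// Number of concurrent threads the hardware supports
// (returns 0 if the value cannot be determined).
unsigned logical_threads() {
  return std::thread::hardware_concurrency();
}
```

If this reports, say, 8, it would line up with the plateau appearing at around 2^3 threads in the tables above.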

~ regards

Here’s the test case for reference:

TEST(MathFunctions, expVecBench_10000000_17) {
  // std timing includes
  using std::chrono::duration;
  using std::chrono::duration_cast;
  using std::chrono::high_resolution_clock;
  using std::chrono::milliseconds;

  // stan math includes
  using stan::math::exp;
  using stan::math::init_threadpool_tbb;

  size_t N = 10000000;  // compute exp N times while scaling the number of threads
  // Nthreads scales in powers of two: 2, 4, 8, ..., 65536
  std::cout << "N,nThreads,msInt,msDouble\n";
  for (int i = 1; i < 17; ++i) {
    size_t Nthreads = std::pow(2, i);
    stan::math::init_threadpool_tbb(Nthreads);
    std::vector<double> vec(N);
    for (size_t j = 0; j < N; ++j) {
      vec[j] = j + 1;
    }
    std::vector<double> vec_test;

    auto t1 = high_resolution_clock::now();
    EXPECT_NO_THROW(vec_test = stan::math::exp_test(vec));
    auto t2 = high_resolution_clock::now();

    // elapsed milliseconds as an integer
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    // elapsed milliseconds as a double
    duration<double, std::milli> ms_double = t2 - t1;

    std::cout << N << ",";
    std::cout << Nthreads << ",";
    std::cout << ms_int.count() << "ms,";
    std::cout << ms_double.count() << "ms\n";
  }
}

My supposition is that the more complex the function (i.e., the more mathematical operations it contains), the more helpful additional threads will be. If we have a composite function (pretty much any probability distribution), we could send threads through the distribution, e.g., for something like a leukemia model. exp is just one operation, and I'm guessing it's estimated numerically with something like a Taylor series.

Edit 2:

I'm also running exp(10) instead of exp(i), i = 1, 2, …, N, since past a certain point we hit numerical overflow. If exp() is estimated with a Taylor series, I'm guessing the implementation just stops at that point, so this could affect results.

But with exp(10), we won't experience overflow.
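For reference on the overflow point: for IEEE doubles, exp overflows to infinity once x exceeds log(DBL_MAX) ≈ 709.78, which is why exp(i) for growing i eventually blows up while exp(10) never will. A quick check:

```cpp
#include <cassert>
#include <cmath>

// exp overflows a double once x exceeds log(DBL_MAX) ~= 709.78;
// past that, std::exp returns +infinity.
bool exp_overflows(double x) {
  return std::isinf(std::exp(x));
}
```

So for the exp(i) benchmark, every input past i ≈ 710 was producing infinities rather than finite values.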

Note: it's also suspicious to me that when I run the exp(10) test, I don't get any advantage from running more threads, so I'm wondering if I've messed something up. If anyone's better with TBB, feel free to comment. I'm going to keep looking into it.

So I'm seeing a range of about (-0.02, +0.02) around roughly 0.09 ms for the unthreaded exp(10) case at N = 10000. I don't want to make a histogram of 10,000 trials right now.

For 10 million (10,000,000) evaluations without threading, I'm getting over 110 ms (give or take; a ballpark). That's a drastic speedup. I'm not sure why there's a plateau after 2^3 threads, though.

Anyone with more experience in this area, feel free to comment. I’m pushing the tests to the feature branch if anyone wants to play with it.

I think it's a speedup, but it plateaus in the number of threads. Next is scaling the number of observations and the number of threads.

Edit:

Now it’s taking longer with threading? Not sure what I did.

Edit:

I think it's because I had so many different processes running on my computer. In general, though, I think the threaded version is faster, and I suspect greater speed gains when threading "more composite" functions. To evaluate this effectively we'd need some dedicated cores (I have 5-6 YouTube videos and ~42 tabs open, and I'm not sure how Windows manages threads at the OS scale). To do some serious inference, we'd need serious isolation of processes.

Anyway, I'm going to do some more tests (scaling N and the number of threads independently), then refactor similarly to the OpenCL code (so it lives in its own directory), and then maybe open a PR.

This is still experimental, but if anyone wants to take a look to see if I’m doing something erroneous, that would be great.

Cheers.

Note: I also need to terminate the threads properly. I run R in a command prompt with WSL, and at one point R got called 1,000,000 times, so this could cause uncertainty. I have to open a new window. Any comments appreciated.

More threading.

So this time, I fixed the number of threads to 2, 4, 8 and compared with the unthreaded version. Clear gains as N (the number of evaluations) increases. I suspect further gains with composite functions if we thread multiple elementary functions.

Any benchmark suggestions, or directions to head, I’m open to ideas.

Right now, I'm going to refactor this code similarly to OpenCL, so that it's in its own directory and I'm not adulterating stan/math/prim.

Moreover, when I scaled by thread count, I don't think I closed the threads down properly: my WSL terminal was crashing, and when I tried to run R it ran 30,000 times, so I need to be more careful and figure out what the issue is.

Anyway, here are the results. First, no threads, with N incremented by powers of two; then the same N values with the thread count at powers of two: 2, 4, 8.

And moreover, when I run large numbers of threads, performance could be affected by other processes on my local machine.

But I think this is worth cleaning up and using with Stan models.

### no threading

No threading, scaling N, exp(10):
N,noThreads,msInt,msDouble
1,NA,0ms,0.003524ms
2,NA,0ms,0.000186ms
4,NA,0ms,0.000151ms
8,NA,0ms,0.000164ms
16,NA,0ms,0.00026ms
32,NA,0ms,0.000308ms
64,NA,0ms,0.000511ms
128,NA,0ms,0.000913ms
256,NA,0ms,0.001732ms
512,NA,0ms,0.003478ms
1024,NA,0ms,0.006796ms
2048,NA,0ms,0.0137ms
4096,NA,0ms,0.027292ms
8192,NA,0ms,0.054998ms
16384,NA,0ms,0.161574ms
32768,NA,0ms,0.308698ms
65536,NA,0ms,0.665727ms
131072,NA,1ms,1.2923ms
262144,NA,2ms,2.73796ms
524288,NA,5ms,5.15844ms
1048576,NA,10ms,10.8343ms
2097152,NA,21ms,21.2058ms
4194304,NA,42ms,42.8954ms
8388608,NA,84ms,84.7653ms
16777216,NA,209ms,209.666ms
33554432,NA,436ms,436.234ms
67108864,NA,836ms,836.463ms
134217728,NA,1670ms,1670.12ms
268435456,NA,3304ms,3304.69ms
536870912,NA,6700ms,6700.5ms

threads in powers of two, nThreads = 2, 4, 8 (all inputs set to 10)

all 10
N,nThreads,msInt,msDouble
2,2,0ms,0.026993ms
4,2,0ms,0.001381ms
8,2,0ms,0.001153ms
16,2,0ms,0.00162ms
32,2,0ms,0.00236ms
64,2,0ms,0.004747ms
128,2,0ms,0.007481ms
256,2,0ms,0.011293ms
512,2,0ms,0.016849ms
1024,2,0ms,0.025779ms
2048,2,0ms,0.073363ms
4096,2,0ms,0.029504ms
8192,2,0ms,0.050113ms
16384,2,0ms,0.136854ms
32768,2,1ms,1.20827ms
65536,2,0ms,0.663258ms
131072,2,1ms,1.11818ms
262144,2,2ms,2.27672ms
524288,2,4ms,4.13288ms
1048576,2,9ms,9.59274ms
2097152,2,13ms,13.7976ms
4194304,2,27ms,27.2234ms
8388608,2,53ms,53.6127ms
16777216,2,104ms,104.519ms
33554432,2,224ms,224.874ms
67108864,2,563ms,563.809ms
134217728,2,1134ms,1134.13ms
268435456,2,2417ms,2417.84ms
536870912,2,6183ms,6183.66ms
[       OK ] MathFunctions.expVecBench_N2_2_threads2_all10 (16724 ms)
[ RUN      ] MathFunctions.expVecBench_N2_2_threads4_all10
all 10
N,nThreads,msInt,msDouble
2,4,0ms,0.030837ms
4,4,0ms,0.001448ms
8,4,0ms,0.001295ms
16,4,0ms,0.001624ms
32,4,0ms,0.003657ms
64,4,0ms,0.005594ms
128,4,0ms,0.008186ms
256,4,0ms,0.012027ms
512,4,0ms,0.018287ms
1024,4,0ms,0.028714ms
2048,4,0ms,0.037289ms
4096,4,0ms,0.050509ms
8192,4,0ms,0.091414ms
16384,4,0ms,0.196654ms
32768,4,0ms,0.260902ms
65536,4,0ms,0.519282ms
131072,4,0ms,0.87683ms
262144,4,2ms,2.1367ms
524288,4,3ms,3.93407ms
1048576,4,7ms,7.76433ms
2097152,4,14ms,14.8215ms
4194304,4,29ms,29.6172ms
8388608,4,58ms,58.5754ms
16777216,4,113ms,113.321ms
33554432,4,220ms,220.497ms
67108864,4,435ms,435.551ms
134217728,4,884ms,884.131ms
268435456,4,1967ms,1967.68ms
536870912,4,5560ms,5560.8ms
[       OK ] MathFunctions.expVecBench_N2_2_threads4_all10 (15074 ms)
[ RUN      ] MathFunctions.expVecBench_N2_2_threads8_all10
all 10
N,nThreads,msInt,msDouble
2,8,0ms,0.161965ms
4,8,0ms,0.069863ms
8,8,0ms,0.021994ms
16,8,0ms,0.002051ms
32,8,0ms,0.002529ms
64,8,0ms,0.006821ms
128,8,0ms,0.007567ms
256,8,0ms,0.009711ms
512,8,0ms,0.01757ms
1024,8,0ms,0.031054ms
2048,8,0ms,0.022141ms
4096,8,0ms,0.041309ms
8192,8,0ms,0.053866ms
16384,8,0ms,0.145634ms
32768,8,0ms,0.239546ms
65536,8,0ms,0.498045ms
131072,8,0ms,0.971106ms
262144,8,1ms,1.72449ms
524288,8,3ms,3.62173ms
1048576,8,7ms,7.15323ms
2097152,8,13ms,13.4904ms
4194304,8,27ms,27.3854ms
8388608,8,52ms,52.3734ms
16777216,8,105ms,105.571ms
33554432,8,200ms,200.94ms
67108864,8,407ms,407.948ms
134217728,8,816ms,816.981ms
268435456,8,1939ms,1939.09ms
536870912,8,4744ms,4744.68ms
[       OK ] MathFunctions.expVecBench_N2_2_threads8_all10 (14071 ms)
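Reading the largest N off the tables above, a quick ratio makes the gain concrete: 6700.5 ms unthreaded versus 4744.68 ms on 8 threads at N = 536870912.

```cpp
#include <cassert>

// Speedup ratio: serial time over threaded time.
double speedup(double serial_ms, double threaded_ms) {
  return serial_ms / threaded_ms;
}
```

That works out to roughly a 1.4x speedup at 8 threads for the largest N, well short of linear scaling.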

And I'm pushing it to the same branch. Minor improvements at larger scales, but I think it's worth including, and it may show more advantage with more composite functions and more threads.

Edit: at larger N, threading seems more advantageous, but at smaller N it's more expensive, probably due to the cost of distributing work to threads and collecting results; I need to take a closer look. Any comments appreciated.
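One way to quantify that fixed cost is to time spawning and joining a single do-nothing thread; below some N, this overhead alone swamps the exp work. (TBB amortizes this by reusing its thread pool, but the per-task dispatch cost is still nonzero.) A rough sketch:

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Rough measure of the fixed cost of spawning and joining one
// do-nothing thread -- the overhead that dominates at small N.
double thread_spawn_ms() {
  auto t1 = std::chrono::high_resolution_clock::now();
  std::thread([] {}).join();
  auto t2 = std::chrono::high_resolution_clock::now();
  return std::chrono::duration<double, std::milli>(t2 - t1).count();
}
```

Comparing this number against the serial exp time for a given N gives a ballpark for where the crossover point should sit.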


~ Regards