I think I made this faster. I’m benchmarking how the run time changes as I scale the number of threads.
I did a quick grep through stan/math, and it doesn’t look like anyone is actually using TBB for multithreading yet. I used:
Here are some results, preceded by the test case that produced them:
TEST(MathFunctions, expVecBench) {
  // std timing names
  using std::chrono::high_resolution_clock;
  using std::chrono::duration_cast;
  using std::chrono::duration;
  using std::chrono::milliseconds;
  // stan math names
  using stan::math::exp;
  using stan::math::init_threadpool_tbb;
  size_t N = 10000;  // exp is applied to 10000 elements per run; only the thread count varies
  // the thread count doubles each iteration: 2, 4, 8, ..., 512
  std::cout << "N,nThreads,msInt,msDouble\n";
  for (int i = 1; i < 10; ++i) {
    size_t Nthreads = size_t{1} << i;  // 2^i, exact integer arithmetic instead of std::pow
    stan::math::init_threadpool_tbb(Nthreads);
    std::vector<double> vec(N);
    // inputs are 1..N; note that exp overflows to inf for inputs above roughly 709
    for (size_t j = 0; j < N; ++j) {  // j avoids shadowing the outer loop's i
      vec[j] = j + 1;
    }
    std::vector<double> vec_test;
    auto t1 = high_resolution_clock::now();
    EXPECT_NO_THROW(vec_test = stan::math::exp_test(vec));
    auto t2 = high_resolution_clock::now();
    /* Elapsed time as a whole number of milliseconds. */
    auto ms_int = duration_cast<milliseconds>(t2 - t1);
    /* Elapsed time in milliseconds as a double. */
    duration<double, std::milli> ms_double = t2 - t1;
    std::cout << N << ",";
    std::cout << Nthreads << ",";
    std::cout << ms_int.count() << "ms,";
    std::cout << ms_double.count() << "ms\n";
  }
}
results:
N,nThreads,msInt,msDouble
10000,2,0ms,0.161958ms
10000,4,0ms,0.11016ms
10000,8,0ms,0.069456ms
10000,16,0ms,0.145974ms
10000,32,0ms,0.06747ms
10000,64,0ms,0.074748ms
10000,128,0ms,0.06451ms
10000,256,0ms,0.089873ms
10000,512,0ms,0.065416ms
(You could repeat the experiment many times and get a Monte Carlo estimate of the timings, but I haven’t done that yet.)
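A minimal sketch of what that repetition could look like, assuming the timed call can be wrapped in a lambda. The names `time_repeated`, `summarize`, and `TimingSummary` are hypothetical helpers, not part of Stan Math or GoogleTest, and the sketch uses `std::chrono::steady_clock` rather than the `high_resolution_clock` in the test above.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of repeated timings, in milliseconds.
struct TimingSummary {
  double mean_ms;
  double std_err_ms;  // standard error of the mean
};

// Mean and standard error of a set of timing samples (needs >= 2 samples).
inline TimingSummary summarize(const std::vector<double>& samples_ms) {
  assert(samples_ms.size() >= 2);
  const double n = static_cast<double>(samples_ms.size());
  double sum = 0.0;
  for (double s : samples_ms) sum += s;
  const double mean = sum / n;
  double sq_dev = 0.0;
  for (double s : samples_ms) sq_dev += (s - mean) * (s - mean);
  const double sample_var = sq_dev / (n - 1.0);
  return {mean, std::sqrt(sample_var / n)};
}

// Call f once untimed as a warm-up, then time it reps times.
template <typename F>
std::vector<double> time_repeated(F&& f, std::size_t reps) {
  using clock = std::chrono::steady_clock;
  f();  // warm-up: first-call costs are kept out of the samples
  std::vector<double> samples_ms;
  samples_ms.reserve(reps);
  for (std::size_t r = 0; r < reps; ++r) {
    const auto t1 = clock::now();
    f();
    const auto t2 = clock::now();
    samples_ms.push_back(
        std::chrono::duration<double, std::milli>(t2 - t1).count());
  }
  return samples_ms;
}
```

In the test above, the timed region would become something like `summarize(time_repeated([&] { vec_test = stan::math::exp_test(vec); }, 1000))`, with the mean and standard error printed per thread count instead of a single reading.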
And then without threading (removing the STAN_THREADS=true from make/local):
NO THREADING
N,noThreads,msInt,msDouble
10000,NA,0ms,0.104313ms
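To read the two tables together, it may help to divide the serial time by each threaded time. The sketch below does that arithmetic on the numbers reported above; `Run`, `speedup`, `reported_runs`, and `print_speedups` are hypothetical names used only for illustration. On these single runs the ratio ranges from about 0.64x at 2 threads to about 1.6x at 128 threads, and the 2-, 4-, and 16-thread runs come out slower than the serial run.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One row of the threaded results table.
struct Run {
  int n_threads;
  double ms;
};

// Speedup relative to the serial run; values below 1 mean threading was slower.
inline double speedup(double serial_ms, double threaded_ms) {
  assert(serial_ms > 0.0 && threaded_ms > 0.0);
  return serial_ms / threaded_ms;
}

// Timings copied from the results above (single runs, so they are noisy).
inline std::vector<Run> reported_runs() {
  return {{2, 0.161958},  {4, 0.11016},   {8, 0.069456},
          {16, 0.145974}, {32, 0.06747},  {64, 0.074748},
          {128, 0.06451}, {256, 0.089873}, {512, 0.065416}};
}

// Print one CSV row per thread count.
inline void print_speedups(double serial_ms, const std::vector<Run>& runs) {
  std::printf("nThreads,speedup\n");
  for (const Run& r : runs) {
    std::printf("%d,%.2f\n", r.n_threads, speedup(serial_ms, r.ms));
  }
}
```

Calling `print_speedups(0.104313, reported_runs());` prints the ratio for each thread count, using the no-threading time as the baseline.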
My guess is that the advantage of threading plateaus because the overhead of dispatching work to threads and coordinating them outweighs what is gained by splitting the computation across more of them. In other words, communication between threads costs more than the computation itself. With N this small, each run takes only about 0.1 ms, so that overhead and ordinary timing noise are on the same scale as the work being measured.
I think this looks promising. Are there any other tests people want to see? This is only a first run: I should re-run the tests and increase N. Still, this could add speed at a lower level, independent of any improvements on the math side.
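One way to make that guess concrete is a toy cost model, sketched below under a strong assumption: the compute time divides evenly across p threads while the coordination cost grows linearly in p. Neither the model nor the names `modeled_ms` and `best_thread_count` come from Stan Math or TBB, and the single-run timings above are too noisy to confirm or rule it out.

```cpp
#include <cassert>

// Toy model (an assumption, not a measurement):
//   time(p) = work_ms / p + overhead_ms_per_thread * p
// The first term shrinks with more threads; the second grows.
inline double modeled_ms(double work_ms, double overhead_ms_per_thread, int p) {
  assert(p >= 1);
  return work_ms / p + overhead_ms_per_thread * p;
}

// Thread count in [1, max_threads] with the smallest modeled time.
inline int best_thread_count(double work_ms, double overhead_ms_per_thread,
                             int max_threads) {
  assert(max_threads >= 1);
  int best = 1;
  for (int p = 2; p <= max_threads; ++p) {
    if (modeled_ms(work_ms, overhead_ms_per_thread, p) <
        modeled_ms(work_ms, overhead_ms_per_thread, best)) {
      best = p;
    }
  }
  return best;
}
```

Under this model the best thread count sits near sqrt(work_ms / overhead_ms_per_thread), so a larger N (more work per run) would push the point where extra threads stop helping further out.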
The branch is here (I’m about to push): https://github.com/drezap/math/tree/feature/issue-3311-test-thread-tbb-exp
Pushed; here is the output:
To github.com:drezap/math.git
7a955d9a52..0d66c6f4ba feature/issue-3311-test-thread-tbb-exp -> feature/issue-3311-test-thread-tbb-exp
I need to be more rigorous, but this is cool. If I’ve messed up, please let me know.
The files to look at are test/unit/math/prim/fun/exp_test.cpp and stan/math/prim/fun/exp.hpp. I still need to rename the function, but that’s an easy fix.
Also, here is the maximum number of threads on my local machine:
andre@compy:~/stan-dev/math$ cat /proc/sys/kernel/threads-max
255162
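Note that `/proc/sys/kernel/threads-max` is the system-wide limit on how many threads the kernel will create, not the number that can run at the same time. For a benchmark like this one, the number of hardware threads is probably the more useful bound. A minimal sketch of capping the requested count, where `usable_threads` is a hypothetical helper and not a Stan Math function:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: cap a requested thread count at the number of
// hardware threads reported by the C++ runtime.
inline unsigned usable_threads(unsigned requested) {
  assert(requested >= 1);
  // hardware_concurrency() may return 0 when the value cannot be determined,
  // so fall back to 1 in that case.
  const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
  return std::min(requested, hw);
}
```

Thread counts above this value can still be requested, but the extra threads would share cores rather than add parallelism.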
~ Regards