Benefits of parallelization with a thread pool from the Intel TBB

Hi!

I have been advocating the use of a thread pool, as implemented in the Intel TBB, as opposed to our current C++11 threading model, which creates new threads every time asynchronous work is dispatched. I finally have a good way to benchmark this and observe the speed difference.
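
To make the contrast concrete, here is a minimal standalone C++ sketch (nothing Stan-specific; the helpers partial_sum, sum_async and sum_tbb are made up for illustration). The std::async variant launches fresh threads on every call, whereas tbb::parallel_reduce schedules the work onto the TBB's persistent worker pool:

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

// sum a slice of x; stands in for an expensive log-likelihood term
static double partial_sum(const std::vector<double>& x, std::size_t b, std::size_t e) {
  return std::accumulate(x.begin() + b, x.begin() + e, 0.0);
}

// C++11 model: every call spawns brand-new threads via std::async
double sum_async(const std::vector<double>& x, std::size_t n_chunks) {
  const std::size_t chunk = x.size() / n_chunks;
  std::vector<std::future<double>> futures;
  for (std::size_t i = 0; i < n_chunks; ++i) {
    const std::size_t b = i * chunk;
    const std::size_t e = (i + 1 == n_chunks) ? x.size() : b + chunk;
    futures.push_back(std::async(std::launch::async, partial_sum, std::cref(x), b, e));
  }
  double total = 0.0;
  for (auto& f : futures) total += f.get();
  return total;
}

// TBB model: the range is split recursively and run on a persistent thread pool
double sum_tbb(const std::vector<double>& x) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, x.size()), 0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double init) {
        return init + std::accumulate(x.begin() + r.begin(), x.begin() + r.end(), 0.0);
      },
      std::plus<double>());
}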

As an example I am using a hierarchical Poisson reduce with G=500 groups and N=10 observations per group. The Stan code is below. I am comparing 4 different approaches:

  1. TBB parallel reduce (very easy to write down for the user)
  2. TBB parallel map (somewhat more data prep required)
  3. vanilla map_rect (coded the same way as the map, so data prep is needed and it’s clunky to code with)
  4. serial code written in a very straightforward, easy way

The results are here:



To me they show that:

  • the TBB with its thread pool clearly has the best scaling behavior and outperforms the current C++11 thread-based approach
  • the TBB parallel map is faster than the serial code even on 1 core, probably due to a smaller AD graph footprint, at the cost of somewhat cumbersome coding
  • the map_rect runs here already use the scalable memory allocator from the TBB, so without the TBB (as Stan is right now) things will be even worse for map_rect

What is super amazing here is that the execution time on 6 cores is 5x faster than the respective serial program - that is, a 70 s run time goes down to just 14 s!

Here is the code, which runs with the parallel-ad-tape-3 branch of stan-math:

functions {
  // runs reduce over the index range start to end. Mapping from
  // data-index set to group indices pre-stored in gidx
  real hierarchical_reduce(int start, int end, int[] y, vector log_lambda_group, int[] gidx) {
    return poisson_log_lpmf(y[start:end]| log_lambda_group[gidx[start:end]]);   
  }

  // function defined in C++ which calls parallel_reduce_sum with a
  // lambda functor which simply binds the arguments passed into it
  real parallel_hierarchical_reduce(int[] y, vector log_lambda_group, int[] gidx, int grainsize);

  // map_rect alternative
  vector mr_reduce(vector beta, vector log_lambda, real[] xr, int[] xi) {
    real lp = poisson_log_lpmf(xi| log_lambda[1]);
    return [lp]';
  }
  
  real hierarchical_map(int g, vector log_lambda, int[,] yg) {
    real lp = poisson_log_lpmf(yg[g]| log_lambda[g]);
    return lp;
  }
  
  // parallel_map TBB based alternative
  real[] parallel_hierarchical_map(int[] group, vector log_lambda, int[,] yg);
}
data {
  int<lower=0> N;
  int<lower=0> G;
  int<lower=0> grainsize;
  int<lower=0,upper=3> method;
}
transformed data {
  real true_log_lambda = log(5.0); 
  real true_tau = log(10)/1.96;
  int y[N*G];
  int yg[G,N];
  real xr[G,0];
  int gidx[N*G]; 
  int start[G];
  int end[G];
  int group[G];
  vector[0] theta_dummy;

  print("Simulating problem size: G = ", G, "; N = ", N);
  if (method == 0) {
    print("Using parallel reduce TBB with grainsize = ", grainsize);
  } else if (method == 1) {
    print("Using parallel map TBB");
  } else if (method == 2) {
    print("Using map_rect.");
  } else if (method == 3) {
    print("Using serial evaluation only.");
  }

  for(g in 1:G) {
    real lambda_group = lognormal_rng(true_log_lambda, true_tau);
    int elem = 1;
    group[g] = g;
    start[g] = (g-1) * N + 1;
    end[g] = g * N;
    for(i in start[g]:end[g]) {
      y[i] = poisson_rng(lambda_group);
      yg[g,elem] = y[i];
      gidx[i] = g;
      elem += 1; 
    }
  }
}
parameters { 
  real log_lambda;
  real<lower=0> tau; 
  vector[G] eta;
}
model {
  vector[G] log_lambda_group = log_lambda + eta * tau;
  real lpmf = 0;

  if (method == 0) {
    lpmf = parallel_hierarchical_reduce(y, log_lambda_group, gidx, grainsize);
  } else if (method == 1) {
    lpmf = sum(parallel_hierarchical_map(group, log_lambda_group, yg));
  } else if (method == 2) {
      vector[1] log_lambda_group_tilde[G];
      for(g in 1:G)
        log_lambda_group_tilde[g,1] = log_lambda_group[g];
      lpmf = sum(map_rect(mr_reduce, theta_dummy, log_lambda_group_tilde, xr, yg));
  } else if (method == 3) {
      lpmf = poisson_log_lpmf(y| log_lambda_group[gidx]);
  }

  target += lpmf;
  target += std_normal_lpdf(log_lambda);
  target += std_normal_lpdf(tau);
  target += std_normal_lpdf(eta);
}
generated quantities {
  real msq_log_lambda = square(true_log_lambda - log_lambda);
  real msq_tau = square(true_tau - tau);
}

The accompanying C++ header with the user-defined parallel functions:

#include <stan/math.hpp>

#include <boost/iterator/counting_iterator.hpp>
#include <stan/math/prim/scal/functor/parallel_reduce_sum.hpp>
#include <stan/math/rev/scal/functor/parallel_reduce_sum.hpp>
#include <stan/math/prim/scal/functor/parallel_for_each.hpp>
#include <stan/math/rev/scal/functor/parallel_for_each.hpp>

namespace poisson_hierarchical_scale_model_namespace {

// user provided function
template <typename T3__>
typename boost::math::tools::promote_args<T3__>::type
hierarchical_reduce(const int& start,
                        const int& end,
                        const std::vector<int>& y,
                        const Eigen::Matrix<T3__, Eigen::Dynamic, 1>& log_lambda_group,
                    const std::vector<int>& gidx, std::ostream* pstream__);

template <typename T3>
inline typename boost::math::tools::promote_args<T3>::type
parallel_hierarchical_reduce(const std::vector<int>& y,
                             const Eigen::Matrix<T3, Eigen::Dynamic, 1>& log_lambda_group,
                             const std::vector<int>& gidx,
                             const int& grainsize,
                             std::ostream* pstream__) {
  typedef boost::counting_iterator<int> count_iter;
  typedef typename boost::math::tools::promote_args<T3>::type return_t;
  const int elems = y.size();

  typedef typename boost::math::tools::promote_args<T3>::type local_scalar_t__;
  typedef local_scalar_t__ fun_return_scalar_t__;

  // C++ only binds with a lambda the arguments of the user function
  return_t lpmf = stan::math::parallel_reduce_sum(
      count_iter(1), count_iter(elems+1), return_t(0.0),
        [&](int start, int end) {
          return hierarchical_reduce(start, end, y, log_lambda_group, gidx, pstream__);
        }, grainsize);


  /* C++ only version
  return_t lpmf = stan::math::parallel_reduce_sum(
      count_iter(1), count_iter(elems+1), return_t(0.0),
        [&](int start, int end) {
          const std::size_t partial_elems = end-start+1;
          std::vector<int> partial_data;
          partial_data.insert(partial_data.end(), y.begin() + start - 1,
                              y.begin() + end);
          std::vector<return_t> partial_log_lambda(partial_elems);
          for(std::size_t i = 0; i != partial_elems; ++i)
            partial_log_lambda[i] = log_lambda_group[gidx[start + i-1]-1];
          return stan::math::poisson_log_lpmf(partial_data, partial_log_lambda);
        }, grainsize);
  */
  
  return lpmf;
}

template <typename T1__>
typename boost::math::tools::promote_args<T1__>::type
hierarchical_map(const int& g,
                     const Eigen::Matrix<T1__, Eigen::Dynamic, 1>& log_lambda,
                 const std::vector<std::vector<int> >& yg, std::ostream* pstream__);

template <typename T1__>
std::vector<typename boost::math::tools::promote_args<T1__>::type>
parallel_hierarchical_map(const std::vector<int>& group,
                          const Eigen::Matrix<T1__, Eigen::Dynamic, 1>& log_lambda,
                          const std::vector<std::vector<int> >& yg, std::ostream* pstream__) {
  
  typedef typename boost::math::tools::promote_args<T1__>::type return_t;
  const int elems = group.size();

  // C++ only binds with a lambda the arguments of the user function
  std::vector<return_t> lpmf = stan::math::parallel_map(
      group.begin(), group.end(),
        [&](int g) -> return_t {
          return hierarchical_map(g, log_lambda, yg, pstream__);
        });
  
  return lpmf;
}

}  // namespace poisson_hierarchical_scale_model_namespace
5 Likes

Really excellent work! TBB really looks like the way to go.
One question: Does the TBB influence performance on non-Intel CPUs?

The Intel TBB is, after all, a C++ library, so it depends on how well your compiler optimizes for your platform.

However, it is correct that the Intel TBB uses Intel-specific instructions, for example in its scalable memory allocator. This is an important piece for good threading performance. So far I have tried out the scalable memory allocator with serial (non-threaded) Stan programs and found a ~7% performance boost on macOS with Intel. We have run the same benchmark on Linux with AMD and found a ~5% performance loss in that case. I am still optimistic that in the threaded case the scalable memory allocator of the TBB is a good thing - even on AMD.
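
As a side note, using the scalable allocator explicitly from C++ looks roughly like the sketch below (just an illustration; in the branch the tbbmalloc library is linked in for you, so no code changes are needed):

#include <tbb/scalable_allocator.h>

#include <vector>

int main() {
  // a vector whose storage is taken from the TBB scalable memory pool
  std::vector<double, tbb::scalable_allocator<double>> grads;
  grads.reserve(1000);
  for (int i = 0; i < 1000; ++i)
    grads.push_back(0.5 * i);
  return grads.size() == 1000 ? 0 : 1;
}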

As I don’t have an AMD system available, it would certainly be nice to verify the gains I am seeing. I would expect very similar performance gains on AMD; maybe slightly smaller, but certainly still worth it.

I could make the above example runnable in a relatively easy way if people volunteer to run the benchmark on AMD. That would be very helpful to confirm that what I am expecting is indeed correct. So if you would like to contribute, please raise your virtual hand here.

1 Like

Can you make a branch on the perf testing repo that I can run?

I can also test on my system (ryzen 2700x) if you put up an example to run

Sorry for taking some time here, and thanks for the offer to run on AMD. I prepared an end-to-end example of the above on my GitHub. This should fire things up exactly as I have it running here:

git clone https://github.com/wds15/cmdstan.git cmdstan-wds15
cd cmdstan-wds15
git checkout parallel-ad
git submodule update --init --recursive
cd examples/poisson-bench
./benchmark-poisson-scaling.R

You would need R installed and in your path. You probably want to run the R file interactively to make some adjustments, like changing the maximal number of cores and the like.

Please include in your post your CPU info and OS.

I am running on macOS Mojave with a 6-core 2.9 GHz i9 (i9-8950HK).

Thanks!

(This is also a nice test of whether the makefiles are relatively bulletproof, as this will ship a TBB to you and compile & link it.)

That’s great! I’m away from my main computer for a couple of days, but once I get back I can test under Linux and Windows.

No hurry.

On Windows please use mingw32-make to build things (I will make that automatic at some point, but for now please use it on Windows - it’s part of RTools).

Results on Kubuntu 19.04 with an 8-core Ryzen 2700X and clang 8.0:


4 Likes

Very cool. Many thanks! Did you have trouble getting this to run or was it smooth?

These results are awesome… they show that the scaling is better on AMD than on Intel. Looks like the Intel TBB is just as good on AMD as it is on Intel… the results even suggest that the AMD CPU is better suited for Stan programs (at 6 cores you get 4.5x on AMD while on Intel I got 3.8x). What are the cache sizes of this Ryzen?

1 Like

No troubles on Linux, there are some issues with the Windows setup but I haven’t looked at it properly. Will post more details on that one tomorrow.

The cache sizes are:

  • L1: 768 KB
  • L2: 4 MB
  • L3: 16 MB

Argh… on Windows we cannot use the rpath linking trick which works nicely on Linux and macOS. For the time being, simply copy the TBB DLL files into the same directory as the Stan binary. So grab all the DLL files from stan/lib/stan_math/lib/tbb_2019_U6/lib/, copy them into the same directory as the Stan binary, and it should work.

The alternative to copying the DLL files around is to extend the PATH variable so that it includes (as an absolute path) the absolute/path/to/cmdstan/stan/lib/stan_math/lib/tbb_2019_U6/lib/.

Just ran this! System specs below. This was the only thing running on my computer, but it looks like something weird is going on with map_rect?

lscpu info: When I ran this the CPU was at about 3 GHz

Is this running with the TBB malloc? I might swap that out for another one to see if that is why I don’t get the linear speedups you and Andrew get.

lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  16
Socket(s):           1
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               8
Model name:          AMD Ryzen Threadripper 2950X 16-Core Processor
Stepping:            2
CPU MHz:             1712.447
CPU max MHz:         3500.0000
CPU min MHz:         2200.0000
BogoMIPS:            6986.19
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31

Here’s more general information about the computer

sudo inxi --admin -C  -m -M -F -xxx -z
System:  Kernel: 5.0.0-20-generic x86_64 bits: 64 compiler: gcc v: 8.3.0 Console: tty 2 dm: GDM3 3.32.0 
           Distro: Ubuntu 19.04 (Disco Dingo) 
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X399-E GAMING v: Rev 1.xx serial: <filter> UEFI: American Megatrends 
           v: 1002 date: 02/15/2019 
Memory:    RAM: total: 94.33 GiB used: 2.60 GiB (2.8%) 
           Array-1: capacity: 512 GiB slots: 8 EC: None max module size: 64 GiB note: est. 
           Device-1: DIMM 0 size: No Module Installed 
           Device-2: DIMM 1 size: No Module Installed 
           Device-3: DIMM 0 size: 16 GiB speed: 2400 MT/s type: DDR4 detail: synchronous unbuffered (unregistered) 
           bus width: 64 bits total: 64 bits manufacturer: N/A part-no: CT16G4DFD824A.C16FBD serial: <filter> 
           Device-4: DIMM 1 size: 16 GiB speed: 2400 MT/s type: DDR4 detail: synchronous unbuffered (unregistered) 
           bus width: 64 bits total: 64 bits manufacturer: N/A part-no: CT16G4DFD824A.C16FBD serial: <filter> 
           Device-5: DIMM 0 size: 16 GiB speed: 2400 MT/s type: DDR4 detail: synchronous unbuffered (unregistered) 
           bus width: 64 bits total: 64 bits manufacturer: N/A part-no: CT16G4DFD824A.C16FBD serial: <filter> 
           Device-6: DIMM 1 size: 16 GiB speed: 2400 MT/s type: DDR4 detail: synchronous unbuffered (unregistered) 
           bus width: 64 bits total: 64 bits manufacturer: N/A part-no: CT16G4DFD824A.C16FBD serial: <filter> 
           Device-7: DIMM 0 size: 16 GiB speed: 2400 MT/s type: DDR4 detail: synchronous unbuffered (unregistered) 
           bus width: 64 bits total: 64 bits manufacturer: N/A part-no: CT16G4DFD824A.C16FBD serial: <filter> 
           Device-8: DIMM 1 size: 16 GiB speed: 2400 MT/s type: DDR4 detail: synchronous unbuffered (unregistered) 
           bus width: 64 bits total: 64 bits manufacturer: N/A part-no: CT16G4DFD824A.C16FBD serial: <filter> 
CPU:       Topology: 16-Core (2-Die) model: AMD Ryzen Threadripper 2950X bits: 64 type: MT MCP MCM arch: Zen+ family: 17 (23) 
           model-id: 8 stepping: 2 microcode: 800820B L1 cache: 1536 KiB L2 cache: 8192 KiB L3 cache: 32.0 MiB 
           flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 223558 
           Speed: 1994 MHz min/max: 2200/3500 MHz boost: enabled Core speeds (MHz): 1: 1884 2: 1887 3: 1887 4: 1888 5: 1890 
           6: 2156 7: 1889 8: 1888 9: 1973 10: 1985 11: 1965 12: 1971 13: 1928 14: 1970 15: 1991 16: 1968 17: 1967 18: 1973 
           19: 1968 20: 1962 21: 1965 22: 1968 23: 2182 24: 1888 25: 1965 26: 1888 27: 1889 28: 1885 29: 1889 30: 1887 
           31: 1886 32: 1887 
           Vulnerabilities: Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: __user pointer sanitization 
           Type: spectre_v2 mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling 
2 Likes

@andrjohns can you run the inxi command I posted above? Would be interested in seeing your specs.

Sure thing!

lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       43 bits physical, 48 bits virtual
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               8
Model name:          AMD Ryzen 7 2700X Eight-Core Processor
Stepping:            2
CPU MHz:             2588.205
BogoMIPS:            7385.07
Virtualisation:      AMD-V
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-15
sudo inxi --admin -C  -m -M -F -xxx -z
System:    Kernel: 5.0.0-21-generic x86_64 bits: 64 compiler: gcc v: 8.3.0 
           Console: tty 1 wm: kwin_x11 dm: SDDM Distro: Ubuntu 19.04 (Disco Dingo) 
Machine:   Type: Desktop Mobo: ASRock model: AB350 Pro4 serial: <filter> 
           UEFI [Legacy]: American Megatrends v: P5.10 date: 10/16/2018 
Memory:    RAM: total: 62.91 GiB used: 1.96 GiB (3.1%) 
           Array-1: capacity: 256 GiB slots: 4 EC: None max module size: 64 GiB note: est. 
           Device-1: DIMM 0 size: 16 GiB speed: 2134 MT/s type: DDR4 
           detail: synchronous unbuffered (unregistered) bus width: 64 bits total: 64 bits 
           manufacturer: N/A part-no: CMK32GX4M2A2666C16 serial: N/A 
           Device-2: DIMM 1 size: 16 GiB speed: 2134 MT/s type: DDR4 
           detail: synchronous unbuffered (unregistered) bus width: 64 bits total: 64 bits 
           manufacturer: N/A part-no: CL17-17-17 D4-2400 serial: N/A 
           Device-3: DIMM 0 size: 16 GiB speed: 2134 MT/s type: DDR4 
           detail: synchronous unbuffered (unregistered) bus width: 64 bits total: 64 bits 
           manufacturer: N/A part-no: CMK32GX4M2A2666C16 serial: N/A 
           Device-4: DIMM 1 size: 16 GiB speed: 2134 MT/s type: DDR4 
           detail: synchronous unbuffered (unregistered) bus width: 64 bits total: 64 bits 
           manufacturer: N/A part-no: CL17-17-17 D4-2400 serial: N/A 
CPU:       Topology: 8-Core model: AMD Ryzen 7 2700X bits: 64 type: MT MCP arch: Zen+ 
           family: 17 (23) model-id: 8 stepping: 2 microcode: 800820B L1 cache: 768 KiB 
           L2 cache: 4096 KiB L3 cache: 16.0 MiB 
           flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm bogomips: 118161 
           Speed: 2498 MHz min/max: N/A Core speeds (MHz): 1: 2498 2: 2525 3: 2460 4: 2477 
           5: 4297 6: 4306 7: 2610 8: 2580 9: 2907 10: 3005 11: 2473 12: 2447 13: 2386 
           14: 2373 15: 2896 16: 2734 
           Vulnerabilities: Type: l1tf status: Not affected 
           Type: mds status: Not affected 
           Type: meltdown status: Not affected 
           Type: spec_store_bypass 
           mitigation: Speculative Store Bypass disabled via prctl and seccomp 
           Type: spectre_v1 mitigation: __user pointer sanitization 
           Type: spectre_v2 
           mitigation: Full AMD retpoline, IBPB: conditional, STIBP: disabled, RSB filling
1 Like

Thanks for running this. It looks great to my eyes (for the TBB for sure). A few notes:

  • yes, the tbbmalloc is linked in automatically. I would expect that not using tbbmalloc would hurt the multi-threading performance - but validating that would be valuable, of course (you would need to deal with the makefiles, though)

  • the linear scaling seen so far was only up to 6 (me) or 8 (@andrjohns) cores. You have a much bigger machine, it seems, with 32 cores… maybe increase the problem size by at least a factor of 5, I would say

  • for map_rect: your results are indeed odd, as it should speed up at least to some degree, but maybe that’s a peculiarity of your system? This can possibly improve with a larger problem size and/or a different compiler, since the map_rect implementation relies on whatever threading the compiler provides.

All in all, I do think this looks very good. Results on Windows would be nice to have, given that so many users work on Windows and that OS has terrible thread startup times, as I have read in the TBB docs (which the thread pool will avoid, of course).

EDIT: And how many actual cores do you have? It’s 16, right? The 32 logical cores come from hyper-threading, as it looks. When I say a 6-core machine, I mean 6 physical cores; I haven’t explored the utility of hyper-threading, which does not seem useful for Stan judging from your results.

Trying to get this set up on Windows: the tbb_2019_U6/lib folder doesn’t exist after a fresh clone, and there’s a weird build error when I try to build it:

C:\cmdstan-wds15\stan\lib\stan_math\lib\tbb_2019_U6>make all
'\"cscript /nologo /E:jscript ./build/detect.js  /arch \""' is not recognized as an internal or external command,
operable program or batch file.
'\"cscript /nologo /E:jscript ./build/detect.js  /runtime \""' is not recognized as an internal or external command,
operable program or batch file.
build/common.inc:81: *** Architecture not detected.  Stop.

@andrjohns Pretty similar specs! Looking over the results they seem to be pretty similar as well

yes, the tbbmalloc is linked in automatically. I would expect that not using tbbmalloc would hurt the multi-threading performance - but validating that would be valuable, of course (you would need to deal with the makefiles, though)

I ran these tests with mimalloc as well and got the same performance, so that’s probably not the issue I thought it was.

the linear scaling seen so far was only up to 6 (me) or 8 (@andrjohns) cores. You have a much bigger machine, it seems, with 32 cores… maybe increase the problem size by at least a factor of 5, I would say

It’s 16 cores - the ‘hype’ in hyper-threading is very real for Stan programs. I’ll increase the problem size and post results. What does the grainsize argument do?

for map_rect: your results are indeed odd, as it should speed up at least to some degree, but maybe that’s a peculiarity of your system? This can possibly improve with a larger problem size and/or a different compiler, since the map_rect implementation relies on whatever threading the compiler provides.

Maybe MPI takes a long time to spin up on my machine.

Agree results look v good!

the folder exists after you build this - and for the first build you have to use

mingw32-make build

after that you should see the tbb in the directory I pointed out. This make is part of RTools.

EDIT: It took me ages to figure out what the f*** the matter is with the make. The errors you see come from the wrong make being used - you need mingw32-make. I plan to make the use of the mingw32 make automatic whenever we are on Windows. More makefile magic. Yack.

grainsize defines how the work is chunked. In earlier versions of this I had to play with it (when I used the TBB reduce). However, I am no longer using the TBB reduce, but instead chunk the work myself into big enough pieces, which works a lot better in my experience. So for now, grainsize is set to 0, which splits the problem into equally sized chunks - as many chunks as you give it cores.
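
In pseudo-C++ the chunking rule is roughly this (a hypothetical helper for illustration, not the actual stan-math code):

#include <algorithm>
#include <cstddef>

// grainsize == 0: divide the jobs evenly over the available cores;
// grainsize  > 0: use the requested chunk size directly
inline std::size_t chunk_size(std::size_t num_jobs, std::size_t grainsize,
                              std::size_t num_cores) {
  if (grainsize > 0)
    return grainsize;
  return std::max<std::size_t>(1, (num_jobs + num_cores - 1) / num_cores);
}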

I would not mess with grainsize for the time being. We will need to decide at some point whether to include it as an option. It probably makes sense for problems with multiple layers of parallelism; there, choosing grainsize deliberately may be important.