Historical record on including TBB

This is a copy of the discussion about adding the TBB dependency. It used to be at the stan-dev/math wiki page: Dependency Checklist for the Intel TBB. It’s now here because we’re cleaning up the wikis.

We want to answer the following questions:

  • What is the dependency used for?

    Enable thread-based parallelism with load balancing. The TBB provides a very rich feature set for this and is widely used in scientific projects.

    Can you be more specific about the features needed from TBB? More specifically, what do we not get out of C++ threads?

C++ threads are rather low-level to program with. The C++11 standard allows for very limited abstraction through futures and the async facility - but that’s about it.
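To make "low-level" concrete, here is a minimal sketch of a parallel map written with only std::async. The names sharded_map and num_shards are made up for this illustration and are not taken from Stan Math. The caller has to choose the number of shards, and nothing rebalances the work if one shard turns out to be slower than the others.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Apply f to every element of `in`. The index range is cut into at most
// `num_shards` contiguous blocks and each block runs as one std::async task.
// (Hypothetical helper for illustration only.)
std::vector<double> sharded_map(const std::vector<double>& in,
                                const std::function<double(double)>& f,
                                std::size_t num_shards) {
  assert(num_shards > 0);
  std::vector<double> out(in.size());
  std::vector<std::future<void>> shards;
  // Block size rounded up so that all elements are covered.
  const std::size_t block = (in.size() + num_shards - 1) / num_shards;
  for (std::size_t begin = 0; begin < in.size(); begin += block) {
    const std::size_t end = std::min(begin + block, in.size());
    shards.push_back(
        std::async(std::launch::async, [&in, &out, &f, begin, end] {
          for (std::size_t i = begin; i < end; ++i) out[i] = f(in[i]);
        }));
  }
  // Wait for every shard; get() rethrows an exception thrown inside a shard.
  for (auto& s : shards) s.get();
  return out;
}
```

Calling sharded_map(values, f, 4) starts up to four asynchronous tasks per call. Whether the implementation reuses threads between calls is left unspecified by the standard.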

The Intel TBB provides a high-level approach to parallelizing computationally intensive tasks. See the summary from Intel on the benefits of the TBB. Overall, the TBB provides a large set of utilities for adding parallelism to existing programs. It is fully complementary and is designed to be added to an existing project. The features which I would use immediately are:

  • work-balancing scheduler with work-stealing (no more need for sharding by users)
  • parallel map (for the TBB this is a parallel_for)
  • parallel reduce (the TBB provides a deterministic reduce)
  • scalable memory allocators (something I did not know about before looking at the TBB; heap allocation is a problem in threaded programs because there is only one heap, and scalable memory allocators solve that with thread-specific heaps). Interestingly, using the scalable memory allocator already speeds up our non-threaded code, and it speeds up threaded code quite a bit in the examples I tested on macOS and Linux.
  • support for nested parallelization, which is important in order to let us add parallelism without any constraints at various levels. I do not think that we should emphasize nested parallelism, but to be flexible in how we go forward, this is important. For example, assume we have more super-nodes in the future which themselves parallelize. Combining these with a parallel_reduce will just work.
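Why a deterministic reduce matters: floating-point addition is not associative, so a parallel sum can change in the last digits whenever the work is split differently. The sketch below shows one way to get a reproducible result with plain C++ threads. It is not the TBB's implementation, and the name deterministic_sum and the parameter chunk are assumptions made for this example. The chunk boundaries depend only on chunk, never on the number of threads, and the partial sums are combined in index order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `x` using fixed-size chunks. Each chunk's partial sum is stored in its
// own slot and the slots are added up in index order at the end, so the
// result does not depend on `num_threads`. (Hypothetical helper.)
double deterministic_sum(const std::vector<double>& x, std::size_t chunk,
                         std::size_t num_threads) {
  assert(chunk > 0 && num_threads > 0);
  const std::size_t num_chunks = (x.size() + chunk - 1) / chunk;
  std::vector<double> partial(num_chunks, 0.0);

  // Worker t handles chunks t, t + num_threads, t + 2 * num_threads, ...
  auto work = [&](std::size_t t) {
    for (std::size_t c = t; c < num_chunks; c += num_threads) {
      const std::size_t begin = c * chunk;
      const std::size_t end = std::min(begin + chunk, x.size());
      partial[c] = std::accumulate(x.begin() + begin, x.begin() + end, 0.0);
    }
  };

  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) workers.emplace_back(work, t);
  for (auto& w : workers) w.join();

  // Fixed combination order: chunk 0, chunk 1, chunk 2, ...
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

With this layout, deterministic_sum(x, 64, 1) and deterministic_sum(x, 64, 8) return bitwise identical values, which is the property one wants for reproducible runs.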

Further very useful things for the future:

  • thread safe containers of most standard containers
  • other means to parallelize, for example the flow graph, which allows a graphical representation of the program execution flow that is automatically translated into an optimized parallel workload (NUTS itself could be formulated like this and allow for parallel forward and backward integration once in a while).
  • Why do we need or want this dependency?

    Avoid having to reinvent the wheel for high-performance thread-based parallelism.

    The first answer didn’t mention anything about high-performance. So, what do you mean here? What features are we leveraging?

The TBB implements the fastest approach for each platform, which in most cases is a thread pool. We also get other things “for free”, for example core affinity scheduling.

  • Is the license compatible with BSD? Does it allow us to distribute its source and binaries?

    Apache 2.0 open-source license. That is OK to my knowledge.

additional point: The Apache 2.0 license is compatible with the GPL-3 as stated on gnu.org.

The Apache 2.0 license is less liberal in comparison to the BSD license as it says some things about patents. However, those statements only relate (to my understanding) to the TBB itself. Thus those patent considerations only apply to the TBB source which we would distribute with Stan as I understand.

  • How mature is the dependency? Very mature: it has been around for more than 10 years.

additional point: Intel does test the TBB only on Intel hardware, of course. Since it is at the end of the day just a C++ library this is not a problem, I suppose; but something to bear in mind (not testing on AMD hardware… which we probably don’t do ourselves at the moment).

  • Are there alternatives?

    Maybe this: http://stellar-group.org/libraries/hpx/ but I never got it to build

  • How often will we need to update the dependency? (How often has it changed over the last year?)

    Three updates in the last year, but there is no need to update with every release.

  • How does this affect Math maintainers?

    • Will we need to change any of the source before we include it into the Math library?

      We need to take care of thread initialization / management / think about how to put those pieces together.

      Is this different than how we currently use threading?

Currently we only use threading in map_rect, and there we create the threads ourselves. What really bothers me is that the C++11 specification of async leaves out all details of how tasks are executed. This is done on purpose to make it easy for compiler vendors to implement, but a thread pool would be much better in many circumstances. When we move to the TBB, I would suggest using a so-called arena observer object. It gets called whenever the TBB adds a thread to one of its threading arenas (one can define multiple arenas). When a thread enters the arena we would need to instantiate the thread-local AD tape instance so that reverse-mode operations can be executed against an initialized AD tape. So yes, we would change map_rect to call a parallel_for from the TBB, which is then split into tasks that are dispatched to the threads active in the current arena.
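The requirement that every worker thread needs its own initialized tape can be sketched with plain C++ threads. Everything named here (Tape, local_tape, on_thread_entry, record, run_workers) is a made-up stand-in and not Stan Math or TBB code. The function on_thread_entry plays the role that the answer above assigns to the arena observer callback; in this sketch each worker simply calls it first.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for an AD tape: just a list of recorded values.
struct Tape {
  std::vector<double> entries;
};

// One tape pointer per thread; it stays null until the thread is set up.
thread_local std::unique_ptr<Tape> local_tape;

// Hook to run when a thread starts doing AD work.
void on_thread_entry() {
  if (!local_tape) local_tape.reset(new Tape());
}

// Record a value on the calling thread's tape.
void record(double value) {
  assert(local_tape && "tape must be initialized before recording");
  local_tape->entries.push_back(value);
}

// Start `num_threads` workers that each record `ops` values on their own
// tape, and return the final tape size seen by each worker.
std::vector<std::size_t> run_workers(std::size_t num_threads, std::size_t ops) {
  std::vector<std::size_t> sizes(num_threads, 0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&sizes, t, ops] {
      on_thread_entry();  // without this call, record() would hit the assert
      for (std::size_t i = 0; i < ops; ++i) record(static_cast<double>(i));
      sizes[t] = local_tape->entries.size();
    });
  }
  for (auto& w : workers) w.join();
  return sizes;
}
```

Each worker ends up with exactly ops entries because no tape is shared between threads. The thread that calls run_workers never runs the hook, so its own tape pointer stays null.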

  • Is this dependency header-only? Static library? Shared library?

    Headers + Shared library.

  • Do we need to change the build to get this to work?

    Yes. Will be handled like the MPI shared libraries with rpath coding.

additional point: I would strongly prefer to use the makefiles of the TBB, which work out of the box on Linux and macOS. However, on Windows the TBB makefiles require mingw32-make. As I understand it, this make variant is more POSIX-like. Everything works with this in our Jenkins test pipeline, since mingw32-make is part of RTools, our main target on Windows.

  • How does this affect:
    • Math developers?

      Those working on threading need to pick up the framework. Everything else stays the same. The library will be a requirement to work with. Ideally it is not optional, but a must-have.

      Can you put up a simple example of how threading is different under C++ threads and TBB? Once we decide to go TBB, does that mean we can not use C++ threads directly (are they mutually exclusive)?

First: The TBB is complementary - so we can continue to use threads as we have done so far. We could continue to use map_rect unchanged if we wanted to and this would work. In addition, for the TBB the parallelism is in itself optional. Thus we can write our program with the TBB and whether this will actually run with more than 1 core is something which the user decides at the very end.

The low-level object of the TBB is a “task” which is a piece of work to execute. In practice the TBB will use a threadpool to execute that, but that is an implementation detail and may actually change in the future if that’s better eventually. In most cases we will not need to code up tasks, but we can just use the high-level algorithms (parallel_for/reduce/scan or the flow-graph thing).

In case we use the task interface, one certainly has to be careful to manage the dependencies correctly in order to avoid deadlocks.

  • Math users?

    Not affected, I think.

  • Stan?

    Not at all.

  • Stan interface users?


  • Stan interface maintainers?

    Integrate the library just like CVODES. Actually, RStan would just use RcppParallel, so it is more like BH and RcppEigen except that there is a shared library that has to be linked to. PyStan maintainers will probably build the library themselves, but from talking to them it seems that they meet all the requirements (in particular on Windows with respect to make and the compiler).

    Please update with current information

  • Can we still support all the same compilers we support now?

    • Test on Windows gcc 4.9.3: RcppParallel can build the TBB on Windows using RTools.
    • Test on current default Mac Xcode clang: works.
    • Test on current default Linux compiler: works.
  • How much time does this add to compiling the first model with:

    35s compile time without build parallelization; 11s with 4 cores

    • RStan? Not applicable (Windows & Mac use binaries)
    • CmdStan? Same story.
    • What about subsequent compile times? It adds an additional shared linking step. Don’t know how much (not a lot is my guess). Can you update this with more than a guess?

What data are you looking for? I eyeballed the runtimes of our Jenkins on the TBB PRs and there I do not see any noticeable increase in the overall time it takes to test things (and these do trigger the build of the TBB and the linking of the TBB).

  • Where do we want to use this? Does any other source code need to change?

    The math library needs changes and the interfaces need build system changes.

  • How difficult would it be to write our own version of just the functionality we require?

    Impossible! That’s way too much clever code there. We can allow for nested parallelism and even have the chance to parallelize NUTS itself using the dynamic graph building scheme from the TBB. Lots of very useful building blocks for parallelism are just there.

    Could you rewrite this answer? It’s not impossible. Could you be clearer about the technical claim? What is the functionality we require? Once that’s in place, we can actually have a discussion about whether that functionality is difficult.

Well, I do think that there are a lot of things about parallelism which we (or at least I) don’t know, but which are important. So giving a complete list of the actual features needed is hard since it is about the unknowns. One example is scalable memory allocators. I did not know about them, but they turn out to be very important for good performance.

Things coming to my mind:

  • scheduler implementing work-stealing technique
  • scalable memory allocators + memory pools
  • high-level parallel for/reduce/scan facilities
  • thread safe containers (vector)
  • combinable class, which makes it easy to implement accumulators that accumulate by thread and are combined at the very end
  • very modular design of the library, which allows combining different patterns as needed (you can do a deterministic reduce)
  • management of thread arenas => we could have one arena per Stan model where we run multiple chains within one arena. Whenever one chain finishes, the freed resource goes to the other chains. Multiple arenas can easily be set up to run multiple models (useful for httpstan, for example).
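The accumulate-per-thread-then-combine pattern from the list above can be sketched in standard C++ to show roughly what we would otherwise write ourselves. PerThread and count_even are hypothetical names for this illustration. This is not the TBB class, and a production version would need more care (for example, clearing slots between runs).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: one accumulator slot per thread.
template <typename T>
class PerThread {
 public:
  // Return the calling thread's slot, value-initialized on first use.
  // References into a std::map stay valid when other slots are inserted.
  T& local() {
    std::lock_guard<std::mutex> lock(mutex_);
    return slots_[std::this_thread::get_id()];
  }

  // Fold all slots together; call only after the workers have finished.
  template <typename F>
  T combine(F f, T init) const {
    T result = init;
    for (const auto& kv : slots_) result = f(result, kv.second);
    return result;
  }

 private:
  std::mutex mutex_;
  std::map<std::thread::id, T> slots_;
};

// Count the even numbers in [0, n) using `num_threads` workers.
std::size_t count_even(std::size_t n, std::size_t num_threads) {
  assert(num_threads > 0);
  PerThread<std::size_t> counts;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counts, t, n, num_threads] {
      std::size_t& mine = counts.local();  // one lock per thread, none per item
      for (std::size_t i = t; i < n; i += num_threads)
        if (i % 2 == 0) ++mine;
    });
  }
  for (auto& w : workers) w.join();
  return counts.combine([](std::size_t a, std::size_t b) { return a + b; },
                        std::size_t{0});
}
```

The hot loop touches only the thread's own slot, so there is no locking or atomic traffic per element. The single combine step at the end is the only place where the partial results meet.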