Parallel STL

@wds15 mentioned the idea of using the parallel versions of the std algorithms (mainly for_each, reduce, transform, etc.) instead of using OpenMP directly like I did (the parallel versions of the std algorithms actually use OpenMP under the hood).

I thought this suggestion was impractical because neither g++ nor clang++ implements the parallel versions of the std algorithms yet, despite their being part of the C++17 standard. However, it seems as if g++ is in the process of merging Intel’s implementation of the parallel std algorithms, so that raises the possibility that we could just include Intel’s parallelstl in the Stan Math sources like Boost and Eigen. Intel’s implementation only requires C++11 and OpenMP 4.0, so we should be good there (except with Apple’s Xcode). I have run one example on Windows with g++-4.9.
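
For concreteness, a minimal sketch of what calling one of these parallel algorithms looks like in its standard C++17 form (Intel’s parallelstl exposes the same algorithms through its own headers for pre-C++17 compilers; the data here is made up):

#include <algorithm>
#include <cmath>
#include <execution>
#include <vector>

int main() {
  std::vector<double> x(1000000, 2.0);
  // Identical to the sequential std::for_each except for the execution
  // policy passed as the first argument; std::execution::par asks the
  // library to run the loop body across multiple threads.
  std::for_each(std::execution::par, x.begin(), x.end(),
                [](double& v) { v = std::log(v); });
}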

Hi!

If we use this type of parallelism, then I would by far favor the C++17 standard over going with OpenMP. It is great to see g++ picking up the parallel STL. Including Intel’s parallelstl in Stan sounds like a way to shortcut the waiting time.

However, I have to say that I looked into the usage of these facilities, and my conclusion was that it will not be straightforward to use them unless you sacrifice exact reproducibility whenever this parallelism is turned on. The for_each could be used to speed up lpdf calculations, but the issue is that we do need to accumulate some terms over all the individual elements. Once you introduce asynchronous parallelism, the accumulation over all elements is no longer guaranteed to give the same result every time.

Maybe the parallel STL has some support for thread-aware accumulation in a numerically exactly reproducible way? One solution is, of course, to say that we do not support exact reproducibility for this type of parallelism.
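
One option along those lines (a sketch only, not something the parallel STL gives us out of the box): sum fixed-size chunks in parallel, each in its natural order, and combine the per-chunk partial sums sequentially in a fixed order, so the result no longer depends on thread scheduling. The function name and chunk size below are made up:

#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Deterministic parallel sum: chunk boundaries and the final combine order
// are fixed, so the result is identical across runs regardless of which
// thread ends up handling which chunk.
double deterministic_sum(const std::vector<double>& x, std::size_t chunk) {
  std::size_t n_chunks = (x.size() + chunk - 1) / chunk;
  std::vector<std::future<double>> partial;
  for (std::size_t c = 0; c < n_chunks; ++c) {
    std::size_t begin = c * chunk;
    std::size_t end = std::min(begin + chunk, x.size());
    partial.push_back(std::async(std::launch::async, [&x, begin, end] {
      return std::accumulate(x.begin() + begin, x.begin() + end, 0.0);
    }));
  }
  double total = 0;
  for (auto& f : partial)
    total += f.get();  // fixed, sequential combine order
  return total;
}

int main() {
  std::vector<double> x(100000, 0.1);
  double s = deterministic_sum(x, 4096);
  (void)s;
}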

My summary a while ago was that this parallelism is certainly very convenient for users (parallel lpdf statements)… but this should only pay off for very large data sets. For those really large data sets you can also code up a map_rect call. That is more pain, yes, but it would give you exact reproducibility and about the same speedup.

(BTW, it would be interesting to see how our map_rect performs with Intel’s async implementation…I would hope that this performs better through the use of a thread-pool)

Sebastian

People who are required to have such strict reproducibility can turn the parallelism off for the final version. But I do not understand how using reduce in a _lpdf would be non-deterministic. Which thread gets which iterations of the loop is non-deterministic, but the calculations each thread does are deterministic. Are you saying that the way in which the terms get added up in the end is numerically non-commutative?

Yes, that is what I am saying. Floating point arithmetic is non-commutative; the order matters.

…but maybe we can do the accumulation in a deterministic order? We are probably not the first to have this kind of problem, so maybe there is a solution. I certainly agree that super-concerned people like me can just turn this type of parallelism off (so we need to ensure that this feature is optional - just like anything else).

If you pull in the parallel STL, then you also need the TBB from Intel… on the one hand it is not so nice to have yet another library, but on the other it looks to me as if they have thought deeply about how to parallelize. For example, they have concepts for chunking, etc.:

https://software.intel.com/en-us/node/506057

So if we want to go crazy with parallelism, then the TBB is interesting in its own right. The TBB is under the Apache 2.0 license, which should be OK, I think.
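
As a rough illustration of those chunking concepts (a sketch with made-up data; compile and link against the TBB): tbb::parallel_for takes a blocked_range that the library splits recursively into chunks, and the optional grain size bounds how small those chunks may get:

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <cmath>
#include <vector>

int main() {
  std::vector<double> x(1000000, 1.5), y(x.size());
  // The range [0, x.size()) is split recursively into chunks; the grain
  // size of 1024 is a lower bound on the chunk size, which is how the TBB
  // exposes chunking control (partitioners add further control, e.g.
  // affinity_partitioner for cache affinity).
  tbb::parallel_for(
      tbb::blocked_range<std::size_t>(0, x.size(), 1024),
      [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          y[i] = std::exp(x[i]);
      });
}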

Yeah, I had to do some stuff to get an example working but it does work.

This is probably most obvious when you tickle .Machine$double.eps in R:

# First: (i + 5 * eps) + 5 * eps, summed left to right. The loop stops once
# adding the small terms to i no longer changes it.
i = 0
while((i + 5 * .Machine$double.eps + 5 * .Machine$double.eps) != i) {i = i + 1; print(i)}

# Same summands grouped as (5 * eps + 5 * eps) + i: this loop stops at a
# different value of i, even though only the order of addition changed.
i = 0
while((5 * .Machine$double.eps + 5 * .Machine$double.eps + i) != i) {i = i + 1; print(i)}

In longer sequences the same effect shows up even with larger eps multiples.

The big question in my mind is how we’re going to configure all these layers of parallelism.

I think this is the right solution as opposed to trying to make multithreaded code bit-level reproducible.

IEEE floating point is commutative (apparently modulo some NaN hacks):

a + b = b + a

It is not associative, though, so in general this does not hold:

(a + b) + c = a + (b + c).

What is a TBB? I’m drowning in three-letter acronyms on the forums here.

Is this another case where Intel’s going to mess with the floating-point standard?

Threading Building Blocks.

I don’t think so. The Intel compiler handles floating-point arithmetic differently from g++ or clang++, but this is Intel’s implementation of a part of the C++17 standard library that g++ and clang++ haven’t implemented yet and that is being merged into GCC.

The Intel TBB looks really promising to me! It did not take me too long to port map_rect to use the parallel_for construct of the TBB. The results from a small example look good:

  • single-threaded: 75s
  • C++ threading: 27s
  • TBB: 22s

This ran on a 6-core machine, and the difference between 27 s and 22 s is about the efficiency loss I usually see between MPI and threading (about 20%). We need more tests, but it looks like the threading performance penalty goes away with the Intel TBB! That would be really cool.

… also: it may actually not stop there. Right now I am still relying on the C++11 thread_local keyword - but I would not be surprised if the thread-local facility supplied by the Intel TBB is far more efficient and has almost no overhead.
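
For reference, the TBB-native counterpart to thread_local is tbb::enumerable_thread_specific. A rough sketch, with a made-up scratch type standing in for the per-thread state (this is not the actual AD stack):

#include <tbb/blocked_range.h>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_for.h>

struct scratch { double sum = 0; };  // stand-in for per-thread state

int main() {
  tbb::enumerable_thread_specific<scratch> tls;
  tbb::parallel_for(
      tbb::blocked_range<int>(0, 1000000),
      [&](const tbb::blocked_range<int>& r) {
        scratch& s = tls.local();  // created lazily, once per worker thread
        for (int i = r.begin(); i != r.end(); ++i)
          s.sum += i * 1e-6;
      });
  double total = 0;
  for (const scratch& s : tls)  // per-thread instances are enumerable afterwards
    total += s.sum;
  (void)total;
}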

BTW, since the Intel TBB does automatic chunking (which is probably much smarter than our stuff right now) we are getting lots of things for free here, I think (the docs mention cache affinity quite a few times).

By now I am almost leaning towards going all in with the TBB as opposed to sticking with the parallel STL. The Intel TBB looks far more mature and offers a lot more control over what is done and how. For example, the library brings a thread-aware memory allocator.

The map_rect using the TBB is here:

If you want to run it, install the TBB and put -include make/tbb into your make/local… I have set this up to work on macOS… also check make/tbb for correctness on your system.

Cool… reading through the Intel docs for their memory allocator I realized that memory allocation on the global heap is a bottleneck. So even with a thread_local AD stack, all those gazillion new calls required to create the AD tape compete for the very same global heap. The Intel folks have created a replacement library for operator new which one can just load along with the program. Doing so speeds up a multi-core run by ~25%… and I haven’t changed a single line of my program! Sounds like it is worthwhile to spend some time on this.
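
For the record, the drop-in replacement I mean here is the TBB malloc proxy; as far as I understand it can be pulled in either by preloading the proxy library at run time or at build time roughly like this (a sketch; exact library names and flags depend on the platform):

// Including this header in one translation unit routes the global
// malloc/free and operator new/delete through the TBB scalable allocator
// (link with -ltbbmalloc_proxy). Alternatively, the proxy library can be
// preloaded at run time without recompiling, e.g. on Linux:
//   LD_PRELOAD=libtbbmalloc_proxy.so ./my_stan_program
#include <tbb/tbbmalloc_proxy.h>

int main() {
  double* x = new double[1000];  // now served by the scalable allocator
  x[0] = 1.0;
  delete[] x;
}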

Yes. All else equal, it is probably best to utilize the functions in the C++17 standard. Can you get the same speedup from a reduce in parallelstl as with the lower-level TBB functions? I should think it compiles to about the same thing since parallelstl is using TBB.
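
Something like this is what the parallel STL reduce would look like (standard C++17 form, made-up data; with Intel’s parallelstl it should dispatch to the TBB underneath):

#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> x(1000000, 0.1);
  // std::reduce may reorder and regroup the additions, which is what makes
  // the parallel speedup possible - and what breaks bit-for-bit
  // reproducibility of the sum.
  double total = std::reduce(std::execution::par, x.begin(), x.end(), 0.0);
  (void)total;
}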

We need to be careful about terminology here. Stan uses a stack-based allocator rather than a heap. It allocates memory using the underlying C++ heap, but given that it expands exponentially, there will only be a few calls to the underlying heap memory allocator.

After that, every call to new in an extension of vari uses the stack, not the heap. This stack does need to be synchronized because order matters (specifically, it needs to be topologically sorted w.r.t. the directed acyclic expression graph).
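
A minimal sketch of that idea (not Stan’s actual implementation, just an illustration of why the underlying heap is hit only rarely):

#include <cstddef>
#include <cstdlib>
#include <vector>

// Toy arena: allocations are a pointer bump inside the current block; the
// underlying C++ heap is only touched when a block fills up, and block
// sizes double each time, so the number of heap calls stays logarithmic.
class toy_arena {
  std::vector<char*> blocks_;
  std::size_t block_size_, used_;
 public:
  explicit toy_arena(std::size_t initial = 1 << 16)
      : block_size_(initial), used_(0) {
    blocks_.push_back(static_cast<char*>(std::malloc(block_size_)));
  }
  ~toy_arena() {
    for (char* b : blocks_) std::free(b);
  }
  void* alloc(std::size_t n) {
    n = (n + alignof(std::max_align_t) - 1)
        & ~(alignof(std::max_align_t) - 1);  // keep allocations aligned
    if (used_ + n > block_size_) {
      while (block_size_ < n) block_size_ *= 2;
      block_size_ *= 2;  // exponential growth
      blocks_.push_back(static_cast<char*>(std::malloc(block_size_)));
      used_ = 0;
    }
    void* p = blocks_.back() + used_;
    used_ += n;
    return p;
  }
};

int main() {
  toy_arena arena;
  double* x = static_cast<double*>(arena.alloc(1000 * sizeof(double)));
  x[0] = 1.0;
}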

Getting the PSTL to work is a bit of a mess: you need the Intel TBB, which is set up to build on macOS with Apple’s clang compiler. However, for the PSTL you have to use an OpenMP 4.0 compliant compiler… which Apple’s clang is not (it only supports 3.x). So I used g++ 7 from MacPorts to compile the Stan program.

Sure, but I noticed that loading custom memory management libraries improves performance under threading. In the meantime I found Hoard (http://hoard.org/), a GPL-licensed memory allocator designed for threaded applications… and it gives me about 25% performance gains literally just by loading this library, without changing a single line of my code. So we should probably document for users how to set this up to speed things up. From reading through the docs, the allocator handles memory allocation much better with respect to cache locality (for a given running thread) - this is something which is not taken care of at all by the standard allocator, I guess.

I increased the benchmark problem size a bit and I am getting a consistent edge for the Intel TBB directly over the PSTL:

PSTL:
real	2m49.168s
user	22m8.761s

Intel TBB:
real	2m34.776s
user	20m21.620s

Intel TBB with Hoard:
real	2m0.320s
user	16m49.180s

PSTL with Hoard:
real	2m19.731s
user	19m19.224s

Intel TBB with Hoard (2nd run):
real	2m0.745s
user	17m2.926s

So the Intel TBB comes out ahead of the PSTL both with and without Hoard, and Hoard gives nice performance gains on top.

I forgot to add where we are speed-wise with what we have currently:

MPI, 6 cores:
real	2m21.527s
user	14m2.403s

MPI, 6 cores with Hoard:
real    1m47.244s
user   10m38.973s

threads, current:
real	3m1.228s
user	22m2.202s

threads, with Hoard
real	2m13.770s
user	15m41.609s

Thus, the current thread implementation is really slow, but we can massively speed things up with Hoard…which holds for MPI as well.

This suggests that the Hoard memory allocator will give any Stan program considerable performance gains. We should run the performance benchmark suite with and without Hoard to follow up on that.

I am not an expert in this stuff… but I think that in the multi-threaded case things are different: the thread-local AD stack gets killed each time a thread is torn down. When (or if) this tear-down happens depends on the implementation details of std::async. Whenever the tear-down happens, the arena allocator needs to start from scratch.

Thus, it is crucial to use a so-called scalable allocator which takes care of per-thread memory allocation in an optimal way, without too much contention or so-called false sharing. At least this is why I think that using Hoard gives me such a performance boost. This boost may well be Mac-specific if other OSes ship with a better allocator.

With threading things are a bit more involved - maybe we can do something about this with a better arena allocator?

EDIT: addition…

So… I have now also tested the Intel Parallel STL 2019 Update 3 in its binary edition. The speedups get even better:

Intel PSTL with linked tbbmalloc and Hoard:
real  1m51.540s
user  15m50.995s

As I think this should become part of Stan sooner rather than later, I filed a PR which lets the user optionally use a C++17 implementation - either vanilla C++17 if you happen to have a compiler which is up to it, or via the Intel Parallel STL. This is quite nice… we are gaining more than 30% in speed for this toy example compared with where we started (3 minutes vs 1:50 min now).

See the PR on GitHub.

Do we tear down threads or pool them? I could see arguments for both.

What properties would it need to have?

Our std::async facility from C++11 leaves all of these details open to the specific implementation on a given OS & compiler. The Intel TBB implements a thread pool and manages these threads for us.

The allocator needs to be “scalable” and avoid “false sharing”, see here.

However, probably the arena allocator is not a problem, because allocations happen very rarely due to the exponential growth and the lazy deallocation. Probably the allocation of all the temporaries is more of a problem. For that we can opt to replace the system malloc calls with something more clever. So far I have seen the best performance with Hoard, but maybe the Intel TBB can provide a similar gain (it also brings a malloc replacement).
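
On the TBB side, the relevant pieces seem to be tbb::scalable_allocator (per-thread memory pools, less contention on the global heap) and tbb::cache_aligned_allocator (allocations start on their own cache line, which avoids false sharing between separately allocated per-thread objects). A sketch of plugging them into standard containers (made-up sizes; link against tbbmalloc):

#include <tbb/cache_aligned_allocator.h>
#include <tbb/scalable_allocator.h>
#include <vector>

int main() {
  // tbb::scalable_allocator: memory comes from per-thread pools instead of
  // the contended global heap.
  std::vector<double, tbb::scalable_allocator<double>> values(1000000, 0.0);

  // tbb::cache_aligned_allocator: each allocation starts on its own cache
  // line and does not share a line with other allocations, which avoids
  // false sharing between objects used by different threads.
  std::vector<double, tbb::cache_aligned_allocator<double>> padded(64, 0.0);

  values[0] = 1.0;
  padded[0] = 1.0;
}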

Just a reminder that the parallel STL provides a subset of OpenMP’s functionality.

I am strongly considering the Intel TBB as the basis for our threading capabilities. Any thoughts on that?

I prefer it a lot over OpenMP since it gives a much nicer C++ API rather than compiler pragmas.