Parallel STL

wds15 · December 8, 2018, 5:53pm

Getting the PSTL to work is a bit of a mess: you need the Intel TBB, which is set up to build on Mac with the clang compiler from Apple. However, for the PSTL you have to use an OpenMP 4.0 compliant compiler, which Apple clang is not (it only supports 3.X). So I used g++ 7 from MacPorts to compile the Stan program.

Sure, but I noticed that loading a custom memory management library improves performance under threading. In the meantime I found Hoard (http://hoard.org/), which is a GPLed memory allocator designed for threaded applications, and it gives me about 25% performance gains literally just by loading this library, without changing a single line of my code. So we should probably document for users how to set this up to speed things up. From reading through the docs, the allocator handles memory allocation much better with respect to cache locality (for a given running thread). This is something the standard STL does not take care of at all, I guess.

I increased the benchmark problem size a bit, and I am getting a consistent edge for using Intel TBB directly over the PSTL:

PSTL:
real	2m49.168s
user	22m8.761s

Intel TBB:
real	2m34.776s
user	20m21.620s

Intel TBB with Hoard:
real	2m0.320s
user	16m49.180s

PSTL with Hoard:
real	2m19.731s
user	19m19.224s

Intel TBB with Hoard (2nd run):
real	2m0.745s
user	17m2.926s

So vanilla Intel TBB is clearly faster than the PSTL, and Hoard gives nice performance gains for both.
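To put numbers on these comparisons, here is a minimal C++ sketch that turns the "real" timings above into relative wall-time savings. The helper names (`to_seconds`, `wall_time_saved`, `print_comparison`) are made up for illustration and are not part of Stan, TBB, or the PSTL; only the timings themselves come from the runs above.

```cpp
#include <cstdio>

// Convert a `time`-style reading "XmY.Zs" into seconds.
constexpr double to_seconds(int minutes, double seconds) {
    return 60.0 * minutes + seconds;
}

// Fraction of wall time saved when going from `before` to `after`.
constexpr double wall_time_saved(double before, double after) {
    return (before - after) / before;
}

// Wall-clock ("real") times copied from the benchmark runs above.
constexpr double pstl       = to_seconds(2, 49.168);
constexpr double tbb        = to_seconds(2, 34.776);
constexpr double tbb_hoard  = to_seconds(2, 0.320);   // first Hoard run
constexpr double pstl_hoard = to_seconds(2, 19.731);

// Hypothetical helper: print each comparison as a percentage.
inline void print_comparison() {
    std::printf("TBB vs PSTL:          %.1f%% less wall time\n",
                100.0 * wall_time_saved(pstl, tbb));
    std::printf("TBB + Hoard vs TBB:   %.1f%% less wall time\n",
                100.0 * wall_time_saved(tbb, tbb_hoard));
    std::printf("PSTL + Hoard vs PSTL: %.1f%% less wall time\n",
                100.0 * wall_time_saved(pstl, pstl_hoard));
}
```

Calling `print_comparison()` from a `main` function gives roughly 8.5% for TBB over the PSTL, about 22% for Hoard on top of TBB, and about 17% for Hoard on top of the PSTL, measured as a reduction in wall time.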
