Getting the PSTL to work is a bit of a mess: you need the Intel TBB, which is set up to build on Mac with Apple's clang compiler. However, for the PSTL you have to use an OpenMP 4.0 compliant compiler… which Apple's clang is not (it only supports 3.x). So I used g++ 7 from MacPorts to compile the Stan program.
Sure, but I noticed that loading a custom memory management library improves performance under threading. In the meantime I found Hoard (http://hoard.org/), which is a GPL-licensed memory allocator designed for threaded applications… and it gives me about 25% performance gains literally just by loading this library, without changing a single line of my code. So we should probably document for users how to set this up to speed things up. From reading through the docs, the allocator handles memory allocation much better with respect to cache locality (for a given running thread), which is something the standard STL does not take care of at all, I guess.
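As a starting point for such documentation, here is a minimal Python sketch of "just loading the library". It assumes Hoard has been built as a shared library and that the dynamic loader preloads it via `DYLD_INSERT_LIBRARIES` on macOS or `LD_PRELOAD` on Linux. The helper names `preload_env` and `timed_run`, the library path, and the model command in the usage note are hypothetical placeholders.

```python
import os
import platform
import subprocess
import time


def preload_env(lib_path, base_env=None, system=None):
    """Return a copy of the environment that asks the loader to preload lib_path."""
    env = dict(os.environ if base_env is None else base_env)
    system = platform.system() if system is None else system
    # Assumption: macOS reads DYLD_INSERT_LIBRARIES, other systems (Linux) read LD_PRELOAD.
    var = "DYLD_INSERT_LIBRARIES" if system == "Darwin" else "LD_PRELOAD"
    env[var] = lib_path
    return env


def timed_run(cmd, env=None):
    """Run cmd to completion and return the elapsed wall time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start
```

Comparing `timed_run(["./my_model", "sample"])` with `timed_run(["./my_model", "sample"], env=preload_env("/path/to/libhoard.dylib"))` would then show whether the allocator helps for a given model; the program itself is not recompiled.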
I increased the benchmark problem size a bit, and I am getting a consistent edge for using the Intel TBB directly over the PSTL:
PSTL:
real 2m49.168s
user 22m8.761s
Intel TBB:
real 2m34.776s
user 20m21.620s
Intel TBB with Hoard:
real 2m0.320s
user 16m49.180s
PSTL with Hoard:
real 2m19.731s
user 19m19.224s
Intel TBB with Hoard (2nd run):
real 2m0.745s
user 17m2.926s
So using the Intel TBB directly is clearly faster than going through the PSTL, and Hoard gives nice performance gains in both cases.
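To put numbers on this, the wall-clock times can be turned into relative reductions. The following is a small sketch that only uses the "real" times listed above (first run of each configuration); the helper names `to_seconds` and `reduction` are made up for illustration.

```python
import re


def to_seconds(stamp):
    """Convert a `time`-style stamp such as '2m49.168s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ("real") times copied from the runs listed above.
real = {
    "PSTL": to_seconds("2m49.168s"),
    "Intel TBB": to_seconds("2m34.776s"),
    "Intel TBB with Hoard": to_seconds("2m0.320s"),
    "PSTL with Hoard": to_seconds("2m19.731s"),
}


def reduction(baseline, variant):
    """Percent reduction in wall time of variant relative to baseline."""
    return 100.0 * (real[baseline] - real[variant]) / real[baseline]


if __name__ == "__main__":
    for baseline, variant in [
        ("PSTL", "Intel TBB"),
        ("Intel TBB", "Intel TBB with Hoard"),
        ("PSTL", "PSTL with Hoard"),
    ]:
        print(f"{variant} vs {baseline}: {reduction(baseline, variant):.1f}% less wall time")
```

On these numbers, direct TBB needs roughly 8.5% less wall time than the PSTL, and Hoard cuts wall time by roughly 22% with the TBB and 17% with the PSTL.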