Parallel STL

Getting the PSTL to work is a bit of a mess: you need the Intel TBB, which is set up to build on Mac with the clang compiler from Apple. However, the PSTL needs an OpenMP 4.0 compliant compiler, which the Apple clang is not (it only supports 3.X). So I used g++ 7 from MacPorts to compile the Stan program.

Sure, but I noticed that loading a custom memory management library improves performance under threading. In the meantime I found Hoard (http://hoard.org/), which is a GPLed memory allocator designed for threaded applications, and it gives me about 25% performance gains literally just by loading this library, without changing a single line of my code. So we should probably document for users how to set this up to speed things up. From reading through the docs, the allocator handles memory allocation much better with respect to cache locality (for a given running thread); this is something the standard STL does not take care of at all, I guess.
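
To check whether a preloaded allocator helps on a given machine, something like the following allocation-heavy threaded workload could be used. This is only a minimal sketch: `churn` and `time_churn` are names made up for illustration (they are not part of Stan, the TBB or Hoard), and the block sizes are arbitrary. Compile with thread support enabled (e.g. `-pthread` with g++).

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical workload: repeatedly allocate and free many small vectors,
// the kind of pattern that stresses the allocator when threads run at once.
// Returns the number of elements touched so the work is not optimized away.
std::size_t churn(std::size_t rounds, std::size_t blocks) {
  std::size_t touched = 0;
  for (std::size_t r = 0; r < rounds; ++r) {
    std::vector<std::vector<double>> pool;
    pool.reserve(blocks);
    for (std::size_t b = 0; b < blocks; ++b) {
      pool.emplace_back(16 + (b % 48), 1.0);  // small, variable-size blocks
      touched += pool.back().size();
    }
    // pool goes out of scope here, handing every block back to the allocator
  }
  return touched;
}

// Runs churn() on n_threads threads at once; returns wall time in seconds.
double time_churn(unsigned n_threads, std::size_t rounds, std::size_t blocks) {
  std::vector<std::size_t> results(n_threads, 0);
  std::vector<std::thread> workers;
  auto start = std::chrono::steady_clock::now();
  for (unsigned t = 0; t < n_threads; ++t) {
    // each thread writes only to its own slot in results
    workers.emplace_back([&results, t, rounds, blocks] {
      results[t] = churn(rounds, blocks);
    });
  }
  for (auto& w : workers) w.join();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Calling e.g. `time_churn(8, 200, 10000)` from a small `main` and running the binary once as is, and once with the allocator preloaded through the dynamic loader (`DYLD_INSERT_LIBRARIES=/path/to/libhoard.dylib` on Mac, `LD_PRELOAD=/path/to/libhoard.so` on Linux; the paths are placeholders), should show whether the allocator matters for this kind of workload. No recompilation is needed between the two runs.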

I increased the benchmark problem size a bit and I am getting a consistent edge for using the Intel TBB directly over the PSTL:

PSTL:
real	2m49.168s
user	22m8.761s

Intel TBB:
real	2m34.776s
user	20m21.620s

Intel TBB with Hoard:
real	2m0.320s
user	16m49.180s

PSTL with Hoard:
real	2m19.731s
user	19m19.224s

Intel TBB with Hoard (2nd run):
real	2m0.745s
user	17m2.926s

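So using the Intel TBB directly clearly beats the PSTL, and Hoard gives nice performance gains on top of either: wall time drops by roughly 22% for the Intel TBB and roughly 17% for the PSTL.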
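
The relative gains can be recomputed from the `real` values above with a few lines of code. This is just a sketch of the arithmetic; the helper names `to_seconds` and `saving` are made up here, and the parser assumes the `XmY.YYYs` format that `time` printed above.

```cpp
#include <cstddef>
#include <string>

// Converts a `time`-style duration such as "2m49.168s" into seconds.
// Assumes the string contains an 'm' separating minutes from seconds.
double to_seconds(const std::string& t) {
  std::size_t m = t.find('m');
  double minutes = std::stod(t.substr(0, m));
  double seconds = std::stod(t.substr(m + 1));  // stod stops at the 's'
  return 60.0 * minutes + seconds;
}

// Fraction of wall time saved when going from `before` to `after`.
double saving(const std::string& before, const std::string& after) {
  return 1.0 - to_seconds(after) / to_seconds(before);
}
```

For instance, `saving("2m34.776s", "2m0.320s")` is the gain from Hoard with the Intel TBB and comes out at about 0.22, while `saving("2m49.168s", "2m34.776s")` compares the plain PSTL and Intel TBB runs.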