Hi!
I was still a bit bothered by the performance gap between MPI and threading, as I was expecting threading to be just as fast as MPI now that the faster TLS changes have been merged. Thinking about it, I figured that the threading program may suffer from inefficient memory allocation, since we do not use a so-called scalable memory allocator, which is strongly advised for multi-threaded applications. A scalable memory allocator essentially ensures memory locality with respect to the core which requests the memory. The nice thing is that with the Intel TBB we can simply link the Stan program against their malloc proxy library, which replaces all malloc calls with a scalable malloc. Here is the result from that:
Given that I have not changed any code, but just linked against the tbbmalloc_proxy library this is a really nice result.
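For anyone who wants to try the effect outside of Stan, here is a minimal sketch of an allocation-heavy threaded program that can be built with and without the proxy library. This is not the benchmark behind the result above; the names `churn` and `run_benchmark`, the file name `bench.cpp` and the workload are made up for illustration, and I assume a Linux-like toolchain where the TBB is installed and `-ltbbmalloc_proxy` resolves.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical workload: repeatedly allocate and free a small vector.
// Every iteration goes through the allocator (malloc/free underneath),
// which is what tbbmalloc_proxy swaps out when it is linked in.
std::size_t churn(std::size_t n_iter, std::size_t block_size) {
  assert(block_size > 0);
  std::size_t checksum = 0;
  for (std::size_t i = 0; i < n_iter; ++i) {
    std::vector<double> v(block_size, 1.0);  // allocation
    checksum += static_cast<std::size_t>(v[i % block_size]);
  }  // deallocation at end of each iteration
  return checksum;
}

// Runs churn() on n_threads threads at once and returns the wall time in
// seconds. The summed checksum is written to `total` so the work has an
// observable result.
double run_benchmark(unsigned n_threads, std::size_t n_iter,
                     std::size_t block_size, std::size_t& total) {
  std::vector<std::size_t> results(n_threads, 0);
  std::vector<std::thread> workers;
  const auto start = std::chrono::steady_clock::now();
  for (unsigned t = 0; t < n_threads; ++t) {
    workers.emplace_back([&results, t, n_iter, block_size] {
      results[t] = churn(n_iter, block_size);
    });
  }
  for (auto& w : workers) w.join();
  const auto stop = std::chrono::steady_clock::now();
  total = 0;
  for (std::size_t r : results) total += r;
  return std::chrono::duration<double>(stop - start).count();
}
```

Call `run_benchmark` from a small `main` that prints the returned time, then build it once as `g++ -O2 -pthread bench.cpp -o bench` and once with `-ltbbmalloc_proxy` appended. The source is identical in both cases; only the link line changes. Whether and how much the timing differs will depend on the machine, the compiler and the allocation pattern, so treat this as a way to experiment and not as a prediction.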
It also shows that multi-core stuff is more involved than one might think, because memory locality adds complexity. Our current approach of growing a single memory pool may in fact not be adequate, since that pool is always in the vicinity of the “main” core, but not of all the others.
The above is based on the TLSv6 PR which will hopefully land in Stan-math soon.
These results suggest that we should push the Intel TBB into Stan-math sooner rather than later. I think it is worthwhile to do so for the tbbmalloc_proxy alone.
Best,
Sebastian