Map_rect threading

wds15 · May 20, 2019, 7:08am

Hi!

I was still a bit bothered by the performance gap between MPI and threading, since I expected threading to be just as fast as MPI with the recent faster-TLS merge. Thinking about it, I figured that the threaded program may suffer from inefficient memory allocation, since we do not use a so-called scalable memory allocator, which is strongly advised for multi-threaded applications. A scalable memory allocator essentially ensures memory locality with respect to the core that requests the memory. The nice thing is that with the Intel TBB we can simply link the Stan program against their malloc library, which replaces all malloc calls with a scalable malloc. Here is the result from that:

[Figure: wallclock_threads-v2_tbbmalloc]

Given that I have not changed any code and just linked against the tbbmalloc_proxy library, this is a really nice result.
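For anyone who wants to try the effect outside of Stan, here is a minimal sketch of an allocation-heavy threaded workload. It is not the benchmark behind the plot above; the function names (`allocate_in_thread`, `run_alloc_benchmark`, `time_alloc_benchmark`) and the block size are made up for illustration. The idea is only that the same unchanged code can be timed once with the default allocator and once with tbbmalloc_proxy.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical workload: allocate many small blocks, touch them, free them.
// This stresses whatever malloc implementation the program ends up using.
std::size_t allocate_in_thread(std::size_t num_allocs) {
  std::vector<double*> blocks;
  blocks.reserve(num_allocs);
  for (std::size_t i = 0; i < num_allocs; ++i) {
    double* p = new double[8];
    p[0] = static_cast<double>(i);  // touch the memory so it is really used
    blocks.push_back(p);
  }
  for (double* p : blocks) delete[] p;
  return num_allocs;
}

// Run the workload on num_threads threads concurrently and return the
// total number of allocations performed across all threads.
std::size_t run_alloc_benchmark(unsigned num_threads,
                                std::size_t allocs_per_thread) {
  assert(num_threads > 0);
  std::vector<std::size_t> counts(num_threads, 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counts, t, allocs_per_thread] {
      counts[t] = allocate_in_thread(allocs_per_thread);
    });
  }
  for (auto& w : workers) w.join();
  std::size_t total = 0;
  for (std::size_t c : counts) total += c;
  return total;
}

// Wall-clock time in seconds for one benchmark run.
double time_alloc_benchmark(unsigned num_threads,
                            std::size_t allocs_per_thread) {
  const auto start = std::chrono::steady_clock::now();
  run_alloc_benchmark(num_threads, allocs_per_thread);
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Call `time_alloc_benchmark` from a small `main`, build with `-pthread`, and compare a plain build against one that additionally links with `-ltbbmalloc_proxy` (on Linux, preloading the library via `LD_PRELOAD=libtbbmalloc_proxy.so.2` should have the same effect without relinking). The exact library name and path depend on the TBB installation, so treat those as assumptions.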

It also shows that multi-core programming is more involved than one might think, since memory locality is a complex matter. Our current approach of growing a single memory pool may in fact not be adequate, since that pool always sits in the vicinity of the “main” core, but not of all the others.
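To illustrate the locality point, here is a rough sketch of the alternative to a single shared pool: one pool per thread, created and first written by the thread that uses it. This is not Stan-math code; `BumpPool`, `local_pool` and `use_local_pools` are hypothetical names, and the 1 MiB pool size is arbitrary. The assumption is that the operating system places pages near the core that first touches them, which is the usual default on Linux.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical bump-pointer pool: hands out slices of one buffer and never
// frees individual allocations.
class BumpPool {
 public:
  // The vector constructor zero-fills the buffer, so the constructing
  // thread is the first one to touch these pages.
  explicit BumpPool(std::size_t bytes) : buffer_(bytes), used_(0) {}

  void* allocate(std::size_t bytes) {
    assert(bytes > 0);
    const std::size_t align = alignof(std::max_align_t);
    const std::size_t aligned = (bytes + align - 1) / align * align;
    if (used_ + aligned > buffer_.size()) return nullptr;  // pool exhausted
    void* p = buffer_.data() + used_;
    used_ += aligned;
    return p;
  }

  std::size_t used() const { return used_; }

 private:
  std::vector<unsigned char> buffer_;
  std::size_t used_;
};

// One pool per thread, constructed lazily on first use by that thread.
BumpPool& local_pool() {
  thread_local BumpPool pool(1 << 20);  // 1 MiB, arbitrary for this sketch
  return pool;
}

// Each worker allocates from its own pool; returns total bytes handed out.
std::size_t use_local_pools(unsigned num_threads,
                            std::size_t doubles_per_thread) {
  std::vector<std::size_t> used(num_threads, 0);
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&used, t, doubles_per_thread] {
      BumpPool& pool = local_pool();
      double* x = static_cast<double*>(
          pool.allocate(doubles_per_thread * sizeof(double)));
      if (x != nullptr) {
        for (std::size_t i = 0; i < doubles_per_thread; ++i) {
          x[i] = static_cast<double>(i);
        }
      }
      used[t] = pool.used();
    });
  }
  for (auto& w : workers) w.join();
  std::size_t total = 0;
  for (std::size_t u : used) total += u;
  return total;
}
```

With a single global pool, all workers would instead draw from memory that the main thread created and touched first, which is the situation described above.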

The above is based on the TLSv6 PR which will hopefully land in Stan-math soon.

These results suggest that we should push the Intel TBB into Stan-math sooner rather than later. Just for the tbbmalloc_proxy it is worthwhile to do so, I think.

Best,
Sebastian
