This weekend a few improvements to map_rect landed in Stan-math develop, which made me curious how the current threading and MPI implementations fare against each other. Rerunning the benchmark from my StanCon 2018 contribution now gives this:
As you can see:
- the 1-core performance for MPI and threading is essentially the same
- the scaling with more CPUs is really good for MPI and a bit weaker for threading
Very likely the threading case suffers from not using a thread pool: threads get re-created for each evaluation, and the memory for the AD tree needs to be reallocated each time. The forthcoming Intel TBB will introduce such a thread pool and hopefully provide better scaling here.
As a side effect of bringing down the performance penalty of turning on threading, we now use a different technique to store things. The consequence is that map_rect now works with threading on Windows using the RTools gcc 4.9.3 compiler.
If people out there could confirm once more that map_rect now works on Windows with the gcc 4.9.3 from RTools, that would be very reassuring. I have tested it myself and it is now part of our testing suite, so it should all work. To try it out you need to download the current cmdstan develop branch.
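For anyone wanting to try this, a sketch of enabling threading in CmdStan, assuming the standard make/local mechanism; the model and data paths below are placeholders, not files from this thread:

```shell
# in cmdstan/make/local (create the file if it does not exist):
STAN_THREADS=true

# rebuild so the flag takes effect (model path is a placeholder)
make clean-all
make path/to/your_model

# at runtime, tell map_rect how many threads it may use
export STAN_NUM_THREADS=4
./path/to/your_model sample data file=your_data.R
```

Setting STAN_NUM_THREADS=-1 uses all available cores.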
I was more referring to people on the forum who have their models up and running. I saw a while ago that some Windows users even went to the trouble of running things inside Docker. Anyway, if you want to get the above up and running, you can use the files attached.
I was still a bit bothered by the performance gap between MPI and threading, since I expected threading to be just as fast as MPI after the recent TLS merge. Thinking about it, I figured that the threaded program may suffer from inefficient memory allocation, since we do not use a so-called scalable memory allocator, which is strongly advised for multi-threaded applications. A scalable memory allocator essentially ensures memory locality with respect to the core which requests the memory. The nice thing is that with the Intel TBB we can simply link the Stan program against their malloc library, which replaces all malloc calls with a scalable malloc. Here is the result from that:
Given that I have not changed any code but just linked against the tbbmalloc_proxy library, this is a really nice result.
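For reference, a sketch of what that linking step looks like in a CmdStan make/local; the -L path is an assumption and needs to point at your own TBB installation:

```make
# in make/local: link every model against TBB's malloc proxy, which
# transparently replaces malloc/free with the scalable TBB versions.
# /path/to/tbb/lib is a placeholder for your TBB installation.
LDFLAGS += -L/path/to/tbb/lib -ltbbmalloc_proxy
```

No source changes are required; the proxy library intercepts the standard allocation calls at link time.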
It also shows that multi-core work is more involved than one might think, since memory locality is more complex than it looks (and our current approach of growing a single memory pool may in fact not be adequate, since that pool always lives in the vicinity of the "main" core, but not of all the others).
The above is based on the TLSv6 PR which will hopefully land in Stan-math soon.
These results suggest that we should push the Intel TBB into Stan-math sooner rather than later; the tbbmalloc_proxy alone makes it worthwhile, I think.
developTLS: uses threading with the TLS PR (STAN_THREADS is on)
developMPI: uses MPI (but STAN_THREADS is off)
developTLS_tbbmalloc: same as developTLS, but this one links in the libtbbmalloc_proxy - this leads to a scalable malloc being used
Where is the malloc pressure? In intermediate Eigen and std vector structures? There shouldn’t be any from the autodiff stack as that just happens once.
The malloc happens once, which is good, but it happens close to whichever core is the main one. In a multi-threaded application the allocated memory needs to be close to the core which actually does the work. That is a different thing, and it means memory locality is harder to achieve.
MPI is a pain to program against, while threading is much easier for users to set up and use. MPI will probably never work on Windows, and threading is also far easier to develop with.
And, I mean, MPI has perfectly scalable allocation…so all we need to take into account are the specifics of threading to get good performance. I very much hope that a backend which uses the Intel TBB for all parts of the parallelization will get us even closer to MPI performance.
As I am building an R package based on Stan for third-party users, I would like confirmation of whether threading depends on OpenMPI. In other words, is some MPI-related framework needed, or is installing rstan and setting the right flags enough?
When two pieces of data required by different cores sit in the same cache line, the entire line gets copied. The articles on this point make it sound like it can devolve into thrashing at the cache level, with the line being copied back and forth between cores.