This weekend a few improvements to map_rect went into the Stan-math develop branch, which made me curious how the current threading and MPI implementations fare against each other. Rerunning the benchmark from my StanCon 2018 contribution now gives the following:
As you can see:
- the single-core performance for MPI and threading is essentially the same
- the scaling with more CPUs is very good for MPI and somewhat less good for threading
Very likely the threading case suffers from the lack of a thread pool: threads are re-created for each call, and the memory for the AD tree must be reallocated each time. The forthcoming Intel TBB integration will introduce such a thread pool and should (hopefully) provide better scaling here.
As a side effect of reducing the performance penalty of turning on threading, we now use a different technique to store things. A consequence of this is that map_rect does work with threading on Windows using the RTools gcc 4.9.3 compiler.
If people out there could confirm once more that map_rect now works on Windows with the gcc 4.9.3 from RTools, that would be very reassuring. I have tested it myself and it is now part of our testing suite, so it should all work. To try it out, one would need to download the current cmdstan.
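For reference, enabling threading in CmdStan worked roughly like this at the time (a sketch based on the CmdStan documentation of that era; double-check the flag names against your version):

```
# In the CmdStan directory, add the threading flag to make/local:
#   CXXFLAGS += -DSTAN_THREADS
# then rebuild:
#   make build
#   make path/to/your_model
# At runtime, the number of threads map_rect may use is controlled by
# an environment variable (-1 means "use all available cores"):
#   STAN_NUM_THREADS=4 ./path/to/your_model sample data file=...
```

The MPI backend is enabled analogously via a build-time switch in make/local rather than a runtime environment variable.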