This is for @bgoodri (or any other Windows build specialist who wants to contribute): Could you please try to get the Intel TBB working for stan-math with the gcc 4.9.3 from RTools? The stan-math branch to try is feature/map_rect-tbb. When both -DSTAN_THREADS and -DSTAN_TBB are defined, the Intel TBB is used as the map_rect_concurrent backend. The tests which need to pass are:
However, I have to say that we have never gotten thread local storage working on Windows with this compiler. So it may be enough for the moment to get only the prim version working (without defining STAN_THREADS in this case). Maybe things improve once we merge the current change to how we handle thread local storage for the AD tape (which I shared in the last meeting)… I don’t know yet.
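To make the thread local storage point concrete, here is a minimal sketch of what "one AD tape per thread" means in plain C++11. All names (SketchTape, sketch_tape, sketch_record, sketch_record_in_new_thread) are hypothetical and this is not the stan-math code; as I understand it, the real library only applies the thread_local keyword when STAN_THREADS is defined, whereas the sketch always uses it.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the AD tape; not the stan-math type.
struct SketchTape {
  std::vector<double> nodes;
};

// One tape per thread: `thread_local` gives every thread its own instance.
// This keyword is the compiler feature that has to work on the target
// toolchain.
inline SketchTape& sketch_tape() {
  thread_local SketchTape tape;
  return tape;
}

// Record a value on the calling thread's tape and return the new tape size.
inline std::size_t sketch_record(double value) {
  sketch_tape().nodes.push_back(value);
  return sketch_tape().nodes.size();
}

// Start a fresh thread, record one value there, and report the tape size
// that thread observed. The new thread starts with its own empty tape.
inline std::size_t sketch_record_in_new_thread(double value) {
  std::size_t seen = 0;
  std::thread worker([&seen, value] { seen = sketch_record(value); });
  worker.join();
  return seen;
}
```

If the main thread records two values and then calls sketch_record_in_new_thread, the worker sees a tape of size 1 and the main tape still has size 2. A compiler with broken thread local storage support would fail to build or run exactly this kind of code.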
Compiling the Intel TBB on Windows worked for me with a recent MinGW compiler suite. If we get this working, that would be great, as this is probably the biggest hurdle to including the TBB in Stan.
Without having built the TBB on Windows myself, I think we can conclude that it is possible to achieve this, because:
- Windows is listed as a supported platform of the Intel TBB
- The CRAN package RcppParallel does build the TBB on Windows using RTools, see their doc & their github
From what I recall, that was the last bit we needed to follow up on. I will start a wiki page for the Intel TBB which will be based on our dependency checklist.
I believe this is correct but have not yet had a chance to verify further. I did get an example working with parallelstl, which apparently used TBB, but that only utilized stan/math/prim.
The makefiles work well, but are not super clean yet. They might work on Windows, but I doubt it.
make tbb is all you need to do on a Mac (and presumably also on Linux).
I haven’t yet started to do fancy stuff like enforcing that the exact same compiler is used for the TBB as for Stan (it may actually work out of the box, since the makefile variables should be taken over).
I am very curious about the Windows thing… my guess is that we need to modify the makefiles, as I think that is what RcppParallel did.
I have the TBB running on a local branch and I am able to turn on threading, and the thing still runs ~12% faster than our current develop without threading (turning on threading used to slow things down)!!! I will run a few more tests, but it looks to me as if we can just turn on threading. The bit that seems to make the difference is the TBB’s malloc replacement, which can deal a lot better with those massive amounts of tiny allocations which we tend to do.
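For anyone wanting to poke at the allocation pattern in isolation, here is a rough sketch of "massive amounts of tiny allocations" with the allocator as a template parameter. The function name sketch_tiny_allocations is hypothetical, and the sketch assumes the allocator hands out plain double* pointers. It defaults to std::allocator so it builds without the TBB. Note that the malloc replacement mentioned above is swapped in globally, which needs no code change like this.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Allocate `count` single doubles one at a time through `Alloc`, sum them,
// and free them again. This mimics many tiny heap allocations; it is an
// illustration only and not how stan-math manages its memory.
template <typename Alloc = std::allocator<double>>
double sketch_tiny_allocations(std::size_t count, Alloc alloc = Alloc()) {
  using Traits = std::allocator_traits<Alloc>;
  std::vector<double*> blocks;
  blocks.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    double* p = Traits::allocate(alloc, 1);  // one tiny allocation
    Traits::construct(alloc, p, static_cast<double>(i));
    blocks.push_back(p);
  }
  double total = 0.0;
  for (double* p : blocks) {
    total += *p;
    Traits::destroy(alloc, p);
    Traits::deallocate(alloc, p, 1);
  }
  return total;
}
```

With the TBB installed, passing tbb::scalable_allocator<double> (from tbb/scalable_allocator.h) as Alloc and timing both variants would give a first impression of the difference. I have not measured that here.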
No, I mean that we can simply ship a thread-safe stan-math - we did not do that up to now, because we paid ~20% in performance for making things thread safe. This performance penalty is basically gone with the new design.
For MPI map_rect, the implementation would right now always split first by MPI, and a nested map_rect call would then use threads if enabled.
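A small sketch of that two-level split, just as index arithmetic: jobs are first divided across MPI ranks and each rank's share is then divided across its threads. The helper names (sketch_split, sketch_two_level_split) are hypothetical, and the use of contiguous, near-equal chunks is my assumption for illustration; the actual map_rect scheduling may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using SketchRange = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Split [begin, end) into `parts` contiguous chunks whose sizes differ by at
// most one. Hypothetical helper, not stan-math code.
inline std::vector<SketchRange> sketch_split(std::size_t begin,
                                             std::size_t end,
                                             std::size_t parts) {
  assert(begin <= end);
  std::vector<SketchRange> chunks;
  if (parts == 0) return chunks;
  const std::size_t n = end - begin;
  const std::size_t base = n / parts;
  const std::size_t extra = n % parts;  // the first `extra` chunks get one more
  std::size_t start = begin;
  for (std::size_t p = 0; p < parts; ++p) {
    const std::size_t len = base + (p < extra ? 1 : 0);
    chunks.emplace_back(start, start + len);
    start += len;
  }
  return chunks;
}

// Outer split over MPI ranks first, then an inner split of each rank's chunk
// over threads. plan[r][t] is the job range for thread t on rank r.
inline std::vector<std::vector<SketchRange>> sketch_two_level_split(
    std::size_t num_jobs, std::size_t num_ranks, std::size_t threads_per_rank) {
  std::vector<std::vector<SketchRange>> plan;
  for (const SketchRange& rank_chunk : sketch_split(0, num_jobs, num_ranks)) {
    plan.push_back(
        sketch_split(rank_chunk.first, rank_chunk.second, threads_per_rank));
  }
  return plan;
}
```

For example, sketch_two_level_split(10, 2, 2) gives rank 0 the jobs [0, 5) and rank 1 the jobs [5, 10), and within rank 0 the two threads get [0, 3) and [3, 5).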
Posting here so I don’t forget it. The msys2 people (like the RStudio people, but unlike me) have succeeded in getting TBB to build on Windows with gcc