This is for @bgoodri (or any other Windows build specialist who wants to contribute): Could you please try to get the Intel TBB working for stan-math with the gcc 4.9.3 from RTools? The branch to try in stan-math is `feature/map_rect-tbb`. Whenever you define `-DSTAN_TBB`, the Intel TBB is used as the `map_rect_concurrent` backend. The tests which you need to get passed are:
However, I have to say that we have never gotten thread local storage working on Windows with this compiler. So possibly it is enough to get only the prim version working for the moment (without defining `STAN_THREADS` in this case). Maybe things improve if we merge the current change to how we handle the thread local storage for the AD tape (which I shared in the last meeting)… I don’t know yet.
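For readers unfamiliar with why thread local storage matters here, below is a minimal sketch of the general idea in plain C++11: each thread gets its own tape, so recording on one thread never touches another thread's tape. The names `ToyTape`, `tape`, `record`, and `record_on_new_thread` are hypothetical and made up for illustration; this is not how stan-math actually implements its AD tape.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an AD tape; NOT stan-math's actual type.
struct ToyTape {
  std::vector<double> nodes;
};

// One tape per thread: every thread calling tape() gets its own instance,
// created lazily on that thread's first call.
inline ToyTape& tape() {
  thread_local ToyTape instance;
  return instance;
}

// Records n nodes on the calling thread's tape and returns that tape's size.
inline std::size_t record(std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    tape().nodes.push_back(static_cast<double>(i));
  }
  return tape().nodes.size();
}

// Runs record(n) on a fresh thread and returns the tape size seen there.
inline std::size_t record_on_new_thread(std::size_t n) {
  std::size_t result = 0;
  std::thread worker([&result, n] { result = record(n); });
  worker.join();
  return result;
}
```

Calling `record(3)` on the main thread and then `record_on_new_thread(5)` leaves the main thread's tape at 3 nodes, since the worker wrote only to its own tape. Whether `thread_local` behaves this way with the RTools gcc 4.9.3 on Windows is exactly the open question above.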
Compiling the Intel TBB on Windows worked for me with a recent MinGW compiler suite. If we get this working, that would be great, as this is probably the biggest hurdle to including the TBB in Stan.
Thanks a lot! Let me know if you need more input.
Without having built the TBB on Windows myself, I think we can conclude that it is possible to achieve this, because:
- Windows is a listed supported platform of the Intel TBB
- The CRAN package `RcppParallel` does build the TBB on Windows using the RTools there, see their doc & their github
From what I recall, that was the last bit we needed to follow up on. I will start a wiki page for the Intel TBB which will be based on our dependency checklist.
I believe this is correct but have not yet had a chance to verify further. I did get an example working with parallelstl, which apparently used TBB, but that only utilized `stan/math/prim`.
I have the TBB now on a local branch in stan-math. Basic setup with makefiles is in place (and works for macOS and presumably Linux).
Should I push it to the repo so that others may have a test run on windows in terms of compiling it?
That would be good. I’ll try to test it on Windows today.
Here you go: https://github.com/stan-dev/math/tree/feature/intel-tbb-lib
The makefiles work well, but are not super clean yet. They might work on Windows, but I doubt it.
`make tbb` is all you need to do on a Mac (and presumably also on Linux).
I haven’t yet started to do fancy stuff like enforcing that the exact same compiler is used for the TBB as for Stan (it may actually work out of the box since the makefile variables should be taken over).
I am very curious about the Windows side… my guess is that we need to modify the makefiles, as I think that is what RcppParallel did.
Any news on Windows & the TBB?
I have the TBB running on a local branch, and I am able to turn on threading and the thing still runs ~12% faster than our current develop without threading (turning threading on used to slow things down)!!! I will run a few more tests, but it looks to me as if we can just turn on threading. The bit that seems to make the difference is the TBB’s malloc replacement, which can deal a lot better with the massive amounts of tiny allocations we tend to do.
Do you mean for MPI or for something else? I’m still curious about how all of the parallelizations are going to interact with one another.
No, I mean that we can simply ship a thread-safe stan-math. We did not do that up to now because we pay ~20% in performance for making things thread safe. This performance penalty is basically gone with the new design.
For `map_rect`, the implementation would right now always split first by MPI, and a nested `map_rect` call would then use threads if enabled.
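To make the ordering concrete, here is a rough sketch of a two-level split: jobs are first divided across MPI ranks, and each rank's share is then divided again across threads. It only computes index ranges and involves no MPI or TBB calls. The function names `split_range` and `two_level_split` and the even-chunking rule are assumptions for illustration, not what `map_rect` actually does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Splits [begin, end) into `parts` contiguous chunks whose sizes differ
// by at most one; earlier chunks receive the extra elements.
inline std::vector<Range> split_range(std::size_t begin, std::size_t end,
                                      std::size_t parts) {
  assert(parts > 0 && begin <= end);
  std::vector<Range> chunks;
  const std::size_t n = end - begin;
  std::size_t start = begin;
  for (std::size_t p = 0; p < parts; ++p) {
    const std::size_t size = n / parts + (p < n % parts ? 1 : 0);
    chunks.emplace_back(start, start + size);
    start += size;
  }
  return chunks;
}

// Outer level: one chunk per (hypothetical) MPI rank.
// Inner level: each rank's chunk is split again across its threads.
inline std::vector<std::vector<Range>> two_level_split(
    std::size_t num_jobs, std::size_t num_ranks,
    std::size_t threads_per_rank) {
  std::vector<std::vector<Range>> plan;
  for (const Range& rank_chunk : split_range(0, num_jobs, num_ranks)) {
    plan.push_back(
        split_range(rank_chunk.first, rank_chunk.second, threads_per_rank));
  }
  return plan;
}
```

For example, `two_level_split(10, 2, 2)` gives rank 0 the thread ranges [0, 3) and [3, 5), and rank 1 the ranges [5, 8) and [8, 10).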
Posting here so I don’t forget it. The msys2 people (like the RStudio people, but unlike me) have succeeded in getting TBB to build on Windows with gcc