TBB and 2.21

Hi!

We agreed on to include the Intel TBB into Stan-math for 2.21 only if we can include a benefit for the users by providing a scheduler aware map_rect. In case we don’t get codes ready for map_rect, then 2.21 can be the last fully GPL-2 compatible release.

I was wondering if we could go for the following: Given that we only use the TBB for map_rect with 2.21 it should be possible to write the code such that the TBB is mandatory only if STAN_THREADS is turned on, but the TBB is actually not used whenever STAN_THREADS is not set. Thus, the 2.21 release would only require the TBB whenever threading is used.

To be clear - in going forward the TBB will be made mandatory for any setting of STAN_THREADS. However, integrating it like this will make 2.21 semi-compatible with the GPL-2 (just delete the TBB and don’t use threading) and moreover, the interfaces get a bit more time to deal with 2.21 wrt to the TBB.

Integrating the TBB in this manner may require an additional if-def here or there… not many for sure, but it is absolutely manageable.

(I am assuming that once we have threaded multi-chain warmup in stan, then most users will switch to threaded Stan anyways as a default).

Opinions?

Tagging @rok_cesnovar, @syclik, @bbbales2, @seantalts, @Bob_Carpenter, @Stevo15025

Sebastian

If this can be done with manageable if-defing I would say go for it since it would be a “compromise”:

  • we have the last version that can be GPL-2 compatible with reasonable effort
  • we have TBB in the codebase which means interfaces can test it

For me personally the latter is the more important one.

Would you do that in the already open PR?

There is a little over 10 days left until the feature freeze, so if you need any help let me know. I have cleared the OpenCL backlog for this release and am awaiting review and focusing on getting the GLM support in stanc3 but other than that I have some time to help out if I can.

Nope… the open PR only deals with makefiles. We leave that as is I would say. Thus, if you want to rip out the TBB, the user can define TBB_LIBRARIES to be empty and then the makefiles do not build the TBB anymore (I think).

I am working on another PR which introduces the TBB ready for use with the reverse mode. I will include if-defs there as discussed. I should have time today for some work on this… let’s see.

Sebastian

1 Like

I don’t see the value of a final pre-GPLv2 release. Is there anyone that’s going to benefit from exactly one more release? Or is there some kind of critical bug fix and some users that need that licensing?

I agree with Bob. Even if it’s easy to make it optional, since there’s no huge reason to do this, just get it in.

This was the common agreement during the last stan-math meeting. Please refer to the meeting notes from that meeting. I see your point, but given we agreed on that we go along this path at the moment…unless @syclik is fine with going all in now as you suggest.

However, I am really far… and would appreciate if now some eyes could look at what I am proposing and maybe someone can help me typing out even more tests (@rok_cesnovar is reviewing I suppose, so let him be the reviewer and not a writer of this) .

So the branch feature/intel-tbb-init branches off the intel-tbb-lib branch and it includes the following extras to check:

  • stan/math/prim/scal/fun/init_threadpool_tbb.hpp tested with test/unit/math/prim/scal/fun/init_threadpool_tbb_test.cpp … the function is doced and mostly tested. The idea is that this function must be called by the interfaces before firing off a chain. This will initialize the TBB threadpool. In case this function is not called, then the TBB initializes itself with its defaults. Everything will run, but the STAN_NUM_THREADS would be ignored.

  • stan/math/rev/core/init_chainablestack_tbb.hpp adds a so-called observer to the TBB system. The observer is notified whenever a thread enters the threadpool. Upon notification the AD tape for the given thread is initialized such that we do not need to deal with AD tape initialization ourselves. As long as a TBB thread is used everything is fine. I left for debugging couts in there. This is tested in test/unit/math/rev/mat/functor/gradient_test.cpp. The gradient test would fail if the AD tape would not have been setup.

So please have a look, @rok_cesnovar, at this. If you approve this design I will finish it up and I take any hand who wants to help me on that (there are some todos in the code).

Once this is done, then refactoring the map_rect will be very easy:

  • we will just switch to use the tbb::parallel_for… and that’s it for stan-math
  • as a second thing we need to make within cmdstan a call to the stan::math::init_threadpool_tbb() function which will initialize the TBB threadpool with the right number of threads => all interfaces have to do this in the future to get STAN_NUM_THREADS as maximal concurrency; otherwise the TBB default will be active (all cores available).

It’s too late for me to do an issue + the PR for tonight… but all in all we are on track for 2.21 unless the design is not good to your eyes, so please let me know what you think ASAP. Thanks.

Sebastian

I’ll take a look at this tmrw. May be good to open up a draft or WIP PR for discussion

I will take a look tomorrow before noon. Too little focus to do it thoroughly right now.

Judging just from your post it seems its what we agreed at the Math meeting (discussion around a helper function).

EDIT:

Quick comment not on the general design: I would prefer not using getenv. In this case I would prefer ifdefing and just assigning a default value when the define is not present. That way you also dont need to do the lexical parse as this becomes a compile time error.

I am not sure if I understand your comment. How do you suggest should STAN_NUM_THREADS be honored if you do not want to use getenv? Should the interfaces do that?

Yeah, never mind, its a environment variable not a define like STAN_THREADS. My brain just jumps to the conclusion that its a compile time define whenever I see STAN_*. I will call it a day. Start with fresh eyes tomorrow.

Ah…ok…will open a or tomorrow…maybe some fresh us eyes look while we sleep…

WIP PR is open, see below.

@rok_cesnovar I did some simplifications compared to yesterday.

Things to consider design wise from my view are listed in the Todo section of the PR. More discussion there, I suppose.

Thanks for letting me know—I hadn’t seen that. I wasn’t trying to open an already closed decision.

1 Like

Here are the notes from the 9/19/2019 meeting.

There are a few good fixes in this release. But no, there wasn’t a real big value in a final pre-GPLv2 release, which was why we were ok either going forward with this or not. @wds15, I don’t think the decision was properly reflected: we are ok including TBB with the release if we can make use of it by the code freeze date. If we can not, then we’ll have it with part of the next release. (We don’t need a final pre-GPLv2 release.)

Ok. I think we are basically there. I will then take away those additional if-def guards which right now make the TBB only mandatory with STAN_THREADS. There are a few if-defs which can go away then.

It looks like we have it close to ready - with 9 more days we are good by the time relatively sure.

Hi @syclik!

Just to make sure this doesn’t pass by without notice: As the TBB is now fully integrated into Stan-math, this turns Stan-math into a non header-only library! The TBB must be linked against in any circumstances (with/without threading). Doing so allows for the cleanest design.

The doc for this updated in the places where I came across.

It may appear unnecessary at the moment, but we can only by always linking in the TBB we can get around pouring quite some ifdefs into the code base which we decided against. Since we will almost for sure end-up with threaded warmup in the very near future, I would anyway expect that we will be in a threading-is-the-default world very soon. Once we are there, the TBB facilities are in heavy demand for any Stan program.

Sebastian

Sundials beat the TBB to the punch in making Stan math require linking.

And like Sundials, it looks like I didn’t need to do anything on my end on OS X—just pulling the latest seems to work.

Well…the difference to sundials is that if you do not use sundials functionality in your Stan program, then things still work even without linking to sundials. This is why pystan can get by without linking the sundials libraries. With the tbb this is different. For any Stan program you must link the tbb or it won’t work. That’s new.

And I spend a lot of time on making everything smooth.

Why would it require to run a job that doesn’t utilize threads? Is that necessary?

Yes…you need the tbb for all programs. It makes life for us developers a lot easier.

Edit: and as we will likely have threaded warmup in the very near future, I very much hope so, we will anyway live in a thread only world once we are there…

1 Like