Wrapping calls to ChainableStack

maintenance

#1

Hey devs (especially @Bob_Carpenter, @wds15, @syclik),

There’s some discussion in a pull request I wanted to pull out into its own discourse thread. I’ll link this there and that here at the tops of each, for ease of reference.

In @wds15 valiant effort to get C++11 threading support in the Math lib, it was necessary to change every usage of ChainableStack to call a function to get the context. But all of these calls are nested several layers of object access deep, which is a little bit of a code smell, plus a lot of them look the same.

I would propose a small refactor where we replace many of these with calls to some wrapper function. Bob mentioned there exists one like this for C-style heap arrays, but we could also look at using something like this allocator I wrote for another PR (that I haven’t had time to finish yet). This allows us to use std::vectors that are allocated on our autodiff heap, see this (simple, WIP) test.

Maybe this refactor should come first, which would allow for two things - less files changed by the c++11 threading refactor, and a much easier possibility of an ifdef wrapper that will allow us to support older Apple clang compilers. I would also be happy removing that latter thing as a goal, but I think the small refactor here is a good idea either way.

Thoughts?


Potentially dropping support for older versions of Apple's version of Clang
#2

I don’t think we need this refactor for being able to easily support old Apple compilers. There is only a single line of code right now which triggers the hickup. In order to support the old Apples we would have to something like

  static AutodiffStackStorage_t& context() {
#ifdef STAN_THREAD_LOCALS
    static thread_local AutodiffStackStorage_t ad_stack
#else
    static AutodiffStackStorage_t ad_stack
#endif
        = AutodiffStackStorage_t();
    return ad_stack;
  }

The STAN_THREAD_LOCALS can eventually be defined in automation depending on the compiler being used.

While I agree that a refactor would make sense to do, I would prefer to make it as a separate follow-up pull. To me it feels like a hold-up on our way towards a multi-threaded MPI enabled Stan 2.18.

I like the allocator thing you propose. It looks like it makes sense to do just that; although it would take me a moment to be able to roll it out (or are you offering your hand to do it?). The better use of existing facilities is maybe a compromise; though I still think we divert somewhat to some degree from the intent of the pull (I badly want that multi-core, multi-machine Stan like we all do…soon).


#3

Ah, fair enough. I didn’t mean to hold it up, I just thought you were saying we’d need ifdefs everywhere you had added the .context() calls.

Isn’t this independent of MPI? Won’t MPI give multithreaded, multimachine parallelism already?


#4

If we want this refactor now and put in the joint resources then this will also be done quickly. We can certainly do it if we think its worth it.

And yes, there is a single point of code which controls the thread_local thing. So supporting old Apple compilers does not need to drive this refactor is what I am saying.

MPI will give us the ability to use different processes - and it does not matter if this is on a given machine or/and across machines. However, MPI has additional communication & copying cost which we do not have for a threading thing. So the most efficient approach will be to link different machines with MPI but on a given machine we do threads. This will give the best possible performance.


#5

Cool. So yeah, I’m with you that that PR can proceed without this refactor. We can still use this thread to talk about the refactor.