Thread performance penalty

Hi there!

Currently, the thread safety of stan-math comes with a hefty performance penalty which varies from model to model, ranging from a 5% to a 65% slowdown. I hope we can get away with less of a performance hit and have created an issue for this.

I have discovered that writing things slightly differently in the code can lead to major improvements; see my prototype PR here. The PR uses a slightly different approach, and with this approach I can show that

  • The new AD tape is just as fast as the old AD tape for the case of no threading
  • When threading is enabled we only lose ~5% at most in all of the examples

Obviously I think that we should seriously consider this.

The design of the new AD tape is based on the finding that if we initialize a thread_local variable with a constant expression, then access to that variable is apparently much faster. When playing around with different versions I found that:

// this is like our old design and is rather slow, as we initialize with a dynamic expression which requires an allocation
static thread_local AutodiffStackStorage* ad_tape = new AutodiffStackStorage();

// this is the new proposal; it gives us almost no performance penalty, as we initialize with a constant expression here
static thread_local AutodiffStackStorage* ad_tape = 0;

The disadvantage of the new approach is that within each thread the AD stack has to be initialized before it can be used. However, if we just link into each Stan program a file which contains

AutodiffStackStorage* ad_tape = ChainableStack::instantiate();

then the main process of that Stan program will have an initialized AD stack once execution reaches the main method. So there is no change for any single-threaded Stan program. What does change is how threaded applications use Stan: these applications are now forced to call stan::math::ChainableStack::instantiate() in every new thread in which they want to use stan-math. So while the old design always initialized an AD tape for any new thread, this is no longer the case; the AD tape is only initialized if the application does so explicitly, as in the sketch below.
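To make that concrete, here is a minimal sketch of what a threaded application would have to do (the worker function is made up; only ChainableStack::instantiate() is from the PR):

#include <stan/math.hpp>
#include <thread>

void worker() {
  // required once per thread, before any AD work happens in it
  stan::math::ChainableStack::instantiate();
  // ... do autodiff work with stan-math here ...
}

int main() {
  std::thread t(worker);  // the main thread already has a tape via the linked file
  t.join();
  return 0;
}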

To me that does make sense, as it seems like a big ask to force any application which links against stan-math to have an AD tape ready for every thread it creates. Not every thread can be expected to do AD work with stan-math.

I am planning to run one more benchmark which uses the parallel map_rect to hopefully show that the new approach is just as beneficial there. Comments on this proposal are very welcome. As we have burned ourselves in the past with different AD tape variations, it would be great to ensure that all compilers we want to support have been tested. On the current test pipeline everything passes, which should imply all is good.

Best,
Sebastian


So this would make it so we can remove any extra Makefile flags required to turn on threading, right? That would be awesome as a reduction in code and testing complexity by itself :)

Why is this faster in the case where we do use map_rect and want to instantiate the ad_tape? Is there any way we can just have any calls to map_rect_concurrent initialize it?

It’s really too bad the math library has no notion of a lifecycle and thus an init phase right now. I will make another thread for that idea.

What do you mean by “lifecycle”? If it’s what I imagine you’re thinking, we can introduce it.

Honestly I think that we can remove the STAN_THREADS macro, yes. I would probably leave it in for at least one more release to make sure that things really work in the wild on all those compilers, platforms, etc.

The access to the thread_local tape gets faster. My hunch is that the C++ compiler can generate better code whenever you start off with the pointer being 0 in every thread. This possibly allows the compiler to place the pointer into the thread-specific storage directly, but I am really guessing here. A colleague of mine has offered to look into this by inspecting the generated assembler code.

Note that all tests pass for stan-math. What I did is place a call to ChainableStack::instantiate() inside the map_rect_reduce function. The reason for putting it there is that map_rect_concurrent lives in prim, and I do not want to pull rev code in there. However, the call in map_rect_reduce is cheap since it is executed relatively rarely (in comparison to most other things).

Yup, we need that, I think.

In fact, I do think that we can write things more efficiently if the AD tape is just an instance. This instance can then be swapped, changed, etc…

Why would starting off at 0 help with anything…? If your colleague has time on their hands, that’d be awesome. This result goes pretty hard against my intuition that it shouldn’t matter what value the pointer starts out with if it’s allowed to change at any time and needs to be kept updated. I think we should try to figure out what’s happening here before we switch to this…

I am restarting the upstream tests right now, the failure there is spurious and my fault.

I’m surprised the mutex approach there is faster than the static thread local we had before. I will switch to commenting on the PR now since we’re at a pretty low level.

Which things can we write more efficiently? I had an exploding-universe brain.jpg moment there imagining using that instead of nesting for ODEs and the complicated logic you were/are using for stitching the parallel map AD tapes back together. Is it possible to replace all of that?

… there is one observation which I made and which will likely lead to a full explanation: there was thread-local storage (TLS) in C++ even before C++11’s thread_local, namely the __thread keyword, a compiler-specific extension supported by g++/clang++/Intel compilers long before C++11. However, the __thread keyword can only be used with constant (non-dynamic) initializers. The way things are written right now, you can swap thread_local for __thread, because the initializer is a constant expression; you cannot do that whenever you write a dynamic expression there. So because we use a constant expression now, the compiler can fall back on the __thread implementation, which has been around much longer and is thus better optimized and better supported (I hope so). To me it does make sense that things are easier to handle with a constant expression there, as the initialization is moved to the responsibility of the developer, who knows much better when the initialization needs to occur.
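You can check the constraint directly (an illustration with a made-up type, not code from the PR; __thread is a g++/clang++ extension):

struct things {};

__thread things* tape_a = 0;                // OK: constant initializer
// __thread things* tape_b = new things;    // does not compile: __thread variables
//                                          // may not have dynamic initialization
thread_local things* tape_c = new things;   // compiles, but needs the slower machinery

int main() { return tape_a == 0 && tape_c != 0 ? 0 : 1; }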

Thanks for looking into this, but I do expect the upstream tests to fail right now since we need to update the makefile such that the AD tape gets initialized (have a look at the changes to make/libraries and make/tests).

I am not even sure if we need the mutex in the instantiate function. It does make the function safer, yes, but we should think about whether we really need that protection there… mutexes are always expensive. Then again, the instantiate function should be called very rarely relative to everything else. A lock-free version could look like the sketch below.
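For example, a minimal lock-free sketch (the body here is hypothetical, assuming the thread_local pointer design from the PR; the point is that each thread only ever writes its own TLS slot):

struct AutodiffStackStorage {};  // stand-in for the real type

struct ChainableStack {
  static thread_local AutodiffStackStorage* instance_;

  static AutodiffStackStorage* instantiate() {
    if (instance_ == 0)
      instance_ = new AutodiffStackStorage();  // writes only this thread's slot
    return instance_;
  }
};

thread_local AutodiffStackStorage* ChainableStack::instance_ = 0;

int main() {
  ChainableStack::instantiate();
  return ChainableStack::instance_ != 0 ? 0 : 1;
}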

I am a bit hesitant to do such radical changes, but what I have in my head could (potentially, maybe) make the nesting approach and the AD stitching for parallelization obsolete.

My idea is to have a queue of AD tape instances. The thread_local pointer of a given thread would only point to the end of that queue. When you then want to do nested AD, you just push another AD tape onto the queue, and when you are done with the nested block you pop the very last one (so it’s really a LIFO, i.e. a stack). If we had this design, then parallelization could work such that we add a few AD tapes to the end of the queue, and the final grad call loops in reverse order over the queue and, nested into that, over each AD tape instance. If that works, then it would be a lot simpler (I hope) than the nesting logic. A rough sketch follows after the next paragraph.

The last thing to take care of is the memory pools, which we currently grow once and then leave in memory for performance reasons. To deal with this we should probably implement addition and removal from the queue in such a way that once-allocated memory is never returned to the system.
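Something like this rough sketch (all types and names here are made up, just to make the idea concrete; this is not the actual design):

#include <cstddef>
#include <memory>
#include <vector>

struct Tape { /* vari stack, memory pool, ... */ };

struct TapeQueue {
  std::vector<std::unique_ptr<Tape>> tapes_;  // grows but never shrinks, so pools stay warm
  std::size_t active_ = 0;                    // number of tapes currently in use

  Tape& push() {  // enter a nested block / parallel region
    if (active_ == tapes_.size())
      tapes_.emplace_back(new Tape());  // allocate only if no old tape can be reused
    return *tapes_[active_++];
  }
  void pop() { --active_; }  // leave the block; the tape's memory is kept for reuse
  Tape& top() { return *tapes_[active_ - 1]; }  // what the thread_local pointer would track
};

thread_local TapeQueue tape_queue;

int main() {
  Tape& nested = tape_queue.push();  // enter a nested AD block
  (void)nested;
  tape_queue.pop();  // leave it; the Tape object stays allocated
  return 0;
}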

I am somewhat reluctant to do such a radical change, but if you think the same we should probably give it a go with a prototype (and do some benchmarking).

I think I found it. Dump the following dummy code into https://godbolt.org/

#include <vector>
#include <iostream>

struct things {
    std::vector<std::size_t> stuff;
};

thread_local things* my_things = 0;  // constant initializer

thread_local things* my_other_things = new things;  // dynamic initializer

int square(int num) {
    my_things = new things;

    std::size_t a = my_things->stuff.size(); 
    std::size_t b = my_other_things->stuff.size();

    return num * a * b;
}

int main(int argc, const char* argv[]) {
  std::cout << square(10) << std::endl;
  return 0;
}

Then go to lines 15 and 16 on the C++ side, where I access the two different TLS pointers; you will see the respective assembler code highlighted on the right. Lines 36-39 on the right correspond to the access of the TLS pointer which was set to 0, and for that one there is no call to the “TLS wrapper function”. In lines 40-44 I am accessing the TLS variable which was initialized with a dynamic expression, and there you do see a call out to the “TLS wrapper function”. So it looks like every access to the TLS variable with the non-constant initializer gets a TLS wrapper function call slapped onto it.
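For intuition, here is a hand-written model of what that wrapper does (an illustration with made-up names, not the code the compiler actually emits):

#include <vector>

struct things {
  std::vector<std::size_t> stuff;
};

thread_local things* tls_ptr = 0;    // the pointer the wrapper manages
thread_local bool tls_init = false;  // per-thread guard the wrapper checks

// model of the "TLS wrapper function": every access to a dynamically
// initialized thread_local goes through a call like this
things*& tls_wrapper() {
  if (!tls_init) {
    tls_init = true;
    tls_ptr = new things;  // the dynamic initializer runs on first access
  }
  return tls_ptr;
}

int main() { return tls_wrapper()->stuff.size(); }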


[edit] just saw your godbolt post and I’m playing with it now…

But if __thread can only be used with constant expressions for some reason, and we then trick the compiler by giving it a dynamic expression later, are we not potentially shooting ourselves in the foot? I also see this quote about thread storage duration and worry:

If the thread that executes the expression that accesses this object is not the thread that executed its initialization, the behavior is implementation-defined.

We may be treading into undefined behavior land here. Sorry if this sounds alarmist; we’re just reaching the end of what I know about this part of C++ and I want to make sure I learn enough to responsibly contribute to this discussion.

As an experiment that might help illuminate things: What happens if we just change the type of this thread_local ad stack from a value to a pointer and initialize it in the current way?

I see what you mean - it definitely doesn’t call the wrapper function if you set it to 0 first. I guess that somehow signals to the compiler that we are going to ensure that each thread performs its own initialization of the pointer. If we take this approach, I think we have to be very careful to make sure each thread always calls instantiate() before doing any AD operations. I think you’re doing that right now in the PR, though perhaps only for the first template specialization of map_rect_reduce (for var, var - but don’t you need it for the other two specializations as well?).

So to be clear, right now chainablestack_inst.cpp ensures that the main thread has an AD stack for its computations, right? Why don’t the child threads see that instance when they do the instance_ != 0 check?

Correct. When we set the thread_local pointer to 0, the compiler will avoid those nasty TLS wrapper calls which basically check on every access whether things are initialized or not. This comes at the price of having to ensure that we initialize things ourselves in a given thread before any AD happens. And yes, only the map_rect_reduce step does the actual AD work, and it is the first piece of var-touching code in a new thread as created by async in the map_rect_concurrent function.

Once we have the Intel TBB, things get more elegant, since we can then define things such that whenever a thread enters a so-called thread arena we can act on that event. So we can always call instantiate whenever a thread enters into autodiff tasks. It’s neat!

Yes, correct. Only the main thread has a valid AD tape whenever the main function is reached.

The linker ensures that the code from chainablestack_inst.cpp is executed by the main thread only. All child threads have their own copy of instance_, which is 0 when the thread is created, and it is up to the user of the stan-math library to initialize it. If you do not initialize it in a child thread but start AD operations, then the program will segfault.
