Currently, the thread safety of
stan-math comes with a hefty performance penalty which varies from model to model, but is between 5% - 65% slowdown. I hope we can get away with less of a performance hit and created an issue for this.
I have discovered that writing things slightly different in the code can lead to major improvements, see my prototype PR here. The PR uses a slightly different approach and with this approach I can show that
- The new AD tape is just as fast as the old AD tape for the case of no threading
- When threading is enabled we only loose ~5% at most in all of the examples
Obviously I think that we should seriously consider this.
The design of the new AD tape is based on the finding that if we declare a
thread_local variable to be equal to a constant expression then the access to that variable is apparently much faster. When playing around with different versions I found that:
// this is like our old design and is rather slow as we initialize with an expression which requires dynamic allocations static thread_local AutodiffStackStorage* ad_tape = new AutodiffStackStorage(); // this is the new proposal and it gives us almost no performance penalty as we are initializing with a constant here static thread_local AutodiffStackStorage* ad_tape = 0;
The disadvantage of the new approach is that within each thread the AD stack has to be initialized first before it can be used. However, if we just link with each Stan program a file which contains
AutodiffStackStorage* ad_tape = ChainableStack::instantiate();
then the main process of that Stan program will have an initialized AD stack once execution reaches the
main method. So there is no change for any single-threaded stan program due to this change. What does change is how threaded applications use Stan. These threaded application are now forced to call
stan::math::ChainableStack::instantiate() for every new thread for which they want to use
stan-math. So while the old design always initiaized an AD tape for any new thread this is now not anymore the case and the AD tape is only initialized if the application does that.
To me that does make sense as I find it a strong call to force any application which links against
stan-math to have an AD tape ready for very thread they create. Not every thread can be expected to do AD work with
I am planning to run one more benchmark which uses the parallel
map_rect to hopefully see there that the new approach is just as beneficial. Comments are very welcome to this proposal. As we have burned ourselves in the past with different AD tape variations it would be great to ensure that all compilers we want to catch have been tested. On the current test pipeline everything passes which should imply all is good.