I am a noob at Stan, but I was curious enough to investigate the design of the overall language, starting with the math library. My current question relates to the following text from the documentation (link so that you don’t have to do a bunch of clicking: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp):
- The use of a pointer is motivated by performance reasons for the
- threading case. When a TLS is used, initialization with a constant
- expression at compile time is required for fast access to the
- TLS. As the autodiff storage struct is non-POD, its initialization
- is a dynamic expression at compile time. These dynamic expressions
- are wrapped, in the TLS case, by a TLS wrapper function which slows
- down its access. Using a pointer instead allows to initialize at
- compile time to nullptr, which is a compile time
- constant. In this case, the compiler avoids the use of a TLS
- wrapper function.
So the part where it says that, because the autodiff storage struct is non-POD, its initialization is a dynamic expression makes intuitive sense. I’m not sure that I understand the pointer workaround, though, or why it works. I’m kind of a noob at this, so I was wondering if someone would be willing to explain. Perhaps I missed another part of the documentation where it was explained in greater detail. If so, please forgive my carelessness.
@wds is basically the only one who understands it in detail
What is your motivation in digging in?
Understanding why we ended up with the design we have now required me to use godbolt and look at the assembler code.
In short, global thread-local (TLS) variables that are non-POD are accessed through some nasty wrapper code by default. The wrapper checks whether the variable has already been initialized and then either returns the existing copy or creates one… but this check is done on every access. Using a pointer, which is a POD, avoids all of that, at the cost of having to initialize things explicitly and in the right order.
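To make the difference concrete, here is a minimal sketch of the two variants. It is not the Stan code: `Storage`, `direct_storage`, `storage_ptr`, `init_storage`, `free_storage`, `push_direct`, and `push_via_pointer` are all hypothetical names made up for illustration. The assumption is a compiler following the usual ABI conventions (for example GCC or Clang on Linux), where a dynamically initialized `thread_local` object is typically reached through a generated wrapper, while a constant-initialized `thread_local` pointer is not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the autodiff storage struct. It is non-POD:
// it has a user-provided constructor and a std::vector member.
struct Storage {
  std::vector<double> values;
  Storage() { values.reserve(16); }
};

// Variant A: a non-POD thread_local object. Its initialization is dynamic,
// so the compiler has to make sure the constructor has run in the current
// thread before the object is used.
thread_local Storage direct_storage;

// Variant B: a thread_local pointer. nullptr is a constant expression, so
// no dynamic initialization (and no initialization check) is needed.
thread_local Storage* storage_ptr = nullptr;

// The price of variant B: each thread has to set up its storage explicitly
// before the first use, and tear it down again afterwards.
inline void init_storage() {
  if (storage_ptr == nullptr) {
    storage_ptr = new Storage();
  }
}

inline void free_storage() {
  delete storage_ptr;
  storage_ptr = nullptr;
}

// Access through the object itself (variant A).
inline std::size_t push_direct(double x) {
  direct_storage.values.push_back(x);
  return direct_storage.values.size();
}

// Access through the pointer (variant B). The assert only documents the
// precondition that init_storage() was called in this thread.
inline std::size_t push_via_pointer(double x) {
  assert(storage_ptr != nullptr);
  storage_ptr->values.push_back(x);
  return storage_ptr->values.size();
}
```

If you paste something like this into godbolt, put the accesses in a different translation unit from the definitions (or just declare the variables `extern`) so the compiler cannot see the initializer; that is where the wrapper call for variant A tends to show up in the assembler output.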
Thank you for your response! It’s greatly appreciated.
My motivation for digging in is a deeply unhealthy obsession with the minor details of everything I use, plus an insatiable level of curiosity. The obsession probably comes from years of having algorithms break unexpectedly back when I was an undergraduate research assistant (and then having to explain to profs why I was unable to meet deadlines because I couldn’t figure out why certain algorithms were failing or weren’t converging). I get sucked into pretty much every rabbit hole I come across, to the great annoyance of people around me.
It essentially all came down to benchmarking. This particular change was a result of at least 6 pull requests (https://github.com/stan-dev/math/pulls?q=is%3Apr+parallel+is%3Aclosed+faster+ad+tls see AD tls v1-v6) where we went back and forth to find a solution that was most efficient.
I wish I could mark two posts as being solutions, but I can’t so I will instead thank you both for your input. I will also try out godbolt because it seems really cool.
He’s not the only one. We try very hard to make sure there’s nothing in the math library that only one person understands. @rok_cesnovar was heavily involved in this, and @stevebronder and @bbbales2 have been heavily involved in our autodiff refactoring. I was the one who built all this stuff in the first place, and I still understand what we’re doing now pretty well.
To get background for this, I’d recommend:
- our arXiv paper on autodiff in Stan,
- some tutorials on initialization order in C++ and what is/isn’t guaranteed, and
- some tutorials on singleton patterns and how to use that initialization order and protect your code in different situations.
Sadly, the language is undefined at the edges, so much of the defensive programming is to deal with undefined behavior in the language spec.
We need thread-local storage because our autodiff stack for reverse mode is a global static variable. Declaring the global autodiff stack storage variable as thread local makes a copy in each thread rather than sharing a single static variable across threads. As others have pointed out, the pointer is so we can be lazy in initializing (in the technical sense, not the slacker sense; this was a lot of work!).
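As a rough illustration of the "one copy per thread" point, here is a small sketch. Again, this is not how the Stan math library is written: `Tape`, `tape`, `TapeScope`, and `record_on_worker` are hypothetical names, and the RAII guard is just one assumed way of doing the per-thread setup that the pointer approach requires.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical tape type standing in for the reverse-mode autodiff stack.
struct Tape {
  std::vector<double> entries;
};

// One pointer per thread, constant-initialized to nullptr.
thread_local Tape* tape = nullptr;

// Hypothetical RAII guard: creates the calling thread's tape if it does not
// exist yet, and destroys it again only if this guard created it.
class TapeScope {
 public:
  TapeScope() {
    if (tape == nullptr) {
      tape = new Tape();
      owns_ = true;
    }
  }
  ~TapeScope() {
    if (owns_) {
      delete tape;
      tape = nullptr;
    }
  }
  TapeScope(const TapeScope&) = delete;
  TapeScope& operator=(const TapeScope&) = delete;

 private:
  bool owns_ = false;
};

// Records n entries on a worker thread and returns the size of that
// thread's tape. The caller's tape is never touched.
inline std::size_t record_on_worker(std::size_t n) {
  std::size_t size = 0;
  std::thread worker([n, &size] {
    assert(tape == nullptr);  // a new thread starts with its own null pointer
    TapeScope scope;          // set up this thread's tape
    for (std::size_t i = 0; i < n; ++i) {
      tape->entries.push_back(1.0);
    }
    size = tape->entries.size();
  });
  worker.join();
  return size;
}
```

The thing to notice is that the worker sees a null `tape` even when the calling thread has already set up its own, and that entries recorded on the worker never appear on the caller's tape. Whether the setup is done by a guard like this or by an explicit call is a design choice; the sketch only shows that something has to do it once per thread.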