Question about AutodiffStackStorage.hpp in Stan Math library

kthompson · August 22, 2020, 5:21pm

I am a noob at Stan, but I was curious enough to investigate the design of the overall language, starting with the math library. My current question relates to the following text from the documentation (linked here so that you don't have to do a bunch of clicking: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp):

The use of a pointer is motivated by performance reasons for the threading case. When a TLS is used, initialization with a constant expression at compile time is required for fast access to the TLS. As the autodiff storage struct is non-POD, its initialization is a dynamic expression at compile time. These dynamic expressions are wrapped, in the TLS case, by a TLS wrapper function which slows down its access. Using a pointer instead allows to initialize at compile time to nullptr, which is a compile time constant. In this case, the compiler avoids the use of a TLS wrapper function.

So the part where it says that, because AutodiffStackStorage is non-POD, its initialization is a dynamic expression at compile time makes intuitive sense. I'm not sure that I understand the pointer workaround, though, or why it works. I'm kind of a noob at this, so I was wondering if someone would be willing to explain. Perhaps I missed another part of the documentation where it is explained in greater detail. If so, please forgive my carelessness.

bgoodri · August 22, 2020, 5:27pm

@wds is basically the only one who understands it in detail

kthompson · August 22, 2020, 5:28pm

Thank you!

wds15 · August 23, 2020, 8:46am

What is your motivation in digging in?

Understanding why we ended up with the design we have now required me to use godbolt and look at assembly code.

In short, global TLS variables that are non-POD are, by default, accessed through some nasty wrapper code. The wrapper checks whether the variable has already been initialized and then either returns the existing copy or creates one, but this check is done on every access. Using a pointer, which is POD, avoids all of that at the cost of having to initialize things properly and in the right order.

kthompson · August 24, 2020, 5:09pm

Thank you for your response! It’s greatly appreciated.

My motivation for digging in is a deeply unhealthy obsession with the minor details of everything I use. It probably comes from years of having algorithms break unexpectedly back when I was an undergraduate research assistant (and then having to explain to profs why I couldn't meet deadlines because I couldn't figure out why certain algorithms were failing or not converging), plus an insatiable curiosity. I get sucked into pretty much every rabbit hole I come across, to the great annoyance of the people around me.

rok_cesnovar · August 24, 2020, 5:13pm

It essentially all came down to benchmarking. This particular change was the result of at least 6 pull requests (https://github.com/stan-dev/math/pulls?q=is%3Apr+parallel+is%3Aclosed+faster+ad+tls, see AD tls v1-v6) in which we went back and forth to find the most efficient solution.

kthompson · August 24, 2020, 5:30pm

I wish I could mark two posts as solutions, but I can't, so instead I will thank you both for your input. I will also try out godbolt because it seems really cool.

Bob_Carpenter · September 2, 2020, 6:04pm

He's not the only one. We try very hard to make sure there's nothing in the math library that only one person understands. @rok_cesnovar was heavily involved in this, and @stevebronder and @bbbales2 have been heavily involved in our autodiff refactoring. I was the one who built all this stuff in the first place, and I still understand what we're doing now pretty well.

To get background for this, I’d recommend:

our arXiv paper on autodiff in Stan,
some tutorials on initialization order in C++ and what is/isn’t guaranteed, and
some tutorials on singleton patterns and how to use that initialization order and protect your code in different situations.

Sadly, the language is undefined at the edges, so much of the defensive programming is to deal with undefined behavior in the language spec.

We need thread-local storage because our autodiff stack for reverse mode is a global static variable. Declaring the global autodiff stack storage variable as thread local makes a separate copy in each thread rather than sharing a single static variable across threads. As others have pointed out, the pointer is there so we can be lazy in initializing (in the technical sense, not the slacker sense; this was a lot of work!).

Related topics

- Example of use of AutodiffStackSingleton (STAN_THREADS-related) (Developers): 30 replies, 1139 views, August 13, 2019
- Neat autodiff C++ package Enoki (Developers, math): 3 replies, 1150 views, September 9, 2019
- Thread performance penalty (Developers): 10 replies, 2218 views, January 18, 2019
- Preconditions and postconditions of functions in the Math library (Developers): 5 replies, 577 views, December 6, 2019
- Wrapping calls to ChainableStack (Developers, maintenance): 4 replies, 755 views, April 4, 2018