Example of use of AutodiffStackSingleton (STAN_THREADS-related)

This is a question for @wds15. I’m trying to get things working with the new thread local storage code. I’m hitting a segfault right now:

Thread 6 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffebcaa700 (LWP 27308)]
0x00007fffea1fc768 in stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::init () at ~/work/stan/httpstan/httpstan/lib/stan/lib/stan_math/stan/math/rev/core/autodiffstackstorage.hpp:122
122         if (!instance_) {
(gdb)

I’m assuming this is because I have failed to instantiate AutodiffStackSingleton as the documentation suggests.

Is there an example of how to instantiate this in a child thread? I didn’t immediately see a test which does this.

I think I found it. I gather I need to have a line like this:

stan::math::ChainableStack thread_instance;

Still getting the segfault but I’ll keep working on it. There are some odd interactions with Cython.

Yes, this is what you need - an instance of chainsblestack needs to be around. In fact the first instance created during the life time of the thread will own the ad tape resources of the thread.

I have some more details. The segfault seems to be occurring before I have a chance to do anything. Specifically, it occurs when this line is encountered on line 14 of stan_math/stan/math/rev/core/init_chainablestack.hpp

ChainableStack global_stack_instance_init;

I’m not sure how a segfault here is possible. Any ideas on how to debug this?

This is odd…you could make the respective define to avoid the init and make the init when appropriate. Should that help, then the link order could be an issue, but I doubt that.

Perhaps you could help me understand how this works at a higher level. Why isn’t instance_ defined in a child thread (i.e., how does this happen in C++)? Is there some way to, instead of segfaulting, have an orderly exit of the program with an error message?

The thread specific ad tape is stored in a thread specific pointer. For performance reasons that pointer is initialized to a compile time constant expression which must be the nullptr. So at the time point you create a thread any access to the ad tape will cause a segfault. Since every stan program is very sensitive to anything you put around the ad tape access, we cannot add here any checks which would allow a graceful exit. For the c++ programmer this means that he must be aware of that and whenever he creates a new thread, then he must immediately also create within that thread an instance of chainablestack. Doing so will initialize the ad tape by creating a new instance and setting the thread specific pointer to point to this instance. The ad tape is then owned by that chainablestack instance.

Things are a little different for the main thread. For the main thread we have in the unnamed namespace placed an instance of chainablestack. That one is intended to initialize the main thread ad tape.

From what you describe you could have an issue with concurrent creation of the global ad tape of the main process - which is something which should not happen to my understanding if the program is linked correctly.

I hope this helps you to debug httpstan.

1 Like

Here’s where I am right now.

Version STAN_THREADS Works?
2.19 No Yes
2.19 Yes Yes
2.20 No Yes
2.20 Yes No
2.20 with commit f39a325 reverted Yes Yes

Here “works” needs to be interpreted loosely as in “does not immediately segfault”.

I could use any suggestions you might have here. I can’t change how Python links things, so if that is indeed the source of the problem I will not be able to fix it.

From the gdb output it looks like there is some sort of static init order problem. I’m tempted to try to add a bit of indirection in the initialization (the ChainableStack global init) as suggested here: https://stackoverflow.com/a/34411563

Here’s my first attempt at implementing the suggested stackoverflow fix. I could use some help interpreting the compiler error message. This is init_chainablestack.hpp now:

#ifndef STAN_MATH_REV_CORE_INIT_CHAINABLESTACK_HPP
#define STAN_MATH_REV_CORE_INIT_CHAINABLESTACK_HPP

#include <stan/math/rev/core/chainablestack.hpp>

namespace stan {
namespace math {
/* Note use of unnamed namespace */
namespace {
/**
 * Initializes the AD stack for the main process. See
 * autodiffstackstorage.hpp for more explanations.
 *
 * Uses a function to avoid static init order issues.
 * See https://stackoverflow.com/a/34411563
 */

ChainableStack& GetStaticChainableStack() {
  ChainableStack instance;
  return instance;
}

ChainableStack global_stack_instance_init = GetStaticChainableStack();
}  // namespace
}  // namespace math
}  // namespace stan
#endif

The compiler complains:

init_chainablestack.hpp:23:69: error: no matching function for call to ‘stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::AutodiffStackSingleton(stan::math::ChainableStack&)’
 ChainableStack global_stack_instance_init = GetStaticChainableStack();

and

autodiffstackstorage.hpp:93:3: note: candidate: stan::math::AutodiffStackSingleton<ChainableT, ChainableAllocT>::AutodiffStackSingleton() [with ChainableT = stan::math::vari; ChainableAllocT = stan::math::chainable_alloc]
   AutodiffStackSingleton() : own_instance_(init()) {}
   ^~~~~~~~~~~~~~~~~~~~~~
.../autodiffstackstorage.hpp:93:3: note:   candidate expects 0 arguments, 1 provided

The error message is confusing. I don’t see any call to a constructor with one argument.

I think what you do would require is a copy constructor which we have removed from the respective class…you would have to enable that one again to make this approach work.

If you suspect issues with the static order of initialization, then I would explicitly define during compilation the compiler flag ‘-DSTAN_MATH_REV_CORE_INIT_CHAINABLESTACK_HPP’. This way the global init Does not take place and you can place the instantiation of chainablestack in your worker threads wherever you want (making sure this happens before declaration of any var).

Is there code one can look at which is causing the trouble?

Edit: what is confusing to me is that when you do not define stan threads in 2.20 then things work for you. In that case the global init is just working fine, since this is what does the initialization of the ad tape of the main process. I am not sure what is going on here, sorry for the hickup…but its worth as we gain 20% speed here and this should be solvable…

1 Like

(Yes, I am very confused about why things work without STAN_THREADS.)

Good news. ‘-DSTAN_MATH_REV_CORE_INIT_CHAINABLESTACK_HPP’ works in the sense that there’s no immediate segfault. I can link everything and make non-sampling-related calls to things in the “stan_model.hpp” (like the function which returns parameter names).

I’m going to see if I can’t put in the instantiation of ChainableStack before I use AD-touching stuff.

One question, for now: What do you mean by “before declaration of any var”. Don’t you mean, before anything uses the AD Singleton?

You must do the initialization of the ad tape before you can touch anything reverse mode related of stan. So yes, the chainablstack must be instantiated before anything tries to access the ad tape. It’s not really a singleton…its more of a thread specific instance. Usually this means that there must be no stan math var variable declared, but anything else accessing the ad tape will break things as well.

Is there perhaps a very small change one could make which would use more of a RAII-style approach? I’m not suggesting this is something stan math would consider. I’m just trying to understand the options.

For example, here’s where things segfault (as they should) if I fail to instantiate a ChainableStack object (in vari.hpp):

126         return ChainableStack::instance_->memalloc_.alloc(nbytes);

Could one have a function call here instead which would do the instantiation if the user hasn’t done it already? Or is this precisely what you replaced for performance reasons?

Thanks. This is really helpful. I’ve done some further experimentation. It looks like PyStan and httpstan, because they wrap the C++ code (mainly stan::services functions) using Cython, don’t have the necessary control over when things appear in the generated C++ code.

I’m willing to maintain a very small patch to stan-math that changes TLS so PyStan and httpstan can work with it, even if it comes at a significant performance hit. We definitely want to use threads to do sampling in parallel. Is simply reverting the TLS v6 commit the easiest way to do this? Or is there perhaps something more clever?

Yes, avoiding anything being in the way here is what I did to make things fast.

Instead of patching stan math you should consider patching stan or whatever manages your threads. So you could stick a chainablestack instance into the model class, for example. What would be better is to have control over the thread creation and then place just after having created the thread inside the thread an instance of chainablestack.

I mean, you should have control over what is being executed within a thread and the first thing to do there is to instantiate a chainablestck.

Before you revert the tls change, I can take a look if you want. You would loose a lot of performance and the old tls stuff does not work with windows under the gcc 4.9.3 compiler while the new approach is faster and works with that setup under windows.

This all sounds good. (Although I think a httpstan-specific patch to stan math might be the easiest fix for 2.20)

We could also explore what is going on with the segfault. If I try to instansiate ChainableStack (with -DSTAN_MATH_REV_CORE_INIT_CHAINABLESTACK_HPP), I hit another segfault in the same place, autodiffstackstorage.hpp:122

122         if (!instance_) {

Is there anything further that can be done here? If it is a static init order “fiasco”, are any of the tips here useful? https://isocpp.org/wiki/faq/ctors#static-init-order

You should patch stan and not stan math if possible. Should,you really want to patch stan math then there are better options, i think.

Thinking about the static init order problem…then we could be hit by this here in a way i haven’t thought about… I am at a computer next week again to try it, but it would really help me if you can prepare some code which runs against the wall so that i can debug.

So what is currently happening is that the nullptr assignment is not done properly…but we need that to tell if the pointer is set. Instead we could go with a ref count static variable which is static thread local to the init function which we initialize to 0 and increment with each call of init. The initialization then only happens when that counter is 0. This will then allow you to make an if wrt to the ref count and not the pointer. Makes sense?

Could you try this, hopefully, fixed version:

I am not sure if the declaration at the very end is needed…I don’t think so, but as I am not on a computer I don’t know if it compiles or not…but you should get the idea of what is going on here. I really hope that this approach does still work on g++ 4.9.3 on windows… fingers crossed.

Let me know if that solves your trouble.

This looks promising. Thanks!

I made one obvious syntax fix: added bool to line 122, so static STAN_THREADS_DEF bool is_initialized = false;

Not able to completely compile though, getting this error:

undefined symbol: _ZN4stan4math22AutodiffStackSingletonINS0_4variENS0_15chainable_allocEE9instance_E

If I run nm model_2239cefc2a.cpython-36m-x86_64-linux-gnu.so | c++filt I get this (trimmed):

0000000000057630 W stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::AutodiffStackStorage::AutodiffStackStorage()
0000000000057630 W stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::AutodiffStackStorage::AutodiffStackStorage()
00000000000578b0 W stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::init()
                 U stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::instance_
000000000004b1d0 W stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::~AutodiffStackSingleton()
000000000004b1d0 W stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>::~AutodiffStackSingleton()

edit add nm details

Looks like we still need the declaration…have a look at the updated version which I am running through our test system to see if unit tests still pass.

Edit: now things seem to work as tests pass (even on windows)

https://jenkins.mc-stan.org/blue/organizations/jenkins/Math%20Pipeline/detail/PR-1302/3/pipeline