Autodiff memory storage arena

Hey,

How does the autodiff storage arena work? I see maybe two logically separate (but probably in the same underlying chunk of raw memory?) arenas:

      static std::vector<ChainableT*> var_stack_;
      static std::vector<ChainableT*> var_nochain_stack_;
      static std::vector<ChainableAllocT*> var_alloc_stack_;
      static stack_alloc memalloc_;

      // nested positions
      static std::vector<size_t> nested_var_stack_sizes_;
      static std::vector<size_t> nested_var_nochain_stack_sizes_;
      static std::vector<size_t> nested_var_alloc_stack_starts_;

I’d guess the first vector is used to figure out what to call chain() on, so the second makes sense as a place for things that shouldn’t have chain() called on them. What is the var_alloc_stack_ for? And why are there nested stack sizes and starts?

IIRC this level of detail wasn’t mentioned in the autodiff paper, but I might be misremembering - feel free to point me to some other docs.

Thanks,
Sean

The nesting is for nested autodiff. That means we start a new stack, but just use the same memory for it. It’s used in the ODE solver to compute Jacobians in a nested fashion. Then it’s all popped off the stack and freed to go back to the enclosing level of autodiff. That’s described in the paper, but maybe not well, because I didn’t want to dive into crazy low-level details and obscure the bigger picture.
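
For concreteness, here’s roughly how that gets used, as a minimal sketch against the start_nested()/recover_memory_nested() functions in stan/math/rev/core (error handling omitted; the real gradient() functional wraps this in a try/catch so the nested stack is recovered even on exceptions):

    #include <stan/math.hpp>

    // minimal sketch of nested reverse-mode autodiff: the nested vars
    // are recorded on the same memory, then popped off so the outer
    // stack continues where it left off
    double square_derivative(double x0) {
      stan::math::start_nested();           // remember current stack sizes
      stan::math::var x = x0;               // recorded on the nested segment
      stan::math::var y = x * x;
      y.grad();                             // reverse sweep for the nested graph
      double dydx = x.adj();                // dy/dx = 2 * x0
      stan::math::recover_memory_nested();  // pop the nested segment
      return dydx;
    }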

And yes, the second stack is for variables that don’t get chain() called on them directly. We use that in matrix operations to reduce the number of virtual function calls. I believe that was also discussed, though again probably too vaguely, in the autodiff paper.
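
To make that concrete: a vari can opt out of the chain() stack through its two-argument constructor, which is how the multivariate operations stage things. This is a condensed sketch; the real implementations live in the rev matrix code, and the identity Jacobian here is just to keep it short:

    #include <stan/math.hpp>

    // sketch: the N outputs of a multivariate op are built with
    // vari(val, false), which records them on var_nochain_stack_
    // (arena-allocated, but skipped by the reverse sweep); one stacked
    // vari then propagates all their adjoints in a single chain() call
    struct multi_output_vari : public stan::math::vari {
      stan::math::vari** inputs_;   // operands
      stan::math::vari** outputs_;  // results, on var_nochain_stack_
      int N_;
      multi_output_vari(stan::math::vari** in, stan::math::vari** out, int N)
          : stan::math::vari(0.0), inputs_(in), outputs_(out), N_(N) {}
      void chain() {
        for (int i = 0; i < N_; ++i)              // one virtual call, not N
          inputs_[i]->adj_ += outputs_[i]->adj_;  // identity Jacobian for brevity
      }
    };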

Those std::vector objects work as usual and encapsulate their own mallocs to follow the RAII pattern (Eigen matrices work the same way). They keep track of the variables so we know how to work back through the expression graph. I’m not actually sure we need to keep that var_nochain_stack_. The var_stack_ is what gets traversed in the derivative propagation (the reverse step) of autodiff.
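
The reverse sweep itself is just a backwards walk over that vector; condensed, what stan/math/rev/core/grad.hpp does amounts to:

    #include <stan/math/rev/core.hpp>
    #include <vector>

    // condensed sketch of the reverse pass: seed the output's adjoint,
    // then call chain() on every stacked vari in reverse order of
    // construction; var_nochain_stack_ is never visited in this loop
    void grad_sketch(stan::math::vari* dependent,
                     std::vector<stan::math::vari*>& var_stack) {
      dependent->init_dependent();  // adjoint of the output set to 1
      for (auto it = var_stack.rbegin(); it != var_stack.rend(); ++it)
        (*it)->chain();
    }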

The sizing of everything’s complicated, because we use an increasing sequence of underlying memory blocks rather than copying everything into a bigger array. That might not have been the best choice, but when we profiled, it didn’t add run-time overhead, because it’s the pointer stacks that get traversed in the reverse pass, not the underlying blocks directly. And we already blow memory locality because there’s no way to preserve it in an expression graph.
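
To make the growth scheme concrete, here’s the shape of the allocator; this is a heavily simplified sketch of stan/math/memory/stack_alloc.hpp with alignment and error handling omitted:

    #include <algorithm>
    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // simplified sketch of the block-growing arena: blocks are never
    // copied or moved; when one fills up, a bigger one is appended,
    // and "freeing" just resets the bump pointer to the first block
    class stack_alloc_sketch {
      std::vector<char*> blocks_;   // increasing sequence of raw blocks
      std::vector<size_t> sizes_;
      size_t cur_block_ = 0;
      char* next_ = nullptr;        // bump pointer within current block

     public:
      explicit stack_alloc_sketch(size_t initial = 1 << 16) {
        blocks_.push_back(static_cast<char*>(std::malloc(initial)));
        sizes_.push_back(initial);
        next_ = blocks_[0];
      }
      ~stack_alloc_sketch() {
        for (char* b : blocks_) std::free(b);
      }
      void* alloc(size_t n) {
        while (next_ + n > blocks_[cur_block_] + sizes_[cur_block_]) {
          if (cur_block_ + 1 == blocks_.size()) {  // out of blocks: grow
            size_t bigger = std::max(n, 2 * sizes_.back());
            blocks_.push_back(static_cast<char*>(std::malloc(bigger)));
            sizes_.push_back(bigger);
          }
          ++cur_block_;
          next_ = blocks_[cur_block_];
        }
        void* result = next_;
        next_ += n;
        return result;
      }
      void recover_all() {  // nothing is destroyed, memory is reused
        cur_block_ = 0;
        next_ = blocks_[0];
      }
    };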

You can see why this all needs better doc!


I think this is a basic C++ question but I’m not sure how to google it. What are these lines doing after the struct declaration?

    template<typename ChainableT, typename ChainableAllocT>
    std::vector<ChainableT*>
    AutodiffStackStorage<ChainableT, ChainableAllocT>::var_nochain_stack_;

Whole file here: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/autodiffstackstorage.hpp

Those are the out-of-class definitions of the struct’s static data members. The declarations inside the class template don’t allocate any storage; each static member needs a definition at namespace scope, and since the class is a template, each definition is itself a template.

For related background on templates and instantiation, there’s this Stack Overflow answer: http://stackoverflow.com/questions/2351148/explicit-instantiation-when-is-it-used

It should also be in the C++ Templates book in the office.
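
A stripped-down example of the same mechanism, with hypothetical names just to isolate it:

    #include <vector>

    // declaration inside the class template: no storage allocated here
    template <typename T>
    struct Storage {
      static std::vector<T*> stack_;
    };

    // out-of-class definition: this is what those lines in the header
    // are; the compiler materializes one stack_ per combination of
    // template arguments that actually gets used
    template <typename T>
    std::vector<T*> Storage<T>::stack_;

    int main() {
      Storage<int>::stack_.push_back(nullptr);  // instantiates Storage<int>::stack_
      return Storage<int>::stack_.size() == 1 ? 0 : 1;
    }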

I’m not sure if @seantalts’s question was answered, but what is the var_alloc_stack_ for? I’m having trouble understanding the files: https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/chainable_alloc.hpp
https://github.com/stan-dev/math/blob/develop/stan/math/rev/core/chainablestack.hpp


Rob and Sean and I talked in person and there’s a new issue assigned to me to go doc it so everyone else can figure it out. Short answer:

  1. stack of variables to call chain() on
  2. stack of variables that don’t have chain() called
  3. stack of variables that need their destructors called

Case (3) is necessary so that we can use things like LDLT member variables in the vari classes; the arena frees its memory without calling destructors, so these objects have to be tracked and destroyed explicitly. Case (2) is for multivariate outputs where all the work is put in one element’s chain() method to reduce virtual function calls.
Case (1) is the usual case.
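
For case (3), the usage pattern looks roughly like this; it’s a sketch against chainable_alloc from stan/math/rev/core/chainable_alloc.hpp, and the class name is hypothetical:

    #include <stan/math.hpp>
    #include <Eigen/Dense>

    // deriving from chainable_alloc arena-allocates the object and
    // records it on var_alloc_stack_, so its destructor is called at
    // recover_memory(); plain arena memory is reclaimed without any
    // destructor calls, which would leak the LDLT's internal storage
    class ldlt_workspace : public stan::math::chainable_alloc {
     public:
      Eigen::LDLT<Eigen::MatrixXd> ldlt_;  // owns heap memory of its own
      explicit ldlt_workspace(const Eigen::MatrixXd& A)
          : chainable_alloc(), ldlt_(A) {}
    };

    // inside a vari implementation, usage would be:
    //   auto* w = new ldlt_workspace(A);  // destroyed at recover_memory()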

@Bob_Carpenter was this ever doc’d?

Don’t think so, but it’s probably changing soon anyway with Sebastian’s work on making it compatible with threading.

That’s where I’m apparently not being understood well yet… I do not intend to touch the AD memory stuff in a deep way. Everything stays the same; the only change is that independent parts of the AD tape get written in parallel during the forward sweep of reverse mode. Once the independent parts are done, we end up with the old data structure, which is exactly the same as in the good old times…

Weren’t you also thinking of redoing the nested autodiff? Or is that phase 2?

OK, yes, nested is going to change… but nested is barely used in the code base. We use it only for the ODE and algebra_solver things. This is not phase 2 - nesting has to change to make things work.

Only vaguely in the code.

It’s used in all of our functionals, meaning in every step of HMC in Stan where we compute gradients. It’s a trivial use there, but the gradient() functional operates nested so that it’s robust.
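
For reference, this is the nested use in question, a small example against the gradient() functional in stan/math (the target function here is arbitrary):

    #include <stan/math.hpp>
    #include <Eigen/Dense>
    #include <iostream>

    int main() {
      // example target f(x) = x'x; gradient() runs it nested, so the
      // tape it writes is recovered even if f throws
      auto f = [](const auto& x) { return stan::math::dot_self(x); };
      Eigen::VectorXd x(2);
      x << 1.0, 2.0;
      double fx;
      Eigen::VectorXd grad_fx;
      stan::math::gradient(f, x, fx, grad_fx);  // grad_fx == 2 * x
      std::cout << fx << " : " << grad_fx.transpose() << "\n";
    }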