The nesting is for nested autodiff. That means we start a new stack but reuse the same underlying memory for it. It's used in the ODE solver to compute Jacobians in a nested fashion. When the nested computation finishes, everything it allocated is popped off the stack and freed, and we return to the enclosing level of autodiff. That's described in the paper, but maybe not well, because I didn't want to dive into crazy low-level details and obscure the bigger picture.
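To make the mark/release idea concrete, here's a minimal sketch (not the actual implementation; `toy_arena`, `start_nested`, and `recover_nested` are hypothetical names): starting a nested level just records the current top of the shared stack, and recovering that level resets the top back to the mark, freeing everything the nested computation allocated in one shot.

```cpp
#include <cstddef>
#include <vector>

// Toy bump allocator with nesting; no overflow checks in this sketch.
class toy_arena {
  std::vector<char> block_;         // one shared slab of memory
  std::size_t top_ = 0;             // current allocation offset
  std::vector<std::size_t> marks_;  // saved tops, one per nesting level
 public:
  explicit toy_arena(std::size_t bytes) : block_(bytes) {}
  void* alloc(std::size_t n) {      // bump allocation; no per-variable free
    void* p = block_.data() + top_;
    top_ += n;
    return p;
  }
  void start_nested() { marks_.push_back(top_); }
  void recover_nested() {           // frees the whole nested level at once
    top_ = marks_.back();
    marks_.pop_back();
  }
};
```

The point is that cleaning up a nested Jacobian computation costs nothing: no per-variable destructors, just resetting an offset.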
And yes, the second stack is for variables that don't get autodiffed, meaning their derivative-propagation method is never called in the reverse pass. We use that in matrix operations to reduce the number of virtual function calls. I believe that was also discussed, though again probably too vaguely, in the autodiff paper.
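Here's a hedged sketch of the two-stack layout (the names are illustrative, not necessarily the real ones): only variables on the main stack get a virtual `chain()` call during the reverse pass, so a matrix operation can park its N elementwise variables on the second stack and register one summary variable that propagates adjoints for the whole operation, turning N virtual calls into one.

```cpp
#include <vector>

// Illustrative node type: chain() is the virtual derivative-propagation hook.
struct toy_vari {
  double val_;
  double adj_ = 0;
  explicit toy_vari(double v) : val_(v) {}
  virtual void chain() {}
  virtual ~toy_vari() = default;
};

std::vector<toy_vari*> var_stack_;          // chain() is called on these
std::vector<toy_vari*> var_nochain_stack_;  // kept alive, but chain() is skipped
```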
std::vector objects work as usual and encapsulate their own mallocs, following the RAII pattern (Eigen matrices work the same way). Those vectors keep track of the variables so we know how to work back through the stack. I'm not actually sure we still need to keep that bookkeeping.
var_stack_ is what gets traversed during derivative propagation (the reverse pass) of autodiff.
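Concretely (reusing the toy types from the sketch above), the reverse pass just seeds the result's adjoint and walks the stack back to front; last-pushed-first is a valid reverse topological order because nodes were pushed in the order they were created during the forward pass.

```cpp
// Toy reverse pass over var_stack_.
void toy_grad(toy_vari* result) {
  result->adj_ = 1;  // d(result)/d(result) = 1
  for (auto it = var_stack_.rbegin(); it != var_stack_.rend(); ++it)
    (*it)->chain();  // each node adds its contribution to its operands' adjoints
}
```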
The sizing of everything is complicated, because we use an increasing sequence of underlying memory blocks rather than copying everything into a bigger array. That might not have been the best choice, but when we profiled, it didn't add measurable overhead at run time, because it's the stacks that get traversed, not the arrays directly. And we already blow memory locality because there's no way to preserve it in an expression graph.
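Here's a minimal sketch of the growing-block scheme (again with illustrative names, not the real implementation): when the current block fills up, we chain on a new, larger block and leave the old ones where they are, so no live pointer is ever invalidated by a copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy arena built from an increasing sequence of blocks.
class toy_block_alloc {
  std::vector<char*> blocks_;  // all blocks allocated so far
  std::size_t cur_size_;       // size of the newest block
  std::size_t used_ = 0;       // bytes used in the newest block
 public:
  explicit toy_block_alloc(std::size_t first = 1 << 16) : cur_size_(first) {
    blocks_.push_back(new char[cur_size_]);
  }
  void* alloc(std::size_t n) {
    if (used_ + n > cur_size_) {  // newest block full: add a bigger one
      cur_size_ = std::max(2 * cur_size_, n);
      blocks_.push_back(new char[cur_size_]);
      used_ = 0;
    }
    void* p = blocks_.back() + used_;
    used_ += n;
    return p;
  }
  ~toy_block_alloc() {
    for (char* b : blocks_)
      delete[] b;
  }
};
```

The tail of a just-filled block gets wasted, but that's the price of never moving anything.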
You can see why this all needs better doc!