(older) parallel AD tape ideas

Hi @andrjohns, @stevebronder, @bbbales2 !

As discussed yesterday, here is some material as to what I promised.

Old parallel AD tape ideas of mine are in an old branch of mine on stan-math. Essentially I had in mind to define what I referred to as “scoped chainable stack”, defined here:

The idea of that thing is that you can execute any function under AD “monitoring” (aka recording to a tape stuff while executing) with an object of this and the ChainableStack going with it will always be stored in that object (regardless of the thread this thing runs). This works by activating / deactivating in a given thread these ChainableStacks. See for the activate/deactivate here

Doing so allows one to run the forward pass in parallel while the reverse pass has been designed to run serially (as the sub tapes are copied to the main tape). In fact, the implementation above creates an AD tape per thread and once the parallel region is processed, the sub AD tapes are copied to the main (parent) AD tape. This copying can be avoided by inserting a special node into the parent AD tape which stores the child tapes and runs upon a grad call the grad call on the childs. I did avoid that at the time in order to escape the synchronization issue with the upstream operands upon calling grad (the adjoints of the input operands would have to be able to receive concurrent additions to their adjoint).

Maybe the branch can be useful for our goal to avoid too many copies of vars when going parallel in order to remove memory overhead.

The other thing we talked about was memory bottlenecks in the current brms use of reduce_sum. I created an issue for how one can remedy them by using slicing, see here:

In order to see how things behave without memory issues, just remove the intercept random effect. To see that there are basically no memory related slowdowns, replace the normal endpoint wit a Poisson endpoint (the issue is a slight variation of the brms threading vignette which uses Poisson, see https://cran.r-project.org/web/packages/brms/vignettes/brms_threading.html )

As always with this, should we embark on refactoring parallel AD along the lines above, I think that we should first hack up a prototype to see its benefits. Basically models which are right now memory bound should benefit from it. These are normal endpoints or bernoulli endpoints (use of normal_lpdf or bernoulli_logit). Thus, the task 0 is to define Stan models which we want to run faster in parallel like the ones mentioned. Then we try to refactor in a way to avoid memory limitations we have now, and then we hopefully have an even faster parallel AD.

That’s it for now.

Sebastian

3 Likes

Thanks Sebastian! Will add it to the reading list

I actually wrote a (lengthy) design doc on these ideas:

Go to the “Step 2: Independent AD” section. I hope that makes sense… this design was obviously not used eventually. The main criticism - as I recall - was that the reverse pass does not run in parallel… and another issue was probably that what I asked for required a major overhaul over many things, but maybe now we are closer to this design anyway.

2 Likes

I like the idea of building around the linear model likelihood evaluation in brms. I’m not sure they’re the best justification for doing reverse mode in parallel because this is the type of model that we can do using nested autodiff in forward mode. It’s still a good example cause if we do any good here we help a lot of models, but the bigger gains will come from situations where nesting reverse mode in a forward pass would be inefficient.

I also like the idea of doing parallelism via scoped/nested stacks like in @wds15 's previous design doc (without copying stuff back to the main stack). We can presumably write the first versions of this without threading at all, and just make sure it is general enough to come back to later. Maybe this can also hook into how the GPUs interact with the main autodiff stack.

So the simplest thing is if we have something like this:

int N;
T1 A[N];
T2 B[N];
for(n in 1:N) {
  B[n] = g(A[n]);
}
var lp = f(B);

and our intention is to replace the for loop with a function like:

B = parallel_map(g, A);

that is intended to do the same thing. In this case T1 and T2 are supposed to be generic Stan types.

In this case the autodiff stack for the for loop execution would look like this (each iteration of the for loop creates a vari for the operation):

B[1] = g(A[1]) // There will be a vari in the stack corresponding
               //   to this operation
B[2] = g(A[2])
...
B[N] = g(A[N]) // Nth g evaluation
f(B)   // evaluation of f

To do reverse mode autodiff, we reverse iteration from the end of that a back to the top calling chain every time. The chain of each operation is responsible for accumulating the adjoints at its output into the adjoints at its input via adjoint_input = adjoint_output^T * jacobian_of_f

In the second case, the main autodiff stack could look something like:

B = parallel_map(g, A)
f(B)

And then what would be cool is if there was a way to allocate N nested stacks that the vari associated with B = parallel_map(g, A) would know about. So we would also have a bunch of stacks (each separate code block indicating a separate stack):

B[1] = g(A[1])
B[2] = g(A[2])

B[N] = g(A[N])

And when the parallel_map vari chains, it knows to make each of these nested stacks chain as well.

So maybe psuedo code for the function parallel_map would look like:

struct parallel_map {
  int N_stacks;
  nested_stack **stacks;
  
  template <typename F, typename T>
  auto operator()(const F& f, const std::vector<T>& A) {
    std::vector<decltype(f(A[0]))> B;
    // allocate a new nested stack for each input
    N_stacks = A.size();
    stacks = arena_allocate<nested_stack*>(A.size());
    
    for(size_t n = 0; n < N; ++n) {
      // allocate new stack in arena
      stacks[n] = new nested_stack();
      
      {
        // Use RAII to become this stack until the end of the block
        use_nested_stack my_stack(*stacks[n]);
        B.push_back(f(A[i]));
      }
    }
    return B;
  }

  void chain() {
    for(size_t n = 0; n < N; ++n) {
      // Use RAII to become this stack until the end of the block
      use_nested_stack my_stack(*stacks[n]);
      grad(); // chain the stack
    }
  }
};

I am not sure what happens with set_zero_adjoints, recover_memory, set_zero_adjoints_nested and recover_memory_nested because I am not 100% how those work at a baseline lol.

I kind of like the idea of these nested stacks possibly being specialized to other types of parallelism. Like we replace:

stacks[n] = new nested_stack();
      
{
  // Use RAII to become this stack until the end of the block
  use_nested_stack my_stack(*stacks[n]);
  B.push_back(f(A[i]));
}

with

stacks[n] = new tbb_stack(f, A[n]);

or

stacks[n] = new gpu_stack(f, A[n]);

or

stacks[n] = new mpi_stack(f, A[n]);

This is still assuming that every stack in Stan only gets grad’d by one thread. All parallelism comes through varis that manage more nested stacks.

Realistically I’m not sure it’s the right abstraction for all of these though. For instance, it breaks up the work into chunks without letting the TBB manage it.

In Sebastian’s previous proposal he talked about making use of thread local stacks.

Skimming the TBB docs for parallel_for, maybe something like:

struct parallel_map {
  thread_local_nested_stack *stack;
  
  template <typename F, typename T>
  auto operator()(const F& f, const std::vector<T>& A) {
    std::vector<decltype(f(A[0]))> B;
    // allocate a new nested stack for each input
    stack = new thread_local_nested_stack();

    parallel_for(blocked_range<size_t>(0, N), 
      [&](const blocked_range<size_t>& r) {
        // There's probably some nice RAII way to do this without `auto my_stack`?
        auto my_stack = stack.use();

        for(size_t i=r.begin(); i!=r.end(); ++i) 
          B[i] = f(A[i]);
      }
    );

    return B;
  }
  void chain() {
    // I really don't know how to code this
    for_every_thread_in_thread_pool({
      stack.grad();
    });
  }
};

And if the threadlocal thing isn’t feasible we could do something with concurrent_vector<T> types:

struct parallel_map {
  thread_local_nested_stack *stack;
  
  template <typename F, typename T>
  auto operator()(const F& f, const std::vector<T>& A) {
    std::vector<decltype(f(A[0]))> B;

    concurrent_vector<nested_stack*> stacks_tbb;

    parallel_for(blocked_range<size_t>(0, N), 
      [&](const blocked_range<size_t>& r) {
        nested_stack *stack = new nested_stack();
        use_nested_stack my_stack(*stack);

        for(size_t i=r.begin(); i!=r.end(); ++i) 
          B[i] = f(A[i]);
        }

       stacks_tbb.push_back(stack);
    );

    stacks = arena_allocate<nested_stack*>(stacks_tbb.size());
    N_stacks = stacks_tbb.size();
    for(size_t i = 0; i < stacks_tbb.size(); ++i)
      stacks[i] = stacks_tbb[i];
  }

  void chain() {
    parallel_for(blocked_range<size_t>(0, N_stacks, 1), 
      [&](const blocked_range<size_t>& r) {

        for(size_t i=r.begin(); i!=r.end(); ++i) 
          nested_stack *stack = stacks[i];
          use_nested_stack my_stack(*stack);
          grad();
        }, simple_partitioner());
  }
};

What’s weird there is that whatever choices we made during the forward pass about allocating calculations to different stacks we’re stuck with on the reverse pass.


To handle the race condition on the reverse pass, I wonder if we could do something like create a threadlocal copy of the input? So to be clear, if we have:

B[1] = g(A);
B[2] = h(A);

on the autodiff stack and we want to do the chains of both of those operations together, there is a race condition accumulating the adjoints of A.

So a threadlocal copy might look like:

const auto* tl_arg = to_thread_local(arg);

with the idea that if arg is anything related to doubles, it just gets passed through, but if it is anything with var s in it, it gets copied to a threadlocal version of the same variable. If we’re only running with one thread, it can just pass the var objects through as well.

This is different than how we handled this type of thing before. What we did before with reduce_sum was on for every chunk of work we made a deep_copy of all shared input varis. So for every parallel work chunk, we accumulated adjoints back into a parallel work specific vari.

In this, we would accumulate everything on the reverse pass back into threadlocal versions of the inputs. So programmatically it would kinda look like everything was accessing the same inputs but they’d actually be different (and we’d still need to code up the propagation of the adjoints back from the threadlocal thing to the original thing).

This seems a bit weird to me but these deep_copy s were expensive and we want to minimize how many of them we do.

I’m adding usual suspects @tadej and @rok_cesnovar. Also adding @jamesyang916, author of FastAD (look at the benchmarks here)

The thing I’m talking about here is similar to what @wds15 previously proposed. The reason we’re talking about it at all is @andrjohns started a design document https://github.com/stan-dev/design-docs/pull/28 where it would be advantageous to have parallelism in reverse mode (to avoid possibly expensive nested Jacobian calculations). If there are other ideas of doing parallelism aside from the nested stack, that is fine for discussion too.

1 Like

One question about the pseudocode: regarding the function f that gets passed into parallel_map's operator(), are we assuming that f does not depend on any vari's? Just as an example:

std::vector<var> A = ...;
var y = ...;
auto f = [&](const auto& a) { return y += a; };
auto B = parallel_map()(f, A); 

You would have a race condition on y since that’s shared on the main stack, right? Is this something to worry about for your use-case?

Yeah good catch. We’re making that assumption since all functions in Stan require you to pass in all your arguments, but it won’t be a safe assumption for long.

There was a design doc for closures: https://github.com/stan-dev/design-docs/blob/master/designs/0004-closures-fun-types.md, and the first implementations of stuff like that has shown up over here: https://github.com/stan-dev/math/pull/2094

I don’t think we’ve decided exactly how the Stan compiler will compile Stan closures to C++ things but there’ll be some extra functions or something to get information about this.

I have a hard time wrapping my head around nested sometimes, but can the node in the parent AD tape reference multiple stacks? I think Ben and I are on similar wavelengths here, though I think I’m flipping his idea inside out. If we had some RAII mechanism for collecting groups of stacks that are going off in parallel, as long as we know those specific stacks are independent of one another then couldn’t we run those in parallel as well? Is that independence a much harder thing to assume than I’m giving it credit? In my brain space I’m thinking like

// in some function...

// Let the process know we are about to make a group of stacks
start_nested_group par_group;

parallel_for([&](const yada& blah) {
  // this nested stack knows it's in par_group
 // Each nested stack tell's par_group it exists
  start_nested nested_stack(par_group);
  // do parallel stuff...

});
// End of function, par_group cleans up and makes a 
//  callback on the stack to trigger grad()
//  for each nested_stack in parallel
}

Where the callback could be something like

class start_nested_group { 
  // Pointer to each nested stack is pushed over here
  // fyi arena_allocator pushes the vec's elements
  //  onto our stack
  std::vector<ChainableStack*, arena_allocator> stack_group;

  // Imagine some constructor or whatever here
  // ...

  // On cleanup, pop off a callback
  //  to run the grads in the group
~start_nested_group() {
  // Copy stack into lambda
  reverse_pass_callback([stack_group]() mutable {
   parallel_for([&](const yada& blah) {
      // Call each group's reverse pass in parallel
      // This grad() isn't actually a function 
     //   but we could make it
      stack_group[i].grad();
   });
  });
}
}

I was being very lazy about shared memory in what I wrote. So if node == vari, then yeah, I am assume any thread can read any vari in any stack and I’m using that (when threads make local copies, they read the memory themselves and make a copy of it rather than being sent a message).

We just have to have a mechanism to synchronize a reduce on the forward pass (if we have one) and a reduce on the backwards pass over the input adjoints. Other than that, I think we’re always assuming independent calculations.

I like what you wrote. It is simpler, though I don’t think it tries to take advantage of any thread local stuff (which might be more trouble than it’s worth but I don’t know).


We need to somehow copy arguments passed to the lambda into a local bit of the stack first.

Maybe we should do:

par_group.parallel_for([](const yada& blah) {
  // When this code is called, par_group has automatically
  //   initialized a nested stack and `blah` has been copied over
});

We don’t want to run on destruction. We want to run whenever our chain is called just like any other vari.

And we’ll also need a reduce over whatever copies of shared arguments are made. Like:

class start_nested_group : public vari {
  ...
  void chain() {
    // Done in parallel
    parallel_for([]() {
      stack_group[i].grad();
    });

    // In its simplest form, done by one thread
    for(size_t i = 0; i < stack_group.size(); ++i) {
      input_adjoints += stack_group[i].input_adjoints();
    }
  }
};

(Edit: changed nested_stacks variable to stack_group)

Yeah note my above destructor just calls a reverse_pass_callback()so it doesn’t execute anything but puts that callback on the stack

Oooh, okay I get it. Well I think this could just adopt the reverse_pass_callback schtick of being a vari itself (so by being allocated it puts itself on the vari stack and the destructor does nothing and its chain will get called).

Oh yeah v true!

Whow… what a post… I like a lot of this!

I am not sure I fully follow along the parallel_grouped_nested thing, but the idea to expand the notion of nested makes sense. What confuses me about that way you, @stevebronder, proposed is that the thing goes out of scope during the forward pass of AD, but there is some additional work to be done when doing the reverse pass (but maybe this is my lack of knowledge on the reverse_pass_callback-thing).

In any case, the key - from my view - is that during the forward sweep of AD we must be able to control to what instance of the AD tape things are written. Right now we always write to the thread local AD tape instance. It still makes sense to write to a thread local thing - it’s just that we must have control to what instance we are writing to. That’s the key to record parts of a larger function in smaller pieces.

The second problem which needs to be solved is synchronisation during the reverse sweep. If we run parallel grad calls, then the input operands will receive updates in parallel which must be synchronised. One possibility to solve all this could be

  1. Create for each thread we use a sub AD tape
  2. Copy the input operands as the first thing to each of the sub AD tapes
  3. Within a given thread the input operands within the thread are being used during the forward pass
  4. During the reverse sweep we can then call in parallel grad in each sub AD tape up until we reach the copied input operands.
  5. The grad method of the input operands on each sub AD tape must be called sequentially in order to synchronise the update to the original input operands.

With this scheme we would copy the input operands only once per thread (instead of every time we execute things in some thread).

The TBB has a number of useful things in this context which we should take advantage of. For example, a an enumerable_thread_local for lazy thread local instances (lazy in the sense that only those threads where actually an instance is requested you are creating one, very neat), or combinable which is great for summing up stuff in multiple threads.

How should we proceed on this? We should align on the general concept and then create a (dirty) prototype of this would be my suggestion.

Ah, and yeah, the concept to go towards many smaller AD tapes which are linked together should enable to have parts of the AD tape live on the GPU or any other device.

Check how reverse_pass_callback works: https://github.com/stan-dev/math/blob/develop/stan/math/rev/functor/reverse_pass_callback.hpp#L38

It’s actually a factory function (so we get template deduction) that builds a vari that gets allocated on the autodiff stack in the background.

So there are chunks of work, there are thread local things and I know what these concepts are – what does ‘instance’ mean in this case? And why do we want to control where things are written? Isn’t this TBB’s job to send work to appropriate threads?

So in this case you are organizing the nested stacks around the number of threads in the thread pool, not the chunks of work that are sent to the threadpool.

I think the simpler way is writing this around the chunks of work sent to the threadpool, in which case if we did:

par_group.parallel_for([](const yada& blah) { ... });

the argument deep_copies to avoid the race condition can happen in the parallel_for function. Like it could be:

nested_group {
  ...

  template<typename F>
  void parallel_for(const F& f) {
    copy_arguments_from_parent_stack

    tbb::parallel_for([]() {
      auto stack_ptr = new nested_autodiff_stack;
      use_nested_stack(*stack_ptr);
      auto local_args = deep_copy(args);

      f(local_args);
   });
  }
};

I think you’re pointing to the thread_local stuff that I don’t know anything about which would enable moving from AD stacks broken up by chunk of work to AD stacks organized by thread. The second will imply less copying to avoid race conditions. One problem with threadlocal is that our forward pass decides the scheduling for the reverse pass (if one thread does less work in the forward pass, it does less on the reverse pass too)

I would like to see:

  1. A nested reverse mode autodiff stack using one thread and no TBB (to get a picture of how all the zeroing, memory recovering, etc. will work first)

  2. A more complete comparisons in the difference in nested stacks attached to every chunk of work vs. thread-local nested stacks. This very well may just be a quick a dirty implementation both ways.

  3. An MPI design – would it favor chunks of work or would it look more like a threadlocal implementation? I wonder if the easier MPI design might look like the threadlocal thing.

At the moment we have one instance of the AD tape for each thread. So if we run 6 threads we will have 6 instances of a ChainableStack. Within each thread you always access the AD tape through the thread local pointer ChainableStack::instance_ which points to the thread specific AD tape. So it is the running thread context which determines what AD tapes things get recorded to - and that’s currently hard-coded and cannot be changed. When I say “instance” of the AD tape must be controllable, I mean that I would like to be able to have control over the actual AD tape instance where things get recorded to. In practice this means that we simply make ChainableStack::instance_ point to a fresh AD tape instance (under our control), then execute things (such that stuff gets recorded on to the fresh AD tape), and then we swap in back the previous AD tape. This way we can control where things are stored to during the forward sweep.

The TBB just dispatches work to threads, yes… but we want the work which happens within that thread to record it’s operations to a custom AD tape (and not the one which sits in the thread).

I think it’s more efficient to target threads instead of tasks of work, yes; this is mainly due to the need to deep_copy which should not happen by work chunk (which are many), but by thread ideally.

Yes, that’s correct. It could be an issue, but maybe it’s not. In case it’s an issue one can probably work with it.

I like going simple first (without threads). The only thing I would suggest to also bake into that prototype is a thread safe reverse sweep which requires synchronization of the adjoint accumulation into the input operands.

Do you have time to work on this? I may find the time to resurrect some of my ideas, but I am not keen to commit to any timeline. So if you are up for fast progress, then feel free to go ahead.

Along the way we should also define a benchmark which we want to beat. Maybe the current reduce_sum on a memory bound problem vs the new thing.

That make sense.

@andrjohns was the one who started talking about more parallel, so my hope is he does it lol.

Still deep in var<mat> land with @stevebronder, but I assume we’ll both review any sort of parallel reverse mode pull like this.

Hmm, I would be happy just matching reduce_sum on something it does, though reduce_sum has that cache locality thing that helps it in a sometimes not so insignificant way.

It might be easiest to just copy-paste the reduce_sum code and use it to develop this, just so everything is tested, etc. Like reduce_sum_2, just for development purposes.

Or @andrjohns already had a parallel_map function developed in a non-trivial way. That would work too.

I’d personally be interested in seeing if this helps parallel_map. @andrjohns if you have time to work on this and want someone to shoot ideas off of def hit me up!

Yeah, that has upside and reduce_sum doesn’t really. The whole point to reduce_sum was designing around limitations that wouldn’t exist anymore.

Yeah for sure, I’m happy to have a first crack. It’s a bit lower–level than what I’ve worked with before, but I should be able to get it started in the right direction!

I’m taking a break from dev stuff for a week or two while I catch up on some papers (oh academia), but I’ll send off a bunch of questions when I get into it

2 Likes

Wooo woooo!

Good luck!

2 Likes

👏I👏 believe👏in👏YOU!!👏

2 Likes