Parallel autodiff v2

Hi!

I am about to resubmit a restructured design RFC for parallelism in reverse mode autodiff, as we discussed earlier this year. My ideas so far only revolved around running the forward sweep of reverse mode autodiff in parallel (when the AD tape is constructed). While working on this I figured that it could also be possible to run the reverse sweep in parallel if we keep track of things carefully… I think. I am wondering if we want to go with a single big RFC or better two independent ones for this. Does anyone have an opinion on this?

While I have a working prototype for the forward parallel story, at the moment I only have a sketch of a design refinement which should allow parallel backward sweeps for reverse mode. I am happy to make a longer post on the forum first about the backward parallel sweep story (I am not yet sure about it). The thing is that opting for a parallel backward sweep as well easily impacts some design choices for the forward parallel thing (though we could generalize a forward parallel implementation). In short, things will take longer, but it could be worth it - I don’t know. Attached is a dump of my notes on this… not really clean!

Best,

Sebastian


Yes! I am all for this. I really like the idea of trying to engineer the reverse mode stuff in the context of language features.

I tried to sit down and write out a basic design that would accomplish this as well. In terms of the AD stack, the conclusion I came to is that it needs to be more flexible in terms of having different stacks n’ such, which I think was a big part of your original RFC.

I like the idea of explicitly splitting out proxy nodes in the AD tree. I hadn’t thought about things that way.

I got it all written out and then realized how much work it would be and got scared haha. It’s tons of work. I don’t know how coherent it all is, but here goes:

Summary

The goal here is to describe the simplest extension to the Stan autodiff system to allow for parallelism in both the forward and reverse mode passes.

Outline of proposed operations

A basic autodiff computation might look like:

a = 1
c = f(a)

and the goal is to farm out f to another computer.

Assume that f, if executing on another computer, does not initiate any remote computations itself. Only the main process is allowed to initiate remote computation.

The forward sequence of operations in the hypothetical parallel system would be:

  1. (main process) assign a = 1
  2. (main process) tell worker to call f with argument a
  3. (worker process) allocate autodiff stack
  4. (worker process) call f with a
  5. (main process) collect result + reference to worker autodiff stack
  6. (main process) assign c = result

The reverse sequence of operations could be:

  1. (main process) increment adjoint of f with adjoint of c
  2. (main process) tell worker to call reverse mode on specified worker autodiff stack with adjoint of f
  3. (worker process) put adjoint of f into output vari, call chain on autodiff stack
  4. (main process) collect adjoint of input of f from worker
  5. (main process) accumulate adjoint of input into adjoint of a

(edit on posting to Discourse: I think the input adjoints here behave like the proxy nodes in @wds15’s post)

When the main process deallocates its stack it should be able to tell the workers to deallocate all worker stacks corresponding to the main process stack as well.
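To make the bookkeeping concrete, here is a rough sketch of both sequences from the main process's point of view. Everything prefixed remote_ and the RemoteStackHandle type are hypothetical placeholders for whatever communication layer would be used; only var/vari come from Stan math.

#include <stan/math/rev/core.hpp>

// Hypothetical placeholders, not existing Stan math API.
struct RemoteStackHandle;
double remote_call(int worker, double arg, RemoteStackHandle** handle);  // forward steps 2-5
double remote_grad(RemoteStackHandle* handle, double adj_of_f);          // reverse steps 2-4
void remote_free(RemoteStackHandle* handle);                             // worker stack cleanup

void sketch_of_remote_call() {
  stan::math::var a = 1;                          // forward step 1
  RemoteStackHandle* handle = nullptr;
  double result = remote_call(0, a.val(), &handle);
  stan::math::var c = result;                     // forward step 6 (would really be a special
                                                  // vari that remembers `handle`)
  // ... the reverse sweep of the main stack eventually reaches c ...
  double adj_a = remote_grad(handle, c.adj());    // reverse steps 1-4
  a.vi_->adj_ += adj_a;                           // reverse step 5
  remote_free(handle);                            // triggered when the main stack is cleared
}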

Operations of the main process and the worker processes

The autodiff system needs to accomplish these things:

  1. Forward-mode calculations
  2. Reverse-mode calculations (via chain)
  3. Clearing the autodiff stack
  4. Nested autodiff

These are all things done by the main process. If worker processes are limited to executing function calls which are part of a larger autodiff, then I think autodiff on the nodes needs to be able to:

  1. Allow for creation of a separate autodiff stack for each forward-mode remote function call
  2. Allow for the reverse mode calculations to be carried out individually for each of these separate autodiff stacks
  3. Allow the memory of each of these stacks to be freed separately by the main process

Extensions to support the main process

I was going to call this section, “Extensions to autodiff” but these things might be “Extensions to the language” or “Extensions to the Stan runtime” or something.

There should be mechanisms for:

  1. Defining data to be shared beyond the lifetime of a single forward/reverse mode sweep.
  2. Calling a function on a remote process.
  3. Tracking memory allocated remotely.

Extensions to support the worker processes

Same preamble as for the main process.

There should be mechanisms for:

  1. Handling data that doesn’t change
  2. Accepting function calls
  3. Allocating new autodiff stacks and switching to them (at which point new varis would be allocated on this new stack)
  4. Calling chain on individual autodiff stacks
  5. Allowing for individual stacks to be de-allocated

Drawbacks

The major drawback to this design is that the work distribution mechanism is very primitive. There’s no job scheduling, and jobs cannot recursively launch more jobs (or anything elaborate like that).


Like a callback to clean up memory allocated somewhere else? Should a subprocess be in charge of cleaning up its own memory?

TBB handles nested parallelism pretty well. Dunno if you still want to use MPI, but it may be better to handle all of this in the one-computer case with N threads and then have an MPI layer on top of that to dish out multiple chains etc. Get 100% usage on one computer, then talk about the cluster case.

Hi!

This all (@bbbales2) looks like it’s going to be geared toward MPI - which I really don’t think is an efficient route for this task. MPI could possibly be used to link different chains, but using it to represent a distributed AD tree will be super challenging.

Reflecting a bit more I think we should go about this in these steps:

  1. Start with a refactored parallel design RFC which is based on my earlier version this year
    • this will start to allow having multiple AD tapes during parallel phases, which for the moment we always merge back into a single tape at the end of the parallel regions.
    • I will include the bits and pieces for parallel reverse sweeps in the appendix, so that we can ensure that the design of this first step will be extensible to a future where we also run the backward sweep in parallel.
  2. The next bigger RFC can handle a scattered representation of the AD tree. This is really another story to be worked on as a next thing. It will make things more efficient for the parallel forward sweeps, but the main point of doing it is to be able to put parts of the AD tape onto the GPU. That would be a big win for the GPU work. Maybe MPI integrates with this as well (not sure).

The reason why I think we should go with the simpler option first (forward parallel sweeps only) is:

  • It’s already complicated! We should do this first as it is a huge step forward already (5x speedups on 6 cores). Earlier this year we got lost in a huge discussion which totally derailed. Keeping it simpler should increase chances of getting it through.
  • Creating the “proxy” AD nodes is necessary - but for pulling this off one would need a tight integration with the parser, because only the parser knows for which AD nodes we need a proxy and for which ones we do not need one. I am all for this, but I have no experience with the parser.
  • The key is to design the current step such that all the groundwork on the AD structure is done now; at least for the purpose of running the backward sweep in parallel as well (for which parser integration work is needed and implementation of special AD nodes, but there would not be any more additional structural changes to the AD tape design).

I hope that makes sense to everyone.

Sebastian


Absolutely. The only requirement is that the topological sort of the AD expression graph is respected. More precisely, every expression needs to have its adjoint fully evaluated before propagating to its operands. As long as you respect that, order doesn’t matter and reductions can proceed in parallel.

The same conditions apply as for what the autodiff literature calls “checkpointing”, which is what we do with MPI calls in the forward pass: independently reduce a subgraph to its partials.

In essence, our nested autodiff uses “different stacks” that just happen to share a data structure.
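For reference, this is roughly what the existing nesting already gives us on that shared data structure, using the real start_nested / recover_memory_nested calls (sketch only; real call sites also guard with try/catch so exceptions still recover the nested memory):

#include <stan/math/rev/core.hpp>

// Everything recorded after start_nested() can be swept and then freed
// without touching the enclosing tape.
double nested_gradient_example(double x_val) {
  stan::math::start_nested();            // open the "inner stack"
  stan::math::var x = x_val;             // recorded on the nested part
  stan::math::var y = x * x;
  y.grad();                              // reverse sweep to get dy/dx
  double dydx = x.adj();
  stan::math::recover_memory_nested();   // frees only the nested part
  return dydx;
}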

+1 to making it easy to understand and review and code.

Are you thinking of automating this in the parser or having language features for parallel reductions like MPI? We don’t have the option of analyzing the autodiff stack the way some of the static-taping autodiff systems do.

My thinking is to use the parser to extract the information about dependencies. The performance comes from running reductions on large data structures - say a big vector. This is chunked into smaller pieces. Assume that the reduction of each block also involves another global variable. Thus per chunk I will have the operands of the smaller chunk of the big vector and I will need what I call a “proxy” copy of the global variable. The information about what needs a proxy and what does not cannot be deduced from the types. It must come from the parser (or the user, but that is not an option). This structural information should be available in the parser, I believe.

Yes - and in the map_rect call we know exactly which operands lead into which expression, which makes it easy to implement. The crux is that the structure we impose makes things awkward to use and kills vectorisation performance. Giving up the structure solves the usability issue (big point) and gives us good vectorisation performance (another big win), but we lose the structural information, which creates the need to approach it differently, as proposed.

I am afraid it will still take considerable brain power to work through it…

Sebastian

That’d depend on how it was implemented, but memory is deallocated when the autodiff stack goes away. In this model there is still one main autodiff stack (all the other autodiff stacks connect to it), and so the lifetime of the main stack determines the lifetimes of the little stacks.

Yes, let’s do this.

Yes. The situation we have before us is that it took something like 8 months to get the TBB into stan-dev/math, and that doesn’t feel very efficient, but I think this discussion is on track (2nd part of: https://github.com/stan-dev/design-docs/pull/5#issuecomment-498430451)

I did write this with MPI in mind. I tried to leave out explicit MPI API references on purpose, but it still leans one way :D.

Let’s roll with this. If we think threading is better programmatically then we need to think about:

  1. What would a threaded version of this look like?
  2. What opportunities do we have to take advantage of the extra flexibility of threading?
  3. What does this mean in terms of the Stan language?
  4. What does this mean in terms of the problems we’re trying to parallelize?

So my understanding of comparing threaded programming to MPI is that in threaded programming threads are easy to create and destroy and processes are not.

What this means in practice is that for MPI, if you have a pile of work, you divvy it up amongst a fixed number of workers with some sort of scheduler. In a threaded programming environment, you can conceptually make a thread for every little piece of work you want to do (though it might be implemented with a worker pool and scheduler behind the scenes).

And you can emulate MPI behavior in threaded land by using thread-local storage and such (giving everyone a thread id), but you have more flexibility too (though we still might need thread-local storage because of how the autodiff library is structured).

So with that in mind, this is the MPI version of the process outline above:

  1. (main process) assign a = 1
  2. (main process) tell worker to call f with argument a
  3. (worker process) allocate autodiff stack
  4. (worker process) call f with a
  5. (main process) collect result + reference to worker autodiff stack
  6. (main process) assign c = result

And then in a threaded implementation we change that a bit because threads are easy to create:

  1. (main thread) assign a = 1
  2. (main thread) allocate new autodiff stack
  3. (main thread) create copy of a on new autodiff stack (new vari)
  4. (main thread) create worker thread
  5. (main thread) tell worker thread to call f with argument a using new autodiff stack
  6. (worker thread) do work
  7. (main thread) copy output of f into original autodiff stack with special parallelization vari (that points back to the newly allocated stack)
  8. (main thread) assign c = output

Okay that is simpler cause we don’t have to copy things around. The main process can directly see the autodiff stack that the worker thread populated.

This also makes free-ing a little simpler. The main thread doesn’t need to communicate to other processes to deallocate memory. It can just do the free-ing itself.

That also makes reverse mode simpler, because the same thread that did forward mode doesn’t need to do the reverse mode part.
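As a sketch of what the “special parallelization vari” from step 7 could look like: only vari itself is real Stan math here; ChildStack and chain_child_stack are placeholders for the new machinery.

#include <stan/math/rev/core.hpp>

// Hypothetical placeholders for the separately allocated child stack.
struct ChildStack;
void chain_child_stack(ChildStack* child);  // run reverse mode over one child stack

// Lives on the main stack but points back to the child stack the worker
// thread populated during the forward sweep.
struct parallel_call_vari : public stan::math::vari {
  ChildStack* child_;
  stan::math::vari* child_output_;   // the vari holding the result of f on the child stack

  parallel_call_vari(double val, ChildStack* child, stan::math::vari* out)
      : stan::math::vari(val), child_(child), child_output_(out) {}

  void chain() {
    child_output_->adj_ += adj_;     // hand the adjoint of c down to f's output
    chain_child_stack(child_);       // then sweep the child stack
  }
};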

And looking at this we might see another opportunity! If allocating new autodiff stacks is thread-safe, then there’s no reason a worker thread couldn’t create more worker threads and so on, which gets back to @stevebronder’s comments about nested parallelism.

The counterargument to adding this complexity would probably come from the application level. The types of parallelization I want to do in models is apply/map/parallel-for. I don’t have a need for nested parallelism, but I could be wrong.

Given all that, we can say if we go with the non-recursive calls, we can map that to MPI or threading without much difficulty and we aren’t losing modeling flexibility.

Then the question becomes what does allocating a new autodiff stack really mean? Can this be the same operation in a threaded environment and an MPI environment? I don’t see why it wouldn’t be, but maybe I’m missing something.

The basic thing you want to do is a map + accumulate sorta thing right? And that’s why we only need forward mode now?

Efficient to code or efficient to run? I don’t think we’ve answered the first and I don’t believe the second.

I don’t think this is true. Here are all of our distributions parallelized with OpenMP: [WIP] openMP by bgoodri · Pull Request #800 · stan-dev/math · GitHub. It was simple and done and easy to understand but did not make it through.

I think this gets easier if we generate more buy in from the other Stan devs before more work starts, especially the ones that might be reviewing this (@yizhang or @syclik). We just gotta do things in a way so that everything works. This doesn’t mean any single person has to code everything – just gotta make sure nobody is gonna be fundamentally unhappy with the result. It may be impossible in the end, but I don’t think we’re there yet.


No, we are not yet quite there - and it makes me worry a bit that we seem to derail on threading vs MPI.

One really strong reason for me to go with the TBB was that we can actually forget about threads. We can focus on composing our algorithms using the building blocks from the TBB which are very expressive. With the TBB you don’t

  • worry where your data/vars/you name it is (like in what process) - it’s just there in your memory
  • nor do you worry about in what state a remote process is - you just fire off work
  • and finally you can totally forget about the low level stuff even threading brings - we merely focus on chunking our to-be-done work into tasks which get scheduled by the TBB automatically into the threadpool

The magic of the TBB is obviously not totally straightforward to exploit when it comes to doing reverse mode autodiff. There is a bit more bookkeeping to be done - and that is exactly what is already successfully implemented in a prototype.

Should we have a meeting to discuss how to move forward?

I recall that 2-3 months ago, in a Stan math meeting with @syclik and @rok_cesnovar (at least those two), we agreed that I will resubmit my parallel design RFC. @syclik wanted more clearly spelled out requirements for the AD tape and @rok_cesnovar asked for a graphical representation. I think I basically have both ready now and would submit it in the coming days… but if we should talk before starting this, then let’s do so? Maybe we quickly align on a path forward at the Stan meeting on Thursday? Probably it needs a dedicated meeting besides the general discussion round, but let’s see (sometimes the agenda is empty, sometimes it’s not).

We should definitely not take another 8 months for these things - totally agreed (some of the wait is due to myself being slammed with non-Stan work).

Sebastian

But any threading library interacts with the autodiff stack somehow. There’s no way TBB knows enough about our autodiff stack to avoid race conditions on memory and such.

If the threads touch vars, we need to worry.

And what is this stuff, and how do I understand it outside the context of the TBB? This is what I want to know.

The TBB has been designed to add parallelism to an existing source base. As such it provides a lot of hooks and callbacks which make it easy to plug it into our existing source base.

One example is the way we currently initialize the thread-local AD tape, see here. The TBB has a so-called observer object which allows a callback upon the event “thread enters TBB scheduler” - and that makes it straightforward to have a readily initialized AD tape sitting in each thread managed by the TBB.
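For concreteness, the observer hook looks roughly like this; the tape init/cleanup calls are placeholders for whatever stan-math actually does at thread entry:

#include <tbb/task_scheduler_observer.h>

// Hypothetical placeholders for the real thread-local tape setup/teardown.
void init_thread_local_ad_tape();
void release_thread_local_ad_tape();

// Every thread entering the TBB scheduler gets a ready-to-use AD tape.
class ad_tape_observer : public tbb::task_scheduler_observer {
 public:
  ad_tape_observer() { observe(true); }  // start receiving scheduler callbacks

  void on_scheduler_entry(bool /*is_worker*/) {
    init_thread_local_ad_tape();         // thread joins the pool: give it a tape
  }
  void on_scheduler_exit(bool /*is_worker*/) {
    release_thread_local_ad_tape();      // thread leaves the pool: clean up
  }
};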

Now, the TBB itself is designed around task objects which represent a chunk of work. With the scheduler observer stuff I mentioned, what already works now is dumping tasks to the TBB which do nested AD operations within a task, since then the full AD tape resides in a single thread. Cool, eh?

The more interesting thing is, of course, to evaluate a big function with multiple tasks (and thus threads). The first step is to forget threads and think in terms of tasks. The task class is an abstract class, see here, which the user derives from. The key thing it defines is an execute function which represents the work to be done. Now, with the building blocks I propose, we can wrap this execute call such that we inject the right setup code for having the thread-local pointer point to the right AD tape belonging to a bigger task group.
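Roughly, using the classic tbb::task API that was current at the time (scoped_ad_tape and ad_tape_group are placeholders for the proposed bookkeeping):

#include <tbb/task.h>
#include <functional>

// Hypothetical bookkeeping around the actual work.
struct ad_tape_group;
struct scoped_ad_tape {
  explicit scoped_ad_tape(ad_tape_group* group);  // point the thread-local tape at group
  ~scoped_ad_tape();                              // restore the previous tape
};

// Task whose execute() is wrapped so that all varis created by work_ end up
// on the tape belonging to the task's group.
class ad_task : public tbb::task {
  std::function<void()> work_;
  ad_tape_group* group_;

 public:
  ad_task(std::function<void()> work, ad_tape_group* group)
      : work_(std::move(work)), group_(group) {}

  tbb::task* execute() {
    scoped_ad_tape guard(group_);  // inject the AD tape setup around the work
    work_();
    return nullptr;                // no explicit continuation task
  }
};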

At least this could be a way forward, I think.

Unfortunately, the super-comfy algorithms like parallel_for from the TBB don’t allow you to use your own task (at least I haven’t seen the hook); but that’s not really an issue, since it is very easy to wrap the AD tape handling around the actual work which is to be done.

Using the parallel_for from the TBB means that your work gets chunked into independent pieces which are sized according to whatever the scheduler thinks is best given the current load. These pieces will be executed in some thread. So what you need to realize here is that independent chunks get executed in some thread - and when they run… well, they run in that specific thread. In order to keep track of the AD tape, all we need to do is swap the thread-local pointer to the right chunk of the AD tape. This is what I called a scoped AD tape in my design RFC.
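Stripped down to its core, the pointer swap behind such a scoped AD tape is just this (tape_storage and thread_local_tape are placeholders standing in for Stan math's thread-local stack; exception safety and nesting are omitted):

#include <utility>

// Placeholders: in Stan math this corresponds roughly to the thread-local
// AD tape and the pointer through which new varis find it.
struct tape_storage;
extern thread_local tape_storage* thread_local_tape;

class scoped_ad_tape_sketch {
  tape_storage* local_tape_;  // tape owned by this scope (allocation omitted)

 public:
  template <typename F>
  void execute(F&& f) {
    tape_storage* parent = thread_local_tape;  // remember the caller's tape
    thread_local_tape = local_tape_;           // new varis now land on the local tape
    std::forward<F>(f)();
    thread_local_tape = parent;                // restore on the way out
  }
};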

I really think this is the way to go… and it took me a while to figure it all out, so if the above is not so clear we can discuss if you think it helps.

In this way, TBB is doing the same thing with threads that would happen if you launched a bunch of MPI processes (they’d all have their own stacks).

Not to say these are the same by any means (TBB provides a scheduler and also we’re shared memory), but this model isn’t incompatible with the reverse mode design I sketched out above, so I like that.

This sounds like if we have two different functions to call in parallel, f(a) and f(b), they might get executed by the same thread and so they’ll allocate on the same autodiff stack.

Are we getting a stack per task or a stack per thread? I’d prefer the first.


I had a look at the older proposal and the part of the ScopedChainableStack thing I didn’t like was how it can be copied back to the main stack.

I don’t think we need to do this. I think ScopedChainableStacks can be managed by varis in the main autodiff chain, and we can link in these separate autodiff stacks at appropriate times during the chain operation (and then we get reverse mode for free).

So here’s an example vari in the form of the adj_jac_apply operator (ask if you have questions).

The goal is to call the function ‘func’ on each element of an array of scalars (func returns a scalar).

#include <functional>
#include <tuple>
#include <vector>

// ScopedChainableStack, allocate_chainable_stacks and wait_all are the
// hypothetical extensions discussed in this thread, not existing Stan math.
class tbb_map_rect_op {
  ScopedChainableStack* stacks_;  // one child stack per job

 public:
  template <typename F>
  std::vector<double> operator()(const F& func,
                                 const std::vector<double>& individual_params) {
    int J = individual_params.size(); // number of jobs

    stacks_ = allocate_chainable_stacks(J);

    for (int j = 0; j < J; j++) {
      // run func(individual_params[j]) so that its varis land on stacks_[j]
      stacks_[j].execute(std::bind(func, individual_params[j])); // dunno if bind is actually very fast or not
    }

    wait_all(stacks_);

    std::vector<double> output(J);
    for (int j = 0; j < J; j++) {
      output[j] = stacks_[j].tail_vari()->val_; // value of func for job j
    }

    return output;
  }

  std::tuple<std::vector<double>> multiply_adjoint_jacobian(const std::vector<double>& adj) {
    int J = adj.size();
    std::vector<double> adj_times_jac(J);

    for (int j = 0; j < J; j++) {
      stacks_[j].tail_vari()->adj_ = adj[j];
      stacks_[j].grad(); // Presumably grad launches a new task to handle going through all the varis and calling chain in each stack
    }

    wait_all(stacks_);

    for (int j = 0; j < J; j++) {
      adj_times_jac[j] = stacks_[j].head_vari()->adj_; // adjoint of job j's input
    }

    return std::make_tuple(adj_times_jac);
  }
};

This does parallel forward and reverse mode.

  1. In these ScopedChainableStacks you’d need to be able to reference the first and last varis. There’s an assumption here that this tbb_map_rect_op knows what type of varis to expect on the top and bottom of the stack. We know what to expect because we specified the signature of func.

  2. You’d need to allocate them on the ChainableStack thing that calls destructors so that those destructors could free the memory allocated in any other stacks (I forget how this is done). I may have muddled details of allocation on the ChainableStack memory thing. I would need to look at that closer to know exactly what should happen.

Doing this, I don’t think we need an append_to_parent_stack() operation. I’m not sure about a recover() operation.

I will answer in more detail later. I also did not like the copying stuff (the append) too much… but that’s the easiest thing to do. Otherwise you would need to take special care of things like set_zero_adjoints, for example. So if you want to avoid shaking up everything too much, then you copy things back to a single tape. I would suggest shortcutting here by copying. We can optimize that later would be my position for now.

Need to look at your suggestion.

Btw, having independent AD tapes per thread is not the same as the MPI stuff… you still have shared memory, so you do not need to worry about copying stuff around. References will do already.

Not sure if I get everything, but here are some thoughts I have on this… there are a few problems from my perspective. One question: where are the vars instantiated on the sub-stacks? I don’t see that.

  • So first, what’s good is that you nicely hold all ties to the sub-stack together and there is no need to do the append thing. However, the for loops you do basically do the copying work - so there is not too much gained in this regard (unless the function called creates a huge internal tree, I guess).

  • A clear plus is that you can run the reverse sweep with parallelism, but this comes at the price of having to calculate the function values using doubles in the forward sweep. That’s probably very inefficient for the eager calculations we do with all the prob stuff. Still, this logic could be implemented in my original approach as well, I think.

  • Because you scatter the different scoped AD tapes and do not merge them, you will fail to set all adjoints to zero (the set_zero_adjoint method of varis is not virtual, for good reasons). That’s a big one. When I discovered this while thinking about the multiple AD tape design, I figured that I would need to create a tree of the AD tape dependencies… yuck. So I went with making it simpler by just throwing everything together.

  • I also do not like that you need to know what exact operands go into it. This makes it very hard to program, but I do admit that this is valuable as it opens up more possibilities to do tricks. In particular, the reverse sweep can only be parallelized if you have knowledge of the operands.

  • Apart from that… this map_rect style thing is not what we really need. This design encourages dropping vectorisation! And this is really big again. We want parallelism over huge containers over which, in a serial program, we would call a vectorised statement. In the parallel world we want to let the TBB decide how to chunk the work (so we do not control the exact splits). However, when you allow for that, then you no longer know which operands get processed on which sub-stack. It will be a blocked selection.

  • I am also not sure if this design works with nested parallelism. Maybe. What is also problematic is nested AD.

  • Whenever I see “wait” I am worried about deadlocks. I am not saying there is one, but that’s to worry about.

Maybe I did not get a few points here and I probably sound rather negative. Sorry… it’s very complicated. Now, I do agree that the approach you bring up is a starting point to parallelise the reverse sweep.

Yeah, my bad, I left this part out.

I think what needs to happen is this code:

stacks_ = allocate_chainable_stacks(J);
      
for(int j = 0; j < J; j++) {
  stacks_[j].execute(std::bind(func, individual_params[j]));
}

needs to change to something like:

stacks_ = allocate_chainable_stacks(J);

// These new vars will act like the proxies
// We don't use the vars from the original stack cause that isn't threadsafe
std::vector<var> individual_params_allocated_on_new_stack(J);
      
for(int j = 0; j < J; j++) {
  individual_params_allocated_on_new_stack[j] = copy_to_new_stack(stacks_[j], individual_params[j]);
  stacks_[j].execute(std::bind(func, individual_params_allocated_on_new_stack[j]));
}

So we copy the input arguments into varis allocated on the new stack. Because of this, we’d need to be able to know which inputs are varis or not (so we don’t allocate vars for things that aren’t vars).

There’s a variable for this in the actual adj_jac_apply implementation, but that abstraction is probably the wrong thing. I can rewrite this as a vari pretty easily if that would help (otherwise I’ll skip that just to avoid dumping more code here).

These copies shouldn’t be that big. It’s just the number of inputs + outputs.

Good call. But a simple solution to this is to just keep track of all the stacks we allocate?

That could be embedded in:

stacks_ = allocate_chainable_stacks(J);

And when we zero’ed the main stack, we could just loop through and zero each of those.
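Something along these lines, assuming allocate_chainable_stacks registers every child stack it creates in a global list (only set_zero_all_adjoints is real Stan math; the rest is hypothetical):

#include <stan/math/rev/core.hpp>
#include <vector>

// Hypothetical child-stack type from this thread's discussion.
struct ScopedChainableStack {
  void set_zero_adjoints();  // zero the adjoints recorded on this stack
};

// Filled by allocate_chainable_stacks whenever a child stack is created.
extern std::vector<ScopedChainableStack*> child_stack_registry;

void set_zero_all_adjoints_everywhere() {
  stan::math::set_zero_all_adjoints();        // the main thread-local stack
  for (ScopedChainableStack* stack : child_stack_registry)
    stack->set_zero_adjoints();               // then every child stack
}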

Yeah, my reasoning for this was that at first we’ll only have a couple of signatures (map_rect + map_reduce or whatever goes in with this). Eventually everything will presumably get passed to us as some sort of stan_language_closure_functor object, at which point every function will have the same signature.

Yeah but this goes in that direction.

Like, map_rect is awkward because beforehand you chunk up your J job items into G groups.

But really we want to leave all J things to be parallelized however they may be, right? This is compatible with that idea.

So we’d want the allocation of these autodiff stacks to be fast enough that we could allocate one for every task we did.

I’ll assert without evidence that I think this can be fast enough. Though presumably if we allocate each thread a stack at the beginning and it just re-uses it for different tasks it would be faster.

I’ll argue that isn’t necessary :D. But that’s more cause I want a simple MPI implementation.

I think the ticket to making nested parallelism work would be making sure:

stacks_ = allocate_chainable_stacks(J);

is threadsafe.

I think this should be equally okay in either implementation? The difference here is that there’d be more ScopedChainableAutodiffs and no copying autodiff stacks after a task finished. The operation of the ScopedChainableAutodiffs themselves would otherwise stay the same.

We’d have to wait on tasks to finish before we copied autodiff stacks back to the main stack too, so I think we’re nervous about these either way.

Nah nah, good critique.

It’s also compatible with a possible future MPI implementation (where copying things back would not be an option).

I think we can analyze work in the language to some extent. But it’s tricky given that we have a very general purpose language. But we’ll still need to wait on function results before using them, so we can’t just do everything in parallel—we have to find subcomputations that can be done independently.

To me, the biggest deal is that it’s easy to share memory with threading and it’s easy to spread across multiple machines with MPI. On a single machine, multi-threading should be better than multi-process.

Literally making it simpler and easier to understand and review should increase the chances of it getting through. Not necessarily enough, but better than if it’s complex and hard to review.

What blocked it?

That’s also where I’m getting lost: where does the TBB magic stop and the hard work begin?

So these are like the proxy vars I was referring to? I agree we need that if we want to parallelize the reverse sweep and do not do a cut like thing as done in the current map_rect.

Never underestimate copying. The performance of any Stan program depends on the size of the AD tree. It always hurts. This stuff is what will limit the scalability considerably.

There is more. With each stack you allocate you also have some memory pool associated with it. So the recover_memory operation also needs to walk over all sub-stacks… and for recover_memory it’s not too clear to me what it will be doing. Get rid of the children? Not sure what’s best.

The problem with keeping track of the child stacks is that the children can have children themselves. I do not think that it would be good to make the root level special in any way. So I think this means we will end up with a tree of stacks. There should not be a difference between a child stack and a root stack. The root stack is merely the root because it has no parent. Make sense? If you design it like this, then nested parallelism is no problem any more.
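A rough sketch of that tree of stacks (types are purely hypothetical; the per-stack tape and memory arena are elided):

#include <memory>
#include <vector>

// Every stack has a parent pointer (nullptr for the root) and owns its
// children, so the root is not special and operations simply recurse.
struct stack_node {
  stack_node* parent = nullptr;
  std::vector<std::unique_ptr<stack_node>> children;
  // ... the actual tape + memory arena of this stack would live here ...

  void set_zero_adjoints_recursive() {
    // zero this stack's own adjoints here, then recurse into the children
    for (auto& child : children)
      child->set_zero_adjoints_recursive();
  }

  void recover_memory_recursive() {
    children.clear();  // children (and their children) are released first
    // ... then release this stack's own arena ...
  }
};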

I am not saying we have to support nested parallelism for the purpose of encouraging nesting, no. However, the design will be much cleaner and more robust, I think, if nested parallelism is supported “by design”.

Yes!

No, I am relatively sure that this is a false conjecture. What we want to parallelize as well are large vectorized expressions. So in Stan language this:

target += some_heavy_dist_lpdf(big_y | parameter_1, parameter_2);

Ask yourself how you run this in parallel efficiently - that’s the challenge! The prototype I wrote is able to do this.

What you are suggesting with going to many small stacks is to encourage this pattern:

for(i in 1:N_huge)
  target += some_heavy_dist_lpdf(big_y[i] | parameter_1, parameter_2);

but we know that this is a very bad idea, since it creates N_huge many nodes on the AD tree, as opposed to the previous statement. So the trick is to parallelize these statements with as few copies as possible. Anything else will kill your performance. The proposal I brought forward allows the TBB to chunk the huge reduce into arbitrarily big/small chunks - according to load.
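To illustrate just the chunking part: with tbb::parallel_reduce the scheduler picks the block boundaries and each block makes one vectorized call on its slice, so the number of AD nodes stays proportional to the number of chunks rather than N_huge. The sketch below works on plain doubles with a stand-in some_heavy_dist_lpdf; the AD-tape bookkeeping that makes the same pattern safe for var operands is exactly what the RFC adds and is omitted here.

#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <cstddef>
#include <vector>

// Stand-in for a vectorized lpdf evaluated on a slice of the data (values only).
double some_heavy_dist_lpdf(const double* y, std::size_t n,
                            double parameter_1, double parameter_2);

double chunked_reduce(const std::vector<double>& big_y,
                      double parameter_1, double parameter_2) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, big_y.size()),
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        // One vectorized call per chunk; the TBB decides how big r is.
        return partial + some_heavy_dist_lpdf(big_y.data() + r.begin(),
                                              r.end() - r.begin(),
                                              parameter_1, parameter_2);
      },
      [](double a, double b) { return a + b; });
}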

From our discussion so far I take it that you are strongly in favor of scattering the AD tape into sub-tapes and leaving them like this in memory. I am fine with that, but bear in mind the downsides:

  • this is an even bigger change => probably harder to get in
  • whenever the parallelization of this approach is used, the structure of the AD tape will change. So the argument can no longer be made that after parallel regions the AD tape structure is the same as before.
  • a number of assumptions we have lived with so far are gone. For example, looping over the AD tape is no longer straightforward, as the tape is scattered.

My take from the downsides was to live for now with the inefficiency of appending to parents (doing the same for the memory, BTW), but as the upside we still have the same structure after parallel regions are processed. We can still do the tree thing later is my position. Going into sub-divided trees is sensible, but it will be an even bigger shake-up of our AD tape.

Yes, indeed… and parallelizing the forward sweep of reverse mode is a lot easier than parallelizing the reverse sweep. That’s why I would start here.

There is more. With the TBB you even get rid of having to deal with threads. Instead you throw work at a scheduler and harvest results. Much more comfortable! In addition, the TBB is a C++ library, unlike MPI, which comes from the 1990s and was written in C for supercomputers. Moreover, MPI will probably be a good fit for parallelizing chains. The reason is that MPI can easily move around blocks of memory which don’t have any structure other than being contiguous arrays. The parameter vectors on the unconstrained space, as manipulated by the sampler, follow exactly this form. So there it could be doable to go with MPI, since all you need to do is move around blocks of memory while you don’t need to deal with detailed types, data structures and all that.

It took me a while to figure it out… and it apparently takes me really long to percolate that knowledge on. I am not sure if we make it… given that I have to restart from scratch.

I really hope that once @bbbales2 and myself land on the same page we have enough momentum to get this through. It’s starting to really drag at least me.

Hmm, okay. I think I see what you’re getting at here. So if we used TBB and we had 1000 tasks and 4 cores, TBB can split them up over 4 threads and each thread is managing an autodiff stack. I was suggesting splitting up the 1000 tasks into 1000 different stacks.

So the reduce is important to efficiency in this operation, right?

Because the output from your solution (if the gradients are done with forward mode; edit: reverse mode embedded in forward mode) would be 4 vars (one for each thread, each thread reducing a bunch of tasks) that could get added into one var. In the solution I proposed we’d still have 1000 vars that would need to be added, and we’d autodiff reverse mode.

Lemme get that and I’ll think a bit more and respond to the rest of the post. I think it’s still likely we can do everything here with the same ScopedAutodiffStack things (or whatever we call the AD extensions).

Lack of buy in, I assume.

I think that if we collectively want to make something happen, it will be easier than harder. Doesn’t matter how easy (or warranted) it is, if we don’t wanna do it, it will be painful. Not that Sebastian hasn’t tried to generate buy in for this previously, but I just wanna get my critiques out of the way earlier than later (“earlier than later” is kindof an amusing thing to say given how long this has been around, but that is the situation).

And if it’s something I want, then it’s easy for me to jump in for the later debates.

The forward sweep is much, much faster than the reverse sweep because it’s essentially compiled code, whereas the reverse sweep is all pointer-chasing calls to chain() and essentially interpreted code. As we move more toward a lazy adjoint-Jacobian formulation rather than a create-the-Jacobian formulation, more work will get done in the reverse sweep and it’ll also be faster, so I’m not sure how that’ll affect the forward/reverse work ratio.

Instead you throw work at a scheduler and harvest results. Much more comfortable!

Yes, but presumably also less performant than dealing with the scheduling ourselves in a controlled way. Creating objects to put on a scheduler, scheduling and taking them off isn’t free. I’m not an expert in thread scheduling, but my understanding is that it involves the same tradeoffs as we have with memory management. Good garbage collectors are great and make a lot of things possible that are very painful otherwise, but if you have a good way to manage your own memory (like our arenas for autodiff), then you want to use that.


I was thinking about autodiffing the example:

a_i = i, \quad b_i = f(a_i), \quad c = \sum_i b_i

For i in 1 to N. So that means taking in a vector of N things, computing a function of each, and then summing them.

In terms of multiple stacks I think the way to think about this is to add an extra subscript to every variable which indicates which stack they live on. And so a version of this using two stacks (stack 1, and 2) looks like:

a_{i, 1} = i, \quad a_{i, 2} = a_{i, 1}, \quad b_{i, 2} = f(a_{i, 2}), \quad c_2 = \sum_i b_{i, 2}, \quad c_1 = c_2

where I added a couple operations to move things between stacks (in the simple case of non-recursive stacks, I think if these copies are left to the main process it’s all good).

And if we want to do the bulk of the work in the forward pass we can precompute all the \frac{\partial c_2}{\partial a_{i, 2}} terms and copy them into the c_1 vari with pointers to the a_{i, 1} varis since \frac{\partial c_1}{\partial c_2} = 1 and \frac{\partial a_{i, 1}}{\partial a_{i, 2}} = 1.
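Spelled out, the chain rule across the two stack-boundary copies collapses to the inner partial, since each copy contributes a factor of one:

\frac{\partial c_1}{\partial a_{i, 1}} = \frac{\partial c_1}{\partial c_2} \cdot \frac{\partial c_2}{\partial a_{i, 2}} \cdot \frac{\partial a_{i, 2}}{\partial a_{i, 1}} = \frac{\partial c_2}{\partial a_{i, 2}}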

It’s still doing the same amount of work as if we’d done everything in the reverse pass, but the main stack never gets polluted with all the varis from all the different f’s, which, if we had lots of them, could mean the difference between operating in cache or out of cache.

The specifics of how the precomputed partials are copied back to the main stack I think should go in the vari attached to whatever the map_reduce operation looks like. I don’t think that is part of the autodiff stack. Correct me on that if I’m wrong.

What I’m trying to get at here is the ‘append_to_parent_stack()’ operation in the ScopedAutodiffStack. I don’t think this is a good operation to include. I do think we need the ability to grad() a ScopedAutodiffStack though. Does the design you proposed have a grad? I don’t see it in the list but I think it has to be there.

It’s not there for two reasons:

  1. My suggestion was to stay with a single-tape design whenever we are not running in parallel. So at the end of each parallel region we would (during the forward sweep) append things to the parent. This way you would never need to call grad on these sub-stacks, I think. Is that wrong? The other upside of staying with a single AD tape is that we have minimal side effects, since everything basically stays the same (set_zero_adjoint, recover_memory, whatever else).

  2. Even if we say we keep things on separate stacks during the reverse sweep, we designed our library such that you call a global grad / set_zero_adjoint / whatever function which then modifies the global (or by now optionally thread-local) AD stack. I don’t like this decision myself any longer, given that we now have multiple stacks. So it would make sense to move all these methods as members into the AD stacks directly… but again this would be a massive shake-up of all of our internals… so I am not keen on doing that. This may sound like I am a lazy programmer, but lazy here translates into few side effects.

I like the subscripting idea… not sure I follow how you assigned them, as we would split the sum into sub-chunks which end up on different stacks (and the realized sub-division is controlled by the TBB).

EDIT: I have updated the RFC considerably. In essence I try to state more clearly what is needed for the AD tape and why certain simplifications are done. I hope to do one more pass over it before submitting it. It’s a lot of material, I am afraid. Here is where I am with it:

(I hope this is not totally off, but I don’t think so as we appear to agree on key bits of independent AD tapes; everything is up for discussion; my intention is really to get the big fat low hanging fruit of parallel forward sweeps in ASAP… the rest can follow)