Parallel autodiff v2

If we aren’t evaluating Jacobians in forward mode or parallelizing reverse mode, this won’t scale for general-purpose functions. Forward mode will be faster, but reverse mode will still be single-threaded and slow.

lpdfs will be a bit misleading because we’re computing the Jacobians manually in forward mode with the operands_and_partials stuff.

I think this makes sense still. Within one process, there is only ever one sampler running one model.

The separate AD stacks we’re making are subservient to the main one, for sure. But I don’t think we really need for them to be standalone.

And the single primary-stack design makes things like the global grad/set_zero_adjoint stuff doable.

Cool cool, couple comments:

- independent chunks of work can be reordered (no more unique topological sort of operations)

An alternative approach is to allow the AD tape to grow in many independent sub-trees. So after each parallel phase the chunks are not merged together to form a single AD tape. These sub-trees are then linked together in a tree-like structure which must allow for traversal in reverse order, as needed during the reverse sweep.

The simpler version of parallel reverse mode doesn’t change how varis on the primary stack need to be organized.

Each of our parallel calls, parallel_reduce, parallel_map, whatever, has a vari on the primary autodiff stack. These varis can manage a bunch of sub-stacks that are allocated in the forward pass and then accessed again in reverse mode, and these varis can manage the specific parallelism however they like.
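Rough sketch of what I mean (`sub_stack_t` and `parallel_map_vari` are made-up names for illustration, not real stan-math types): the primary stack only ever sees one vari, and that vari owns the sub-stacks built during the forward pass.

```cpp
#include <stan/math.hpp>
#include <vector>

struct sub_stack_t {
  // varis allocated by one parallel task during the forward pass
  std::vector<stan::math::vari*> varis_;
};

struct parallel_map_vari : public stan::math::vari {
  // in real code these containers would need arena-friendly storage
  std::vector<sub_stack_t*> sub_stacks_;

  explicit parallel_map_vari(double value) : stan::math::vari(value) {}

  void chain() {
    // This vari decides how (and whether) to parallelize its own reverse
    // sweep; shown serially here. The primary stack's ordering is untouched.
    for (auto* stack : sub_stacks_) {
      for (auto it = stack->varis_.rbegin(); it != stack->varis_.rend(); ++it) {
        (*it)->chain();
      }
    }
  }
};
```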

(all global operations like `set_zero_adjoint`, `recover_memory`, … all need a rewrite)

This is true, but the primary stack just needs to maintain a list of sub-stacks. If parallel tasks are limited to not launching other tasks, only the primary stack allocates sub-stacks, so it’s easy to track them. No need for it to be that exotic.

We can only explicitly compute Jacobians in the forward pass of reverse mode when there’s a single output. We wouldn’t want to do this for multivariate output functions.

I would think we’d want to use forward mode to evaluate Jacobians in parallel, with one thread per input variable. The real limitation is that forward mode doesn’t handle our ODE solver or other solvers.
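A serial illustration of that Jacobian idea (the helper name `fwd_jacobian` is hypothetical): each pass of the outer loop seeds one input direction and fills one column of the Jacobian, so the columns are independent and could be handed to separate threads.

```cpp
#include <stan/math.hpp>
#include <Eigen/Dense>

template <typename F>
Eigen::MatrixXd fwd_jacobian(const F& f, const Eigen::VectorXd& x, int n_out) {
  using stan::math::fvar;
  Eigen::MatrixXd J(n_out, x.size());
  for (int j = 0; j < x.size(); ++j) {  // candidate unit of parallelism
    Eigen::Matrix<fvar<double>, Eigen::Dynamic, 1> x_fd(x.size());
    for (int i = 0; i < x.size(); ++i) {
      x_fd(i) = fvar<double>(x(i), i == j ? 1.0 : 0.0);  // seed direction j
    }
    Eigen::Matrix<fvar<double>, Eigen::Dynamic, 1> y = f(x_fd);
    for (int k = 0; k < n_out; ++k) {
      J(k, j) = y(k).tangent();  // d y_k / d x_j
    }
  }
  return J;
}
```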

Whatever we do, I hope this doesn’t negatively impact single-threaded AD performance in any other way than a small constant penalty per log density eval.

What I am referring to here is that the independent chunks can be reordered. We need this to use the Intel TBB task scheduler. The scheduler does not give us any guarantees as to what chunk sizes are run, nor do we have control over the order of execution. You can, of course, constrain the TBB more if we think this is important, but that obviously comes at the cost of performance.

I think this is more complicated to do. First, I do think that it would not be a good design to have a main / sub-stack design… what would you do for nested parallelism? I am not advocating the use of nested parallelism, but allowing for it will make things a lot more flexible and composable, which is a big thing, I think. To me, we should create something where a stack has a parent stack and the main stack simply has no parent. The main stack should not be special compared to the child stacks, and the design should allow for nesting.
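Something like this is all I mean (hypothetical types, just to illustrate the parent-pointer idea): the root stack is the one whose parent is null, nothing else about it is special, and nesting is just more levels of the same structure.

```cpp
#include <stan/math.hpp>
#include <vector>

struct ad_stack {
  ad_stack* parent_;                      // nullptr for the root stack
  std::vector<stan::math::vari*> varis_;  // varis allocated on this stack

  explicit ad_stack(ad_stack* parent = nullptr) : parent_(parent) {}

  bool is_root() const { return parent_ == nullptr; }
};
```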

True, true… but the ultimate goal in any Stan program is the log-likelihood, which is a single value. So as long as you stuff enough into the to-be-parallelized function, it will eventually produce just a single number.

The ODE solvers do all their work in the forward sweep. See the stan-math wiki for an idea of how to automatically parallelize the adjoint work of the ODE solvers (separate topic).

No, it is not going to affect any performance. The AD tape structure will remain unchanged if you do not use the parallel functions. In case you use the new parallel functions and we go with the design I propose, then we will always only have a single global AD tape after you leave the parallel functions, as we merge the different sub-stacks together. That’s currently under discussion.

Right. The clarification was that this isn’t a requirement on the varis in the autodiff stack itself.

The grad call in the new autodiff stack doesn’t need to read the varis in the chainable stack out of order. Those are still ordered in the same way.

We could allow parallel maps inside parallel maps at a language level without actually implementing that functionality. This could be done with some sort of “if I’m not the main thread, be serial, otherwise be parallel” sorta logic.

For threading, I think we can just make stack allocation threadsafe and then nested parallelism works.
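A sketch of that “be serial unless I’m the main thread” dispatch, assuming we record the main thread’s id at startup (not a real stan-math facility):

```cpp
#include <tbb/parallel_for.h>
#include <thread>

static const std::thread::id main_thread_id = std::this_thread::get_id();

template <typename F>
void maybe_parallel_for(int n, const F& f) {
  if (std::this_thread::get_id() != main_thread_id) {
    for (int i = 0; i < n; ++i) f(i);  // nested call: stay serial
  } else {
    tbb::parallel_for(0, n, f);        // top-level call: go parallel
  }
}
```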

We could go this route. I am just hesitant to bake it in as a requirement at this stage. Maybe it will just end up this way in implementation. The main stack assumption seems convenient to me in terms of memory management, but I could be wrong.

What bugs me about this is we’re either making weird recommendations to the user (don’t return more than a scalar from your parallel functions) or putting weird constraints on our programming constructs (don’t allow functions computed in parallel to return anything other than a scalar) when we can avoid both of these things by implementing reverse mode autodiff parallelism.

The MPI map_rect does work this way, detecting whether you are already inside a map_rect call or not. This works with global locks… nasty, nasty… I would not do it again.

Having worked with the code, I do think that this is the right way to not discriminate “main” vs child. There is nothing special to me about the “main” AD stack other than that it has no parent.

Memory management should always stay with the AD tape within a thread. This way you respect memory locality in a built-in way. Memory locality is a bit more tricky with parallelism as it will depend on core affinity as well.

This topic is hard! This is why we should provide a parallel map which gives maximal flexibility to users at the cost of some left-out optimizations, since the output is an arbitrary data structure rather than a scalar… and in addition we code up a parallel reduce which targets the main thing we do in Stan programs: summing over a huge number of independent scalars.
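To make the split concrete, the signatures could look roughly like this (purely hypothetical shapes, not the actual interfaces under discussion): parallel_map leaves the per-element output type general, parallel_reduce commits to a single scalar result.

```cpp
#include <stan/math.hpp>
#include <vector>

// Apply f to each element; the result type T can be an arbitrary data
// structure (e.g. the states of an ODE solve).
template <typename T, typename F, typename Arg>
std::vector<T> parallel_map(const F& f, const std::vector<Arg>& args);

// Sum f over chunks of the index range [start, end); the result is always a
// single scalar, which is the common case for a Stan log density.
template <typename F>
stan::math::var parallel_reduce(const F& f, int start, int end);
```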

I just filed the v2 RFC as a PR to the design repo. The old one has now been around for 8 months and we merely got the TBB in… let’s cross fingers this one is somewhat faster…

Alright, maybe there doesn’t need to be anything special about each stack other than how it’s used. It is a fair design goal.

I disagree :D. The solution for this is parallel reverse mode, which is at least as do-able as everything else.

In my understanding (tell me if I’m screwing this up), our disagreement is between:

  1. copying varis allocated in different stacks back to the main stack, vs.
  2. leaving varis in whichever stack they were allocated and making it possible to call grad() on these different stacks (necessitating copying adjoints into and out of these stacks)

The second has the advantages of allowing for reverse-mode parallelism and being something that we can do with MPI. The first doesn’t make any sense in MPI-land because pointers aren’t something we can ship around (adjoints are, though).

It’s not flexibility; it’s a footgun! We shouldn’t provide a way to map multiple output functions if we know that the way we’re computing these things is going to be inefficient.

From the Gelblogg today, “Fitting big multilevel regressions in Stan?” (Statistical Modeling, Causal Inference, and Social Science) – the way these things get communicated is “I dunno – give it a go.” The caveats don’t get carried along, fairly or not.

If we end up going with a forward-only thing, we should only allow for functions with scalar outputs. Parallel map is still fine as long as we do the Jacobians with nested autodiff.

I don’t think that the parallel reverse mode is easy to do. It does require you to know exactly what operands go into the function we evaluate. I do not require this at the moment. We can design things such that we can do this at a later stage, I think. Tracking all operands is really hard to code up, I think.

Yes, I would want to avoid this. Leaving things in the sub-stacks would have mega side-effects and would be an even bigger refactor than what we already do here. I am very much afraid of all the additional side-effects and all the additional work this requires. We should not shake it up too much. This can be done at a later stage.

Hmm… maybe this shows a misunderstanding. So I want to create the sub-stacks… but these will be linked in memory via pointers to the main stack automatically, since I would simply stick the vars from the main stack into the function which we evaluate within the sub-stack. This way I do not need to worry about shipping anything back and forth. That’s, to me, super elegant. Right now nested AD is an AD operation without any ties to the outside world… but what I propose is an independent AD where the AD (sub-)stack is a new stack with links to the outside world, which happens to be the main stack.
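A tiny sketch of what that linking means at the vari level (simplified, hypothetical type): a vari allocated on a sub-stack keeps an ordinary pointer to an operand vari that lives on the main stack, so its chain() writes into the main stack’s adjoints directly and nothing needs to be shipped back.

```cpp
#include <stan/math.hpp>

struct times_two_vari : public stan::math::vari {
  stan::math::vari* operand_;  // may have been allocated on the main stack

  explicit times_two_vari(stan::math::vari* operand)
      : stan::math::vari(2.0 * operand->val_), operand_(operand) {}

  void chain() { operand_->adj_ += 2.0 * adj_; }
};
```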

I don’t quite see the need to call grad only on the sub-stacks. We only want to call grad on the entire stack which happens at the very end when everything is merged together again to a big stack.

Again, managing multiple stacks for the entire life-time of the AD tape is a mega deviation from what we do now and will have huge side-effects. I don’t want to go there (now). It can be done, sure, but it can be done later.

I do not target MPI! I am very explicit about targeting the Intel TBB as a workhorse only - otherwise we will get stuck quickly, and with the TBB a lot of crap is taken care of for us, like managing threads, scheduling and all that - I do not want to reinvent the wheel for that again. I don’t mind leaving things such that at a later stage you can come up with an MPI thing here (it will be a pain to program and very likely come with a lot of restrictions, would be my prediction). I have introduced MPI to stan-math for good reasons, I have coded a lot with it, and I would not do it again. It’s still great to see people using 700 cores for a single chain, though.

No - that’s not correct. It is possibly not ideal for a number of Stan programs, yes… but any hierarchical ODE model or something similar will benefit greatly from a parallel_map. All the work for ODEs is done in the forward sweep.

I don’t think we should restrict the output to scalars - no. Solving ODEs in parallel is an important enough application, I think. Doing the parallel map based on a nested autodiff approach is a solution, yes. However, this will require knowledge of all the input operands, which I did not want to be obligated to provide just yet. We can go there at a later stage, I think.

By all means - we should harvest the big low-hanging fruits first (without designing ourselves into no-man’s-land, of course).

Everything is explicitly typed in the Stan language, so this requirement is enforced from that level.

When we move to closures I’d expect the objects exposed to us at the C++ level to have a fixed interface (though I don’t know about this – maybe @Bob_Carpenter has an idea).

But if we allow vars which contain varis allocated on the main stack to feed into the substacks, that means we’ll end up with varis on substacks that point to varis allocated on the main stack.

Which means we’ve made the assumption that varis can contain pointers to varis allocated on any stack.

The linking that we’re getting with this is between varis (a vari allocated anywhere can reference another vari allocated anywhere).

I don’t think the proposal actually links autodiff stacks together. As I understand it, the parallel_map call would be responsible for copying all the varis allocated on the substack back into the main stack before it finishes.

Our grad call remains unchanged in the proposed design and is just a big for loop through the chainable stack. So anything not copied back gets missed.
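For reference, that loop is conceptually just this (simplified sketch; the real code lives in the chainable stack machinery, and `var_stack` here stands in for the global vari stack):

```cpp
#include <stan/math.hpp>
#include <vector>

inline void grad_sketch(stan::math::vari* output,
                        std::vector<stan::math::vari*>& var_stack) {
  output->init_dependent();  // seed d(output)/d(output) = 1
  for (auto it = var_stack.rbegin(); it != var_stack.rend(); ++it) {
    (*it)->chain();          // accumulate adjoints in reverse order
  }
}
```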

Well even if the need isn’t clear, it would be an easy operation – just a loop through the chainable varis on that stack.

The assumption that varis allocated in any stack can contain pointer references to varis allocated on any other stack is the thing that could have big side effects in my mind.

It’s like allocating in one place and freeing in another.

If we moved to parallel reverse mode then we’d have to deal with race conditions on the adjoint accumulation.
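Illustration of where the race would sit (hypothetical vari): if two varis share an operand and their chain() calls run on different threads, the unsynchronized `+=` on the shared adjoint is a data race unless it is made atomic or otherwise serialized.

```cpp
#include <stan/math.hpp>

struct add_vari : public stan::math::vari {
  stan::math::vari* a_;
  stan::math::vari* b_;

  add_vari(stan::math::vari* a, stan::math::vari* b)
      : stan::math::vari(a->val_ + b->val_), a_(a), b_(b) {}

  void chain() {
    a_->adj_ += adj_;  // read-modify-write on possibly shared adjoints
    b_->adj_ += adj_;
  }
};
```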

Yeah, I get that. I want whatever changes happen to the autodiff to be future-proof. I’m exaggerating a little here, but the bare minimum short list to be future-proof is basically (a rough sketch follows the list):

  1. Substacks that can be grad’d independently of each other
  2. zero_all_adjoints() and recover_memory() to zero all adjoints and recover memory on all the stacks allocated in the autodiff system, not just the main one
  3. In the new parallel_* varis, copy vals/adjoints between substacks.
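Rough sketch of what that could look like, with a made-up `stack_registry` type: every stack registers itself, the global operations loop over all registered stacks, and each sub-stack can be swept independently.

```cpp
#include <stan/math.hpp>
#include <vector>

struct stack_registry {
  // main stack plus any sub-stacks currently alive
  std::vector<std::vector<stan::math::vari*>*> stacks_;

  void zero_all_adjoints() {
    for (auto* stack : stacks_)
      for (auto* v : *stack) v->set_zero_adjoint();
  }

  // Reverse sweep over one sub-stack, independent of the others (item 1).
  void grad(std::vector<stan::math::vari*>& stack) {
    for (auto it = stack.rbegin(); it != stack.rend(); ++it) (*it)->chain();
  }
};
```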

The autodiff architecture isn’t something that we patch on a weekly basis, and I think that however reversible decisions look now, they’ll get baked into permanence pretty quickly.

Let’s push people for the single output lpdf designs. Anyone using parallel stuff is going to be concerned about performance. We should make sure when they’re using parallel stuff we aren’t introducing any hidden inefficiencies. The lpdf, single-output sorta design isn’t that limiting.

Well… not really. You are right in that we should know, but the design I had in mind would have allowed passing in a fat closure which has lots of surprises in it. By that I mean lots of data structures which contain a var (even private members, if you want). The crux with these is that you have no control over what happens on the AD tree. This is why things end up being inter-linked without us having enough control over it to do a controlled sweep over it.

Yes. That’s what I wanted.

At the end of the parallel_map we copy things back together, with the intention to minimize side-effects. I do think that this is a very fair design goal. I know this copying is not ideal.

True, a grad call is easy.

That’s why I copy.

And that’s why I don’t go parallel on the reverse sweep, yes.

Ok, we can concentrate on that. That’s fair… as long as it can be used easily and without the clunky packing needed for map_rect.

Anyway… I figured that you insist anyway on the possibility to do reverse mode in parallel. That enforces that we track exactly what operands go into the function we evaluate. If we take this a few steps further, then this stuff can actually already come alive on the back of our current AD (possibly, not 100% sure yet). Given that

  • the functor you call has no internal state so that there are no hidden vars (or a closure object which has only data in it)
  • we know the exact signature of all the inputs
  • we do nested AD where we sub-slice from a large data-structure and we copy all the common vars into every sub-context to make things independent.

then we can do nested AD with each chunk of work and that’s it. We calculate the values and the gradients in a nested AD sweep (so forward and backward are done on the fly) and finally assemble the final var on the main stack. Nested parallelism is still a concern, but I think that the so-called thread isolation technique of the TBB can rescue us here (that’s somewhere under the task_arena docs).
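A minimal sketch of that per-chunk recipe using the existing nested-autodiff API (the chunk functor `f`, the `shared` operands, and the helper name `reduce_chunk` are placeholders; the chunking and the TBB dispatch are omitted):

```cpp
#include <stan/math.hpp>
#include <vector>

template <typename F>
stan::math::var reduce_chunk(const F& f,
                             const std::vector<stan::math::var>& shared) {
  using stan::math::var;
  double value = 0;
  std::vector<double> grads(shared.size());

  stan::math::start_nested();
  try {
    // Copy the shared operands so this chunk's tape is independent of the
    // main stack.
    std::vector<var> local(shared.size());
    for (size_t i = 0; i < shared.size(); ++i)
      local[i] = var(shared[i].val());

    var fx = f(local);  // forward sweep on the nested tape
    fx.grad();          // reverse sweep, still inside this chunk
    value = fx.val();
    for (size_t i = 0; i < shared.size(); ++i)
      grads[i] = local[i].adj();
  } catch (...) {
    stan::math::recover_memory_nested();
    throw;
  }
  stan::math::recover_memory_nested();

  // Assemble one var on the main stack carrying the chunk's value and its
  // gradient with respect to the original shared operands.
  return stan::math::precomputed_gradients(value, shared, grads);
}
```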

So before we proceed with the parallel v2 thing, we should probably benchmark how well that works. Maybe that copying is not so bad after all. The only really ugly thing is that we need to fight with C++ tuples, which I really do not like to handle (in particular if you need to call functions with them). So my suggestion is to hard-code an example without any generality of tuples. This should be done for a simple Poisson model and a hierarchical Poisson model. If we get decent speedups there and we can confirm that nested parallelism works… then we can forget the AD system refactor and just work with what we have. It would not be ideal, but we can pull it off with the current AD system (I am really not yet sure… in the past I thought I would have to refactor it for sure, but maybe with task isolation we can do it). I have started that work here (there is also a unit test on that branch firing up this code):

https://github.com/stan-dev/math/blob/proto-parallel-v3/stan/math/rev/scal/functor/parallel_reduce_sum.hpp

Adding a link to v3: Parallel autodiff v3