Design doc on parallel autodiff from Sebastian

Hey all,

@wds15 has a design doc up documenting his proposal for parallel for and parallel reduce using TBB and a refactored, independent autodiff stack mechanism. Please put any comments on the specific proposal on the GitHub PR; this thread is just announcing it and explaining a little about the process we’re trying out.

Here’s the intro text:

This RFC will enhance parallelization facilities in stan-math.

At its core it implements:

  • parallel for with automatic sharding
  • parallel reduce with automatic sharding and vectorization

To get there, a few things need to be refactored or introduced:

  • independent AD
  • refactored nested AD
  • Intel TBB is proposed to implement the parallel algorithms (see the illustrative sketch below)
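
For a flavor of what those two building blocks look like, here is a minimal, illustrative sketch of plain TBB usage (not the proposal’s API; the function body and data are made up). TBB’s `blocked_range` supplies the chunking the doc calls automatic sharding, and the scheduler distributes the chunks across threads:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>

#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

int main() {
  std::vector<double> x(1000, 1.5), y(1000);

  // Parallel for: TBB splits the blocked_range into chunks and applies
  // the lambda to each chunk, possibly on different threads.
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, x.size()),
                    [&](const tbb::blocked_range<std::size_t>& r) {
                      for (std::size_t i = r.begin(); i != r.end(); ++i)
                        y[i] = std::lgamma(x[i]);
                    });

  // Parallel reduce: each chunk accumulates a partial sum; the partials
  // are then combined pairwise with the join operation (std::plus here).
  double total = tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, y.size()), 0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += y[i];
        return partial;
      },
      std::plus<double>());

  return total > 0 ? 0 : 1;
}
```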

We’re trying out the design doc process that we copied and slightly adapted from the Rust community’s RFC process. The basic idea is that we’d like to get some of the more complicated designs formalized, and get a set of basic questions answered about each of them, before we start writing code. This is an opportunity for the entire community to comment and try to make the proposed design better. We’ll give it a minimum of about a week to air publicly and reach a decision within a few weeks at most (that decision could take a variety of forms, including postponement, but we’ll seek some action and closure on the issue in that time period). Please see the design docs repo README (and the original Rust one if curious) for more details.

@syclik, do you have time to be the reviewer for this one? We definitely need to get your comments if possible, as this is a change to the math library that’s more involved than just adding new functions and classes.

Thanks all!


Yup. What am I supposed to do in the review?

The big question to answer is whether you agree with the concepts introduced (independent AD, refactored nested AD, using the TBB as a workhorse). It may also help to take a short look at the template to see what information I was aiming to fill into the sections, as guided by the process. The document is hopefully easy to follow; if not, comment on that first so that we can resolve it. Ultimately, this document describes where we are heading design-wise to get things working in math, and here we need alignment.

As writing code is how I think, I have a prototype of things up on stan-math. It’s not meant to become the implementation we end up with, but you can glance at it there if you think it helps to fill holes in what I described (parallel-ad-tape-v3; maybe start with the for_each test … which is not super clean, but hopefully still helpful).

Yeah, the hope with this process is that the community can all comment line-by-line (as there are a few topics in the proposal that all work together) and then the reviewer can do something very similar to a code review, but for the high-level ideas. So probably fewer style comments and more polishing of the ideas and architecture, plus asking for enough clarification that it’s clear what’s being agreed to. Here’s an example of a Rust RFC with a bunch of comments that was eventually approved and thus merged in: https://github.com/rust-lang/rfcs/pull/2645. Once a design is approved, anyone working on its implementation can be confident they’re doing work along the right track, because the design has been reviewed at the high level.

I guess this work incorporates “A Geometric Theory of Higher-Order Automatic Differentiation” (https://arxiv.org/abs/1812.11592)?

No, I did not dive into that. I probably should… but in this resource-limited world I have not found the time yet. And BTW, if someone wants to give a hand and implement stuff, there is a lot to do and help is very welcome.


Can you say more about how you’re thinking that paper would influence the memory layout and nested API of autodiff?

I just got the superficial, vague understanding from the paper that some of the calculations for higher-order autodiff are redundant and can be factored out. It seemed relevant if you’re trying to make things faster.

Ah… sure, if certain continuity conditions are met, then that is key to take advantage of… but here I am shooting for getting plain reverse mode running in parallel. That will be useful for all models we can run in Stan right now (which only need gradients), and it will also be useful for that higher-order world (should we ever get there).

Loving all things parallel and Stan-related, I would like to help out. If you have any specific stuff that needs to be done, assign/tag/PM me.


Same!

  The suggested building blocks per se only parallelize the *forward* sweep of reverse mode AD, not the reverse sweep. Thus, operations which involve `adj_jac_apply`, for example, are not sped up at all by this. However, this can easily be resolved by introducing a general AD tree compression facility which works just as `map_rect` does. Such a facility needs to know all the operands; it can then do nested AD, perform on-the-spot gradient evaluation of the function of interest, and store the results as `operands_and_partials`.
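
To make that compression idea concrete, here is a rough sketch using Stan Math’s existing nested AD. For simplicity it stores the result via `precomputed_gradients` rather than the `operands_and_partials` builder mentioned above, and the `compressed_call` helper name is made up, not part of the proposal:

```cpp
#include <stan/math.hpp>

#include <vector>

// Hypothetical helper: evaluate f and its gradient in a nested AD sweep,
// then collapse the whole sub-tree into a single node on the outer tape.
template <typename F>
stan::math::var compressed_call(const F& f,
                                const std::vector<stan::math::var>& operands) {
  using stan::math::var;
  stan::math::start_nested();
  try {
    // Fresh leaves on the nested tape, seeded with the operand values,
    // so the nested reverse sweep never writes to the outer operands.
    std::vector<var> local;
    local.reserve(operands.size());
    for (const var& v : operands)
      local.emplace_back(v.val());

    var fx = f(local);
    fx.grad();  // on-the-spot reverse sweep over the nested sub-tape

    double val = fx.val();
    std::vector<double> grads;
    grads.reserve(local.size());
    for (const var& v : local)
      grads.push_back(v.adj());

    stan::math::recover_memory_nested();
    // One compressed node: value plus partials w.r.t. the outer operands.
    return stan::math::precomputed_gradients(val, operands, grads);
  } catch (const std::exception&) {
    stan::math::recover_memory_nested();
    throw;
  }
}
```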

What’s the difficulty of adding mechanisms to parallelize the reverse sweep? Is there any way that this refactor might make it easier to do that reverse sweep? Does this refactor in any way hamper our ability to do the parallel reverse sweep? Or is that tangential?

I’m a fan of the reverse sweep (except that time the Warriors lost).

Hey! Please move this discussion to the PR, thanks!

@rok_cesnovar & @stevebronder, much appreciated! For the moment, the design PR is what needs to get through, so going over it and commenting helps the most. Doing so will sync all our brains on what is to come.

Once the design is agreed on, I am happy to draft a working plan which we can then use to set up planned issues and PRs. At that point we can work along that plan.


@rok_cesnovar… one issue to take on for parallelism is this one: https://github.com/stan-dev/math/issues/1250

It is about reverting the commit mentioned there in order to go back to the lgamma from Boost, since the std one introduces global locking, which destroys any parallel performance… including map_rect.

The crux is that the std version seems faster, but it comes with no guarantee on output precision. In addition, you may have to deal with borderline cases… but if you could handle that one, that would help.
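
For reference, a minimal sketch of the Boost alternative (the wrapper name is made up): unlike `std::lgamma`, which on some platforms shares global state (e.g. the POSIX `signgam` variable) and, as noted above, can introduce locking, `boost::math::lgamma` is reentrant:

```cpp
#include <boost/math/special_functions/gamma.hpp>

// Thread-safe log-gamma: boost::math::lgamma keeps no shared global
// state, so parallel callers never serialize on a lock.
inline double lgamma_reentrant(double x) {
  return boost::math::lgamma(x);
}
```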

Sure. Will look into that. Assigning myself.