(older) parallel AD tape ideas

Haha who knew c++ could be so wholesome


Uhm… I was curious how well these ideas can work, so I started on this. I took reduce_sum and added thread-local copies of all the input operands on a "fresh" instance of the AD tape. The benefit is that we no longer copy the shared input operands once per work chunk, but only once per thread we use. Hence, the slowdown due to more chunks gets smaller.
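To illustrate the per-thread copy idea outside of Stan Math (a stripped-down sketch with plain doubles, not the actual reduce_sum code): with TBB, an enumerable_thread_specific holds one lazily-created copy per worker thread, so the copy cost scales with the number of threads rather than the number of chunks. The std::vector<double> below just stands in for the shared operands.

#include <tbb/blocked_range.h>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_reduce.h>

#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

int main() {
  const std::vector<double> shared(1000, 1.0);

  // One deep copy per thread, created lazily on first use - not one per chunk.
  tbb::enumerable_thread_specific<std::vector<double>> local_shared(
      [&] { return shared; });

  double total = tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, shared.size()), 0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        // Chunks scheduled on the same thread reuse that thread's copy.
        const std::vector<double>& my_copy = local_shared.local();
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
          partial += my_copy[i];
        }
        return partial;
      },
      std::plus<double>());

  std::cout << total << "\n";
  return 0;
}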

I took the brms threading vignette and ran it with the new stuff and with the current implementation.

Current slowdown:

Improved version:

So you can see that the overhead from using more chunks is smaller.

Looking at the speedups we get, the improvement is noticeable - I think - but still a bit noisy here.

Current speedups:

New speedups:

I think this shows that the approach has merit. We do reduce the overhead from finer chunking - and that's good news, since users would no longer need to worry so much about the grainsize.

The code changes are actually relatively small:

It's the branch feature-fast-parallel-ad

EDIT: the above speedup graphs used 25 iterations. This was seemingly too few. Here are the same results with 50 iterations.

Current version:

New version:

So the benefits we get from the new implementation are very clear… I want this… ;)


Plot the ratios of the time of the new code vs. the old code. Trying to figure out ratios between two separate plots by eye is too hard lol. It does look like the new stuff is faster.

But yeah open a draft pull request so we can put comments on the code.

Are you not convinced? It's so obvious to me. The 10% overhead mark is clearly crossed at 8 chunks for the old code. The new one only crosses it between 16-32 chunks - much later. The total runtime is also lower for the new code with up to 4 cores.

I will open a draft PR, sure.

I think I put my sentences in the wrong order or something and gave the wrong impression. I agree it looks faster in an important way, but I'm having trouble telling how much faster, and I would like to know that (and how it changes with different grainsizes, like you showed, and with the number of cores).

I read off the first one that the red line was like a little above 15 in the new code and a little above 19 in the old code, and I can kinda compute that ratio in my head (and it's big), but then I have to do it everywhere else for all the other grainsizes yada yada and it's hard. Anyway I wasn't trying to be mean! My apologies.

No worries, I did not perceive anything as mean. I was just surprised… after all, I am the one claiming something here to be better, but I was lazy and just re-ran the report I had already written. Though I did make the scales the same, which they were not before.

Now, for the chunking slowdown, plotting ratios of old vs. new would result in a ratio of a ratio… which is also odd, but maybe it's what we want.


I personally like seeing the ratio. Though it's less interpretable than actual times, I think it makes sense when we only care about comparing one thing vs. another.

Looks neat! Changes are tiny so I think this makes sense to add.

I'm still trying to think through how we would top this off by doing the reverse pass in parallel as well. It might be easier to talk about this in a draft PR.

@wds15 what I'm having trouble thinking about is how to take this piece from reduce_sum and do it in parallel:

// initialize nested
const nested_rev_autodiff begin_nest;

// make local copies here...

// Perform calculation
var sub_sum_v = apply(
    [&](auto&&... args) {
      return ReduceFunction()(local_sub_slice, r.begin(), r.end() - 1,
                              msgs_, args...);
    },
    args_tuple_local);

// Accumulate value of reduce_sum
sum_ += sub_sum_v.val();

// This part needs to be in the reverse pass?

// Compute Jacobian
sub_sum_v.grad();

// Accumulate adjoints of sliced_arguments
accumulate_adjoints(sliced_partials_ + r.begin() * num_vars_per_term_,
                    std::move(local_sub_slice));

// Accumulate adjoints of shared_arguments
apply(
    [&](auto&&... args) {
      accumulate_adjoints(args_adjoints_.data(),
                          std::forward<decltype(args)>(args)...);
    },
    args_tuple_local);

The first thing I'm thinking about is how we call just the .chain() methods from sub_sum_v back to the first var allocated within this nested instance. Can we just keep track of the starting position of the nested autodiff in this instance of the chainable stack and then have a grad function for each chainable stack, like

ChainableStack.instance_.grad(start_ptr, end_ptr);
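Maybe something like this (a hypothetical helper, not a function Stan Math currently has; it mirrors what grad() already does - walk the vari stack in reverse and call chain() - but stops at a recorded start position):

#include <stan/math/rev/core.hpp>

#include <cstddef>
#include <vector>

// Hypothetical partial reverse sweep: chain() every vari recorded from the
// end of the stack back down to `start`, i.e. the stack size at the point
// where the nested/scoped region began.
inline void grad_range(std::vector<stan::math::vari*>& var_stack,
                       std::size_t start) {
  for (std::size_t i = var_stack.size(); i > start; --i) {
    var_stack[i - 1]->chain();
  }
}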

The nested part of this makes my brain hurt haha

What I have done so far only avoids the extra overhead from doing deep copies per work chunk. The recipe for that is to do the deep copies of the shared stuff only once per thread, onto separate AD tapes. This lays the groundwork for doing the reverse sweep in parallel instead of running many nested sweeps, one per work item.

My suggestion is to include this stuff first and then extend it to handle full parallel forward & reverse sweeps.

Yes that totally makes sense to me

Great.

So… I still don't like ratios of ratios, as these are too misleading to me. Maybe that's only me, but here is a way of plotting the info in overlaid form which makes it clear as well:

Runtimes of the chunking experiment:

Slowdown of chunking experiment (relative to 1 core within each curve):

Speedup comparison:

ā€œdevā€ is the new stuff and ā€œrcā€ the current 2.25.0 rc.


Looks good!

The ratios plotted here are runtime_parallel_code_A / runtime_serial vs runtime_parallel_code_B / runtime_serial. Those are good for understanding what our parallel efficiency is (and were good when we had no other parallel code to compare to).

Separately we could look at runtime_parallel_code_A / runtime_parallel_code_B to know exactly how much faster new parallel is vs. old parallel (which is relevant now since we're replacing one parallel code with another).

Sure. I did plot the runtime for version A & B on top of each other in the last "speedup comparison" (using different colours). You see that the new code is quite a bit better for the smaller grainsizes while the old code is seemingly a bit faster for a large grainsize, but not much. A ratio should tell the same story.

I'm cool with the graphs you made. The original ones were a bit hard to compare but having stuff side by side makes sense.

What are iterations in the graph though? Like literally running 25 iterations of a model?

It's 50 iterations of the mixed Poisson model used in the brms threading vignette.

How should we proceed with the draft PR? Do we need a design doc here, or is the change modest in your view?

I'm fine with the heuristic of using iterations here, though I worry that speed will change a lot through warmup. In the future it may be better to make this its own perf test with gtest.

I think it's a modest change and doesn't need a design doc itself. Though personally I'd appreciate a rough sketch of how you want to use this to move to a parallel reverse pass. Nothing too technical, just a layout. Looking it over, it wasn't clear to me how this would help with that (and so "reverse is a whole separate thing" is also a fine answer).

How come? This is all about how many leapfrog steps are done.

Great. I will add some tests then, and let's wait for @bbbales2 to have a look.

The plan is as follows: right now we do nested AD for every work chunk and return things via precomputed_gradient. When doing the reverse sweep in parallel as well, things will run as

  1. Run the parallel forward sweep where we merely record the AD tape onto the scoped chainable stacks, but do not run the per-chunk nested sweeps. Thus, we end up with one scoped chainable stack per thread once the forward pass exits reduce_sum.
  2. reduce_sum returns a "super" vari which internally stores the different scoped chainable stacks.
  3. On the reverse pass this "super" vari runs grad in parallel on the different scoped chainable stacks.

Right now I don't think this will be hard to do… until I hit some roadblock…
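As a very rough sketch of steps 2 and 3 (hypothetical names throughout - scoped_stack stands in for whatever the per-thread scoped chainable stack type ends up being, its grad() for however a reverse sweep gets triggered on one tape, and reduce_sum_vari is not the actual class):

#include <tbb/parallel_for.h>

#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for a per-thread scoped AD tape; grad() would run chain() over
// the varis recorded on that tape (hypothetical, not Stan Math's API).
struct scoped_stack {
  void grad() { /* reverse sweep over this tape's varis */ }
};

// The "super" vari returned by reduce_sum would derive from vari; the base
// class is omitted here to keep the sketch self-contained.
class reduce_sum_vari {
 public:
  explicit reduce_sum_vari(std::vector<scoped_stack*> stacks)
      : stacks_(std::move(stacks)) {}

  // Called during the global reverse sweep: run the reverse sweep of each
  // per-thread scoped stack in parallel. The tapes are independent, so
  // their chain() calls do not touch each other's varis.
  void chain() {
    tbb::parallel_for(std::size_t{0}, stacks_.size(),
                      [this](std::size_t i) { stacks_[i]->grad(); });
  }

 private:
  std::vector<scoped_stack*> stacks_;
};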

Per-gradient timings

Argh. My git fu is bad… I messed up the branches a bit… OK, so we now have two branches:

  • feature-fast-parallel-ad-2 contains the step 1 changes with the scoped chainable stack, which already speeds up reduce_sum a lot; we should review and merge this first. Draft PR.

  • feature-fast-parallel-ad now contains a further developed version of the above, which runs the reverse sweep over the parallel region. This version avoids the use of nested AD and instead leaves the child scopes intact. The reverse sweep is not run in parallel yet, but that's not hard to add. However, I am not yet entirely happy with the structure of it, but the idea is there. Draft PR is here.

So sorry for the mess with the git history. I still think we should merge feature-fast-parallel-ad-2 first and then work on the more complicated thing.


You've made a lot of progress while I was away; I should take breaks more often! This is great. I'll catch up on the code/PRs this week and will have C++ time next week to contribute.

Really looking forward to this!
