Parallel autodiff v4

I’d say about 90% of the changes are just cleaning up templates and the thing above, as well as using Eigen and algorithm utilities when we need to iterate over containers.

Sounds great. I wasn’t sure if we need a design doc for this, given that we are merely adding a function here and do not throw the AD tape into the air… but I agree that the function will have quite some impact overall (people will write models differently).

Anyway. Today I wanted to revisit the examples posted above. Here are my findings (see below for more details on what is run):

  • nbme: This is merely a bernoulli_logit_lpmf, which is computationally way too cheap. Thus, this thing is super hard to speed up, since each log-lik calculation is so fast. I don’t think we have to make such a model our optimisation target.
  • base: a serial version takes 80s / 1 core reduce_sum 67s / 2 cores 37s / 4 cores 24s => relative to 1 core reduce_sum we get 1.8x for 2 cores and 2.8x for 4 cores
  • bpl: I used 50 replications and grainsize = 500; serial 44s / 1 core reduce_sum 27s / 2 cores 15s / 4 cores 11s => relative to 1 core reduce_sum the speedups are 1.8x for 2 cores and 2.5x for 4 cores

The math branch I used is the newest one from @stevebronder. The serial version / reduce_sum version is defined like this in the Stan code:

if (grainsize < 0) {
  // serial version
  target += functor(1, N, ...);
} else {
  // reduce_sum thing
  target += reduce_sum(functor, sliced_container, grainsize, ...);
}

So using reduce_sum already speeds things up on its own, which is due to the reduction in the AD tree size. At least this is what happened for map_rect… BTW, the apply_vector_reduce framework could be tested with compressing the AD tree by running a nested AD sweep… just an idea for @andrjohns… I would expect this to speed things up; or at least it’s worth a try, I think.

I should probably compare this to the state of affairs before the big refactor from @stevebronder to see how much that bought us… or we just go with the new stuff, @stevebronder (I don’t have any opinion on that ATM).

I’ll do some benching tomorrow to get a sense of the before-after as well.

Here are the numbers I got in a before/after benchmarking of the problems:

bpl_parallel, grainsize = 125, rep = 2
old
8 threads = 10s
1 thread = 30s
new
8 threads = 10s
1 thread = 28s

nbme6, N_subset = 2000, grainsize = 125
old
8 threads = 9s
1 thread = 20s
new
8 threads = 9s
1 thread = 19s

base, N_subset = 1600, grainsize = 125
old
8 threads = 75s
1 thread = 170s
new
8 threads = 65s
1 thread = 165s

It was an improvement in a few places for sure.

Yeah, but we can always increase the grainsize until the amount of work per chunk is large?

No, I don’t think this is always that easy. There is a lot of memory pressure in the example which you brought up. The grainsize helps to control the amount of work per chunk, but if increasing it also means that the memory which has to be moved explodes, it’s not going to help too much.

The nbme example with the bernoulli_logit_lpmf could scale by adjusting the grainsize if you had this situation (I think; see the sketch after the list):

  • You don’t have continuous predictors, only categorical ones which you lump together.
  • You aggregate by group, so the mean does not need to be calculated for every data row but only once per group (the data y for a group stays as a longer array, while only a single mean has to be computed for it).
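
Something like this minimal sketch is what I mean (this is not the actual nbme model; the names, priors, and the single categorical predictor are made up for illustration):

data {
  int<lower=1> G;                        // number of groups
  int<lower=1> N;                        // number of data rows, sorted by group
  int<lower=0, upper=1> y[N];            // binary outcomes
  int<lower=1> group_size[G];            // rows per group
  int<lower=1> K;                        // levels of the categorical predictor
  int<lower=1, upper=K> group_level[G];  // predictor level per group
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  int pos = 1;
  alpha ~ normal(0, 2);
  beta ~ normal(0, 1);
  for (g in 1:G) {
    // the linear predictor is constant within a group, so compute it only once
    real eta = alpha + beta[group_level[g]];
    segment(y, pos, group_size[g]) ~ bernoulli_logit(eta);
    pos += group_size[g];
  }
}

The point is that the data y stays as one long array per group, while only G means (instead of N) have to be computed and stored on the AD tape.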

The amount of memory grows too fast in this example compared to the computational cost. I have read this a few times now: a CPU can be considered infinitely fast… until you have to read from/write to memory, which is orders of magnitude slower.

Also, when you do these tests… please test 2 cores and 4 cores as well (I personally do not care so much about 8 threads; just look at Amdahl’s law and you see that getting a 4x speedup is already good if there is some serial portion of code around which is not in the parallel region).
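
Just to spell out Amdahl’s law: with a parallel fraction p of the work, the best possible speedup on N cores is

$$
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
$$

so even with p = 0.9 you only get about 3.1x on 4 cores and about 4.7x on 8 cores, and never more than 1 / (1 - p) = 10x no matter how many cores you throw at it (the 0.9 is just an illustrative number).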

BTW, I used this bpl implementation which does slicing of Y… argh, it’s on a different laptop; will edit this post…

Why a grainsize of 125? From the article here

A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:

A multiply of two doubles is only 2-7 cycles (there’s a table here with some ballparks).
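
Reading that rule of thumb as a formula (this is just my interpretation of the article), with c the number of clock cycles one iteration of operator() takes,

$$
\text{grainsize} \gtrsim \frac{10^5}{c},
$$

so c ≈ 100 cycles gives the quoted grainsize of about 1000, while a body that really is only a handful of cycles would ask for a grainsize in the tens of thousands.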

I found better results when I had a grainsize of 8-32

@wds15 did you try slapping auto_partitioner in parallel_reduce instead of having the user pick it? It looks like it does clever things

https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/partitioners/auto_partitioner_cls.html

I tried a few things. The auto partitioner is what is used as a default, so that is currently in use. What I thought was interesting is the affinity_partitioner, which memorizes which work chunk gets assigned to which processor in order to improve memory locality. Upon repeated calls of the function you get the same work-chunk-to-processor assignment, which is why you need to declare the thing as static in the function if you use it. I was expecting this to be useful for Stan programs, but it turns out it is not so useful.


You’re right. I think I was starting too high. The “best” number seems to be pretty finicky, but it is small. For all of these I played around to find the best number. This’ll be good advice for the doc.

The scaling-with-data-size results here are weird, given how the problem behaves with more data. Of course, the single-core vs. parallel nbme results don’t make sense.

Here are more performance numbers. I think I’m kinda encouraged by these numbers. nbme6 and base do look like they’re scaling better for bigger problems, which is what we’d want. bpl not really, but it isn’t falling through the floor or anything.

Edit: Sigh, actually it doesn’t look like nbme is getting better with larger problems. I guess this is why you make plots of data like this and automate it. I don’t wanna do that right now though.

bpl_parallel, grainsize = 8, 200 warmup, 100 samples
rep = 2
8 threads = 1.4s
4 threads = 1.8s
1 thread = 4.5s
serial = 5s

rep = 10
8 threads = 6.7s
4 threads = 8.2s
1 thread = 24s
serial = 28s

rep = 20
8 threads = 14s
4 threads = 15s
1 thread = 46s

nbme6, grainsize = 75, 200 warmup, 100 samples
N_subset = 2000
8 threads = 1.6s
1 thread = 5s
serial = 4s

N_subset = 8000
8 threads = 4.8s
1 thread = 13s
serial = 12s

N_subset = 20000
8 threads = 13s
1 thread = 30s

base, grainsize = 32, 200 warmup, 100 samples
N_subset = 800
8 threads = 6s
4 threads = 6s
1 thread = 14s
serial = 12s

N_subset = 1600
8 threads = 12s
4 threads = 13s
1 thread = 33s

N_subset = 3200
8 threads = 57s
4 threads = 72s
1 thread = 210s

For nbme I added a grainsize data param (I set it to 16) and tried with N_subset = 20000 on 16 threads (I have 32 total) for the default 1000 warmup and 1000 samples.

For nbme6 with the above settings:

real 68.88
user 1072.58
sys 13.14

Then serial nbme

real 276.25
user 276.09
sys 0.10

So 276.25/68.88 is about 4x. We should probs start using perf to see what other parts of this we can parallelize

Oh man I feel kinda owned. 8 core putt-putt-ptooo.

Yeah I would be very interested to see this.

When I was putting together those numbers in the table above, it really seemed like something that we should be doing more methodically.

Do you know if perf handles multiple threads correctly? Like are your performance counters just from one thread or are they from all of them? I was using google-perf-tools and I think it only catches one thread (not sure if it’s a particular one or not)

If you set it like below I’m pretty sure it does

STAN_NUM_THREADS=32 sudo perf record -g --freq=max --call-graph dwarf -d --phys-data --per-thread ./examples/bpl/bpl_parallel data file='./examples/bpl/bpl.data.R' sample

then you can get the call graph with

sudo  perf report --call-graph --stdio -G

Oh yeah it has an argument --per-thread so I assume that means it knows what a thread is.

I am not yet convinced that we can seriously speed up nbme, given that what is calculated is a very cheap bernoulli_logit_lpmf. The CPU time needed to calculate that lpmf is barely noticeable, yet you have to copy a lot of stuff around.

However, I did code up a version of nbme which packs the bulk of the parameters into the slicing argument (a sketch of the idea is below). That should improve things a lot… but right now this version does not work at all due to a bug in the reduce_sum code which prevents the partials from being propagated and collected correctly for sliced arguments of type vector[] big_container[]. Maybe I get to that later today… let’s see.
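
To make the idea concrete, here is a minimal sketch of what I mean by packing the parameters into the sliced argument (this is not the actual nbme6 model; all names are made up, and the per-subject parameter is just a length-1 vector for illustration):

functions {
  // the per-subject parameters live in the sliced argument, so each work chunk
  // only has to copy the parameter slices it actually operates on
  real partial_sum(vector[] theta_slice, int start, int end,
                   int[] y, int[] subj_start, int[] subj_size) {
    real lp = 0;
    for (i in 1:(end - start + 1)) {
      int s = start + i - 1;  // global subject index
      lp += bernoulli_logit_lpmf(segment(y, subj_start[s], subj_size[s]) |
                                 theta_slice[i][1]);
    }
    return lp;
  }
}
data {
  int<lower=1> S;                // subjects
  int<lower=1> N;                // data rows, sorted by subject
  int<lower=0, upper=1> y[N];
  int<lower=1> subj_start[S];    // first row of each subject
  int<lower=1> subj_size[S];     // rows per subject
  int<lower=1> grainsize;
}
parameters {
  vector[1] theta[S];            // per-subject parameters; this gets sliced
}
model {
  for (s in 1:S)
    theta[s] ~ normal(0, 2);
  target += reduce_sum(partial_sum, theta, grainsize, y, subj_start, subj_size);
}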

Fair. It looks mostly like simple, memory-intensive stuff too. (edit: oh, you already said that, got it)

The actual problem has N = 160k (instead of 40k). Discourse wouldn’t let me upload the generated file so here’s the stuff to generate it.

fakedata.csv (3.5 MB) nbme_fake.R (1.4 KB)

Now that the bugs with slicing of vars in nested containers are fixed, I was able to run a modified nbme model which packs all the parameters into the slicing structure. This should improve efficiency, and here is what I get on the entire data set running grainsize=0 and 25 warmup + 25 sampling iterations:

rs-1.log:real	2m24.832s 1 core
rs-2.log:real	1m24.716s 2 cores
rs-4.log:real	0m59.005s 4 cores

> r  <- c(2*60+25, 60 + 24, 59) 
> r
[1] 145  84  59
> r / r[1]
[1] 1.0000000 0.5793103 0.4068966
> r[1] / r
[1] 1.000000 1.726190 2.457627
> 

Here is the model:

nbme6_sw1.stan (3.5 KB)

The new model needs a sorted data set which is done with this R script:

nbme.R (1.1 KB)

Looks much better to me (at the cost of a bit more cumbersome coding).

Can you run a quick 25/25 comparison to the nbme6.stan code on at least 4 cores so we have a reference?

Good point - need a baseline!

So running the old nbme on the entire data set, grainsize=0 and 25/25 gives me:

rsOld-1.log:real	2m59.621s
rsOld-2.log:real	1m40.325s
rsOld-4.log:real	1m8.948s

> rOld <- c( 1 * 60 + 9 , 1 * 60 + 40, 3 * 60)
> rOld / rOld[1]
[1] 1.000000 1.449275 2.608696

So the new code is faster, but it’s not a huge jump.

I was just doing some experiments with MPI on this. I think this code is seriously memory bound.

I was doing 4 MPI chains and they took 60 seconds for warmup and then I did 8 MPI chains and it took like 120 seconds.

I guess those MPI chains do interact with adaptation, but it seems more likely that it’s bottlenecked on memory. We’re probably capped on this problem just by memory.

Did the ODE solves scale well?

Edit: Lemme be specific: I’m talking about the nbme1 model on @yizhang’s cross-chain warmup branch with the MPI stuff.

YES! This is what I keep saying. This problem has too much memory pressure relative to computation. I think if you made the problem a binomial log-link model (so add a fake number of cases per data row), then it would scale better. A minimal sketch of that aggregation follows below.
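
To make that suggestion concrete, here is a minimal sketch of the aggregation (not an actual model from this thread; the names are made up, and I am writing the link as logit to match the bernoulli_logit version):

data {
  int<lower=1> M;              // covariate patterns after aggregation
  int<lower=1> K;
  matrix[M, K] X;              // one row per pattern
  int<lower=1> trials[M];      // number of original data rows per pattern
  int<lower=0> successes[M];   // number of y == 1 per pattern
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 1);
  // one likelihood term per pattern: much more computation per byte of data moved
  successes ~ binomial_logit(trials, alpha + X * beta);
}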

ODE results:

RK45 integrator

threads-rs-1.log:real	4m23.146s 1 core
threads-rs-2.log:real	2m14.048s 2 cores
threads-rs-3.log:real	1m51.394s 3 cores
threads-rs-4.log:real	1m41.745s 4 cores

BDF integrator (same problem):
threads-rsBDF-1.log:real	27m14.601s
threads-rsBDF-2.log:real	14m13.856s
threads-rsBDF-3.log:real	9m56.502s
threads-rsBDF-4.log:real	8m31.021s

I used my StanCon Helsinki 2018 contribution. Maybe one needs more subjects to make it scale even better (right now it’s only 32 subjects).
