I’d say about 90% of the changes are just cleaning up templates and the thing above, as well as using Eigen and standard algorithms when we need to iterate over containers.
Sounds great. I wasn’t sure if we needed a design doc for this, given that we are merely adding a function here and are not throwing the AD tape into the air… but I agree that the function will have quite an impact overall (people will write models differently).
Anyway. Today I wanted to revisit the examples posted above. Here are my findings (see below for more details on what is run):
- nbme: This is merely a bernoulli_logit_lpmf, which is computationally way too cheap. This makes it super hard to speed up, since each log-lik calculation is so fast. I don’t think we have to make such a model our target to optimise.
- base: serial version 80s / 1-core reduce_sum 67s / 2 cores 37s / 4 cores 24s => relative to 1-core reduce_sum we get 1.8x for 2 cores and 2.8x for 4 cores
- bpl: I used 50 replications and grainsize = 500; serial 44s / 1-core reduce_sum 27s / 2 cores 15s / 4 cores 11s => relative to 1-core reduce_sum the speedups are 1.8x for 2 cores and 2.5x for 4 cores
The math branch I used is the newest one from @stevebronder. The serial version / reduce_sum version is defined like this in the Stan code:
if (grainsize < 0) {
  // serial version
  target += functor(1, N, ...);
} else {
  // reduce_sum version
  target += reduce_sum(functor, sliced_container, grainsize, ...);
}
So using reduce_sum already speeds things up on its own, which is due to the reduction in the AD tree size. At least this is what happened for map_rect… BTW, the apply_vector_reduce framework could be tested with compressing the AD tree by running a nested AD sweep… just an idea for @andrjohns… I would expect this to speed things up; or at least it’s worth a try, I think.
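To make that nested-sweep idea a bit more concrete, here is a rough C++ sketch (nothing that exists in the branch; compressed_term and the squared-sum body are placeholders for an expensive sub-expression) of evaluating a term in a nested scope and collapsing it into a single precomputed_gradients node on the outer tape:

```cpp
#include <stan/math.hpp>
#include <vector>

using stan::math::var;

// Sketch only: evaluate a (potentially large) sub-expression inside a nested
// AD scope, run the nested reverse sweep, discard the nested tape and put a
// single node with the collapsed partials onto the outer tape.
// Error handling is omitted.
var compressed_term(const std::vector<var>& theta) {
  stan::math::start_nested();
  // copies created inside the nested scope, so the nested sweep does not
  // touch adjoints of anything living on the outer tape
  std::vector<var> theta_local;
  theta_local.reserve(theta.size());
  for (const auto& t : theta)
    theta_local.emplace_back(t.val());
  var f = 0;
  for (const auto& t : theta_local)
    f += stan::math::square(t);  // stand-in for the expensive sub-expression
  double fval = f.val();
  std::vector<double> grad_f;
  f.grad(theta_local, grad_f);          // nested reverse sweep
  stan::math::recover_memory_nested();  // throw the nested tape away
  // one compact node on the outer tape carrying the value and the partials
  return stan::math::precomputed_gradients(fval, theta, grad_f);
}
```

Whether this pays off should depend on how large the nested sub-tape is relative to the number of inputs it depends on.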
I should probably compare this to the state of affairs before the big refactor from @stevebronder to see how much that bought us… or we just go with the new stuff, @stevebronder (I don’t have any opinion on that ATM).
I’ll do some benching tomorrow to get a sense of the before-after as well.
Here are the numbers I got in a before/after benchmarking of the problems:
bpl_parallel, grainsize = 125, rep = 2
old
8 threads = 10s
1 thread = 30s
new
8 threads = 10s
1 thread = 28s
nbme6, N_subset = 2000, grainsize = 125
old
8 threads = 9s
1 thread = 20s
new
8 threads = 9s
1 thread = 19s
base, N_subset = 1600, grainsize = 125
old
8 threads = 75s
1 thread = 170s
new
8 threads = 65s
1 thread = 165s
It was an improvement in a few places for sure.
Yeah, but can’t we always increase the grainsize until the amount of work per chunk is large?
No, I don’t think this is always that easy. There is a lot of memory pressure in the example which you brought up. The grainsize helps to control the amount of work per chunk, but if increasing it also means that the memory which has to be moved explodes, it’s not going to help too much.
The nbme example with the bernoulli_logit_lpmf could scale by adjusting the grainsize if you had this case (I think):
- You don’t have continuous predictors, but only categorical ones which you lump together.
- You aggregate by group, so that the mean does not need to be calculated for every data row but only once per group (the data y for a group stays as a longer array, but only a single mean needs to be computed for it).
The amount of memory grows too fast in this example compared to the computational cost. I have read this a few times now: a CPU can be considered infinitely fast… until you have to read from / write to memory, which is orders of magnitude slower.
Also, when you do these tests… please test 2 cores and 4 cores as well (I personally do not care so much about 8 threads - just look at Amdahl’s law and you see that getting a 4x speedup is already good if there is some serial portion of code around which is not in the parallel region).
BTW, I used this bpl implementation which does slicing of Y … argh on a different laptop; will edit this post…
Why a grainsize of 125? From the article here
A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:
A multiply of two doubles is only 2-7 cycles (there’s a table here with some ballparks).
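To spell out the arithmetic behind that rule of thumb, here is a tiny C++ sketch; the per-iteration cycle counts are made-up illustrations, not measurements from these models:

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Rule-of-thumb grainsize: one chunk should cost at least ~100,000 cycles,
// i.e. grainsize >= target_cycles / cycles_per_iteration.
std::int64_t rule_of_thumb_grainsize(std::int64_t cycles_per_iteration,
                                     std::int64_t target_cycles = 100000) {
  return std::max<std::int64_t>(1, target_cycles / cycles_per_iteration);
}

int main() {
  // ~100 cycles per iteration (the example from the quote) -> ~1000
  std::cout << rule_of_thumb_grainsize(100) << "\n";
  // a bare multiply of two doubles (a few cycles) -> tens of thousands
  std::cout << rule_of_thumb_grainsize(5) << "\n";
  // an expensive log-lik term of ~10,000 cycles -> a grainsize in the tens
  std::cout << rule_of_thumb_grainsize(10000) << "\n";
}
```

So a small grainsize is not necessarily at odds with the rule of thumb, as long as each log-lik term is itself expensive.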
I found better results when I had a grainsize of 8-32
@wds15 did you try slapping auto_partitioner in parallel_reduce instead of having the user pick it? It looks like it does clever things
I tried a few things. The auto_partitioner is what is used as a default - so that is currently in use. What I thought was interesting is the affinity_partitioner, which memorizes which work chunk gets assigned to which processor in order to improve memory locality. Upon repeated calls of the function you get the same work-chunk-to-processor assignment, which is why you need to declare the thing as static in the function if you use it. I was expecting this to be useful for Stan programs, but it turns out it is not so useful.
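For reference, a minimal TBB-only sketch of that pattern (the summation body is just a placeholder), showing why the partitioner has to be declared static so it can remember the chunk placement across calls:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/partitioner.h>
#include <cstddef>
#include <functional>
#include <vector>

// The affinity_partitioner instance has to outlive the call (hence static)
// so that TBB can replay the chunk-to-thread assignment when the same
// function is called again, improving cache locality.
double sum_with_affinity(const std::vector<double>& x) {
  static tbb::affinity_partitioner ap;  // remembers the chunk placement
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, x.size()), 0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += x[i];
        return partial;
      },
      std::plus<double>(), ap);
}
```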
You’re right. I think I was starting too high. The “best” number seems to be pretty finicky, but it is small. For all of these I played around to find the best number. This’ll be good advice for the doc.
The scaling-with-data-size results here are weird because of the problem with more data. Of course the single-core vs. parallel nbme results don’t make sense.
Here are more performance numbers. I think I’m kinda encouraged by these numbers. nbme6 and base do look like they’re scaling better for bigger problems, which is what we’d want. bpl not really, but it isn’t falling through the floor or anything.
Edit: Sigh, actually it doesn’t look like nbme is getting better with larger problems. I guess this is why you make plots of data like this and automate it. I don’t wanna do that right now though.
bpl_parallel, grainsize = 8, 200 warmup, 100 samples
rep = 2
8 threads = 1.4s
4 threads = 1.8s
1 thread = 4.5s
serial = 5s
rep = 10
8 threads = 6.7s
4 threads = 8.2s
1 thread = 24s
serial = 28s
rep = 20
8 threads = 14s
4 threads = 15s
1 thread = 46s
nbme6, grainsize = 75, 200 warmup, 100 samples
N_subset = 2000
8 threads = 1.6s
1 thread = 5s
serial = 4s
N_subset = 8000
8 threads = 4.8s
1 thread = 13s
serial = 12s
N_subset = 20000
8 threads = 13s
1 thread = 30s
base, grainsize = 32, 200 warmup, 100 samples
N_subset = 800
8 threads = 6s
4 threads = 6s
1 thread = 14s
serial = 12s
N_subset = 1600
8 threads = 12s
4 threads = 13s
1 thread = 33s
N_subset = 3200
8 threads = 57s
4 threads = 72s
1 thread = 210s
For nbme I added a grainsize data param (I set it to 16) and tried with N_subset = 20000 and 16 threads (I have 32 total) for the default 1000 warmup and 1000 samples.
For nbme6 with the above settings:
real 68.88
user 1072.58
sys 13.14
Then serial nbme
real 276.25
user 276.09
sys 0.10
So 276.25/68.88 is about 4x. We should probs start using perf to see what other parts of this we can parallelize
Oh man I feel kinda owned. 8 core putt-putt-ptooo.
Yeah I would be very interested to see this.
When I was putting together those numbers in the table above, it really seemed like something that we should be doing more methodically.
Do you know if perf handles multiple threads correctly? Like are your performance counters just from one thread or are they from all of them? I was using google-perf-tools and I think it only catches one thread (not sure if it’s a particular one or not)
If you set it up like below, I’m pretty sure it does:
STAN_NUM_THREADS=32 sudo perf record -g --freq=max --call-graph dwarf -d --phys-data --per-thread ./examples/bpl/bpl_parallel data file='./examples/bpl/bpl.data.R' sample
then you can get the call graph with
sudo perf report --call-graph --stdio -G
Oh yeah it has an argument --per-thread so I assume that means it knows what a thread is.
I am not yet convinced that we can seriously speed up nbme, given that what gets calculated is a very cheap bernoulli_logit_lpmf. The CPU time needed to calculate that lpmf is barely noticeable, yet you have to copy a lot of stuff around.
However, I did code up a version of nbme which packs the bulk of the parameters into the slicing argument. That should improve things a lot… but right now this version does not work at all due to a bug in the reduce_sum code which prevents the partials from being propagated and collected correctly for a sliced argument of type vector[] big_container[] (an array of vectors). Maybe I’ll get to that later today… let’s see.
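Just to illustrate what “collecting the partials” means for such a sliced argument — this is not the actual reduce_sum code, only the shape of the operation it has to get right for an array-of-vectors type:

```cpp
#include <stan/math.hpp>
#include <vector>

using stan::math::var;
using vector_v = Eigen::Matrix<var, Eigen::Dynamic, 1>;

// After the local reverse sweep inside a worker, the partials for a sliced
// argument of type vector[] (an array of vectors of var) are read off by
// walking every element of the nested container and taking its adjoint.
void collect_partials(const std::vector<vector_v>& slice,
                      std::vector<double>& partials) {
  for (const auto& v : slice)
    for (Eigen::Index i = 0; i < v.size(); ++i)
      partials.push_back(v.coeff(i).adj());
}
```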
Fair. It looks mostly like simple, memory-intensive stuff too. (edit: oh, you already said that, got it)
The actual problem has N = 160k (instead of 40k). Discourse wouldn’t let me upload the generated file so here’s the stuff to generate it.
fakedata.csv (3.5 MB) nbme_fake.R (1.4 KB)
Now that the bugs with slicing of vars in nested containers are fixed, I was able to run a modified nbme model which packs all the parameters into the slicing structure. This should improve the efficiency, and here is what I get on the entire data set running grainsize=0 and 25 warmups + 25 iterations:
rs-1.log:real 2m24.832s 1 core
rs-2.log:real 1m24.716s 2 cores
rs-4.log:real 0m59.005s 4 cores
> r <- c(2*60+25, 60 + 24, 59)
> r
[1] 145 84 59
> r / r[1]
[1] 1.0000000 0.5793103 0.4068966
> r[1] / r
[1] 1.000000 1.726190 2.457627
>
Here is the model:
nbme6_sw1.stan (3.5 KB)
The new model needs a sorted data set which is done with this R script:
nbme.R (1.1 KB)
Looks much better to me (at the cost of a bit more cumbersome coding).
Can you run a quick 25/25 comparison to the nbme6.stan code on at least 4 cores so we have a reference?
Good point - need a baseline!
So running the old nbme on the entire data set, grainsize=0 and 25/25 gives me:
rsOld-1.log:real 2m59.621s
rsOld-2.log:real 1m40.325s
rsOld-4.log:real 1m8.948s
> rOld <- c( 1 * 60 + 9 , 1 * 60 + 40, 3 * 60)
> rOld / rOld[1]
[1] 1.000000 1.449275 2.608696
So the new code is faster, but it’s not a huge jump.
I was just doing some experiments with MPI on this. I think this code is seriously memory bound.
I was doing 4 MPI chains and they took 60 seconds for warmup and then I did 8 MPI chains and it took like 120 seconds.
I guess those MPI chains do interact with adaptation, but it seems more likely that it’s bottlenecked on memory. We’re probably capped on this problem just by memory.
Did the ODE solves scale well?
Edit: Lemme be specific – I’m talking about the nbme1 model on @yizhang’s cross-chain warmup branch with the MPI stuff.
YES! This is what I keep saying. This problem has too much memory pressure vs. computation. I think if you made the problem a binomial with a logit link (so add a fake number of cases per data row), then it would scale better.
ODE results:
RK45 integrator
threads-rs-1.log:real 4m23.146s 1 core
threads-rs-2.log:real 2m14.048s 2 cores
threads-rs-3.log:real 1m51.394s 3 cores
threads-rs-4.log:real 1m41.745s 4 cores
BDF integrator (same problem):
threads-rsBDF-1.log:real 27m14.601s
threads-rsBDF-2.log:real 14m13.856s
threads-rsBDF-3.log:real 9m56.502s
threads-rsBDF-4.log:real 8m31.021s
I used my StanCon Helsinki 2018 contribution. Maybe one needs more subjects to make it scale even better (right now it’s only 32 subjects).