I’m curious what the best way is, and what actually happens, when nesting reduce_sum calls.

I have N time series, each with M_n entries, and I want to call reduce_sum over the N series and then again inside each series, since the computations are very long.

Is this smart?

What actually happens?

Is there some special way to set up the grainsize to control this?

Currently I have just been calling reduce_sum manually on each of the N time series, but I have read in other posts that this is not the best way to go.

for (n in 1:N) {
  for (m in 1:M) {
    target += myfunc(n, m);
  }
}

Is that correct?

And you can parallelize over M like the following and get it to work, right?

for (n in 1:N) {
  target += parallel_myfunc(n);
}

And then the question is whether you can do that last loop with a reduce_sum as well. I’m pretty sure we tested recursive reduce_sum, so I don’t see any reason it shouldn’t work.
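A nested call would look roughly like the sketch below. This is only an illustration of the structure: the function names (`inner_partial`, `outer_partial`), the index arrays (`n_idx`, `m_idx`), and the assumption that every series has the same M entries are all made up for the example, and `myfunc` stands in for whatever your likelihood term is.

```
functions {
  // inner partial sum: slices over entries m of one series n
  real inner_partial(array[] int m_slice, int start, int end, int n) {
    real lp = 0;
    for (m in m_slice) {
      lp += myfunc(n, m);
    }
    return lp;
  }
  // outer partial sum: slices over series; each series launches
  // its own nested reduce_sum over its entries
  real outer_partial(array[] int n_slice, int start, int end,
                     array[] int m_idx, int inner_grainsize) {
    real lp = 0;
    for (n in n_slice) {
      lp += reduce_sum(inner_partial, m_idx, inner_grainsize, n);
    }
    return lp;
  }
}
model {
  target += reduce_sum(outer_partial, n_idx, outer_grainsize,
                       m_idx, inner_grainsize);
}
```

Note that the two grainsizes (`outer_grainsize`, `inner_grainsize`) are tuned independently, which is exactly the tuning game discussed further down.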

I think it would be preferable to do it all in one loop, though (not totally sure). What I mean is: convert the initial code to

for (j in 1:J) {
  target += myfunc(ns[j], ms[j]);
}

and then use one reduce_sum over that. There is overhead in every reduce_sum call, so minimizing the number of calls is good.
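A sketch of that flattened version, assuming `ns` and `ms` are precomputed index arrays of length J (e.g. built in `transformed data` by unrolling the (n, m) pairs) and `j_idx` is just `1:J`; names are illustrative:

```
functions {
  real flat_partial(array[] int j_slice, int start, int end,
                    array[] int ns, array[] int ms) {
    real lp = 0;
    for (j in j_slice) {
      lp += myfunc(ns[j], ms[j]);
    }
    return lp;
  }
}
model {
  // one reduce_sum over all J = sum(M_n) terms
  target += reduce_sum(flat_partial, j_idx, grainsize, ns, ms);
}
```

With this layout the scheduler can balance the work freely across all J terms, rather than being constrained by the series boundaries.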

You might also not benefit from more parallelization if the work done parallelizing over M is already enough. If, for each n, the work already saturates your machine, then I wouldn’t expect adding more reduce_sum calls to help.

The only other thing to mention is that the overhead of reduce_sum calls matters less the more compute-intensive your function is.

One way to find out about the grainsize is to try it out. The Intel TBB folks give advice on how to choose it (I posted the link a few times, and Google is your friend).

To see the overhead due to chunking, you can use reduce_sum_static and set up the grainsize such that you calculate 1, 2, 4, 8, … chunks of partial sums. Running this with just 1 core should show you the overhead you create with smaller partial sums. Nesting reduce_sum makes the tuning game just more fun…
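As a concrete version of that experiment, assuming J flattened terms and the illustrative `flat_partial`/`j_idx`/`ns`/`ms` names from above: reduce_sum_static (unlike plain reduce_sum) chunks deterministically by grainsize, so halving the grainsize doubles the number of chunks, and comparing single-thread wall times against the one-chunk baseline isolates the chunking overhead.

```
model {
  // grainsize = J      -> 1 chunk  (baseline, no chunking overhead)
  // grainsize = J / 2  -> 2 chunks
  // grainsize = J / 4  -> 4 chunks, and so on
  target += reduce_sum_static(flat_partial, j_idx, grainsize, ns, ms);
}
```

Pass `grainsize` in as data so you can rerun the same model with J, J/2, J/4, … without recompiling.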

Thanks. Yes, I read a bit on TBB a while back. It is not always intuitive how TBB and the AD tape work together. But I’ve tried the nested version, and it is really interesting. Since I’m calling many more operations at once, there is a huge speed gain with the number of cores, but it is pretty sensitive to the grainsize combination. I’m not sure how much this generalizes, but if anyone in the future is curious:

On the outer “group” loop with few objects, grainsize = 1 seems to be best.
For the inner loop, grainsize ≈ (1–2) * n_objects / n_threads seems to be best, but it is likely cache dependent.