I’m curious what the best way is, and what actually happens, when nesting reduce_sum calls.

I have N time series, each with M_n entries, and I want to call reduce_sum over the N series and then again inside each series, since the computations are very long.

Is this smart?

What actually happens?

Is there some special way to set up the grainsize to control this?

Currently I have just been calling reduce_sum manually on each of the N time series, but I have read in other posts that this is not the best way to go.

for (n in 1:N) {
  for (m in 1:M) {
    target += myfunc(n, m);
  }
}

Is that correct?

And you can parallelize over M like the following and get it to work, right?

for (n in 1:N) {
  target += parallel_myfunc(n);
}

And then the question is whether you can do that last loop with a reduce_sum as well. I’m pretty sure we tested recursive reduce_sum, so I don’t see any reason it shouldn’t work.
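A nested call would look roughly like the sketch below. This is only an illustration of the structure: the function names (`inner_partial`, `outer_partial`), the index arrays (`n_idx`, `m_idx`), and the assumption that every series has the same M entries are all made up for the example, and `myfunc` stands in for whatever your likelihood term is.

```
functions {
  // inner partial sum: slices over entries m of one series n
  real inner_partial(array[] int m_slice, int start, int end, int n) {
    real lp = 0;
    for (m in m_slice) {
      lp += myfunc(n, m);
    }
    return lp;
  }
  // outer partial sum: slices over series; each series launches
  // its own nested reduce_sum over its entries
  real outer_partial(array[] int n_slice, int start, int end,
                     array[] int m_idx, int inner_grainsize) {
    real lp = 0;
    for (n in n_slice) {
      lp += reduce_sum(inner_partial, m_idx, inner_grainsize, n);
    }
    return lp;
  }
}
model {
  target += reduce_sum(outer_partial, n_idx, outer_grainsize,
                       m_idx, inner_grainsize);
}
```

Note that the two grainsizes (`outer_grainsize`, `inner_grainsize`) are tuned independently, which is exactly the tuning game discussed further down.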

I think it would be preferable to do it all in one loop, though (not totally sure). What I mean is: convert the initial code to

for (j in 1:J) {
  target += myfunc(ns[j], ms[j]);
}

and then use one reduce_sum over that. There is overhead in every reduce_sum call, so minimizing the number of calls is good.
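A sketch of that flattened version, assuming `ns` and `ms` are precomputed index arrays of length J (e.g. built in `transformed data` by unrolling the (n, m) pairs) and `j_idx` is just `1:J`; names are illustrative:

```
functions {
  real flat_partial(array[] int j_slice, int start, int end,
                    array[] int ns, array[] int ms) {
    real lp = 0;
    for (j in j_slice) {
      lp += myfunc(ns[j], ms[j]);
    }
    return lp;
  }
}
model {
  // one reduce_sum over all J = sum(M_n) terms
  target += reduce_sum(flat_partial, j_idx, grainsize, ns, ms);
}
```

With this layout the scheduler can balance the work freely across all J terms, rather than being constrained by the series boundaries.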

You might also not benefit from more parallelization if the work done parallelizing over M is already enough. If, for each n, the work already saturates your machine, then I wouldn’t expect adding more reduce_sum calls to help.

The only other thing to mention is that the overhead of reduce_sum calls matters less the more compute-intensive your function is.

One way to find out about the grainsize is to try it out. The Intel TBB folks give advice on how to choose it (I posted the link a few times, and Google is your friend).

To see the overhead due to chunking, you can use reduce_sum_static and set up the grainsize such that you calculate 1, 2, 4, 8, … chunks of partial sums. Running this with just 1 core should show you the overhead you create with smaller partial sums. Nesting reduce_sum makes the tuning game just more fun…
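As a concrete version of that experiment, assuming J flattened terms and the illustrative `flat_partial`/`j_idx`/`ns`/`ms` names from above: reduce_sum_static (unlike plain reduce_sum) chunks deterministically by grainsize, so halving the grainsize doubles the number of chunks, and comparing single-thread wall times against the one-chunk baseline isolates the chunking overhead.

```
model {
  // grainsize = J      -> 1 chunk  (baseline, no chunking overhead)
  // grainsize = J / 2  -> 2 chunks
  // grainsize = J / 4  -> 4 chunks, and so on
  target += reduce_sum_static(flat_partial, j_idx, grainsize, ns, ms);
}
```

Pass `grainsize` in as data so you can rerun the same model with J, J/2, J/4, … without recompiling.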

Thanks. Yes, I read a bit on TBB a while back. It is not always intuitive how TBB and the AD tape work together. But I’ve tried the nested version, and it is really interesting. Since I’m calling many more operations at once, there is a huge speed gain with the number of cores, but it is pretty sensitive to the grainsize combination. I’m not sure how much this generalizes, but if anyone in the future is curious:

On the outer “group” loop with few objects, grainsize = 1 seems to be best.
For the inner loop, grainsize ≈ (1–2) * n_objects / n_threads seems to be best, but it is likely cache dependent.