Grainsize when (chunks/cores) < 2

saudiwin · March 11, 2021, 7:44am

Hi @rok_cesnovar (or whoever wrote the parallelization section in the manual for reduce_sum),

when using reduce_sum_static with a manually set grainsize, does it make any sense to increase the grainsize when N terms / M cores < 2? That is, suppose there are 50 chunks and 35 cores. Is grainsize=1 necessarily the upper limit for efficiency, given that there isn’t more than one chunk per core? Or is it still worth experimenting with larger grainsizes?
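To make the arithmetic in the question concrete, here is a rough sketch of how many chunks a given grainsize produces for 50 terms and how many of 35 cores could be busy at once. It assumes the terms are simply cut into equal pieces of at most grainsize terms; that is a simplification, and the names `n_chunks` and `max_busy_cores` are made up for illustration. The real partitioning and scheduling are up to TBB and may differ.

```python
import math

def n_chunks(n_terms, grainsize):
    # ASSUMPTION: equal-size chunking, each chunk holds at most `grainsize` terms.
    return math.ceil(n_terms / grainsize)

def max_busy_cores(n_terms, grainsize, n_cores):
    # At most one chunk per core at a time, so the busy cores are
    # capped by both the number of chunks and the number of cores.
    return min(n_chunks(n_terms, grainsize), n_cores)

for g in (1, 2, 3):
    print(g, n_chunks(50, g), max_busy_cores(50, g, 35))
```

Under this simplification, any grainsize above 1 leaves fewer than 35 chunks, so some of the 35 cores necessarily sit idle.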

Thanks much,

Bob

rok_cesnovar · March 11, 2021, 9:43am

It’s not guaranteed that all 35 cores will be used; with STAN_NUM_THREADS=35 we merely say that 35 is the maximum number of cores we wish to occupy. The actual scheduling is left to the TBB scheduler.

Larger chunks can be better when grainsize=1 splits the work into pieces that are too small, so that the overhead of creating/starting/copying to threads/tasks is non-negligible compared to the actual work.

So I would say it’s unlikely that anything above 1 will be useful, but it could happen. I understand this isn’t really useful advice though, but this st

I think @wds15 wrote that section, I merely reorganized it a bit recently.

saudiwin · March 11, 2021, 10:40am

Ok so if the computational overhead were really high such that using all 35 cores would actually be slower than using 15 cores with more chunks per core, it could be better to have a grainsize > 1. But it implies that the reduce_sum is pretty inefficient.
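A toy cost model may help reason about this trade-off. It is only a sketch under strong assumptions: chunks are taken to run in "waves" of at most one chunk per core, every chunk pays the same fixed overhead, and all cores are actually available. The function name `toy_wall_time` and the overhead and per-term costs are invented for illustration, not measured, and this is not how the TBB scheduler is implemented.

```python
import math

def toy_wall_time(n_terms, grainsize, n_cores, overhead, work_per_term):
    # ASSUMPTION: equal-size chunks of at most `grainsize` terms.
    n_chunks = math.ceil(n_terms / grainsize)
    # ASSUMPTION: chunks run in waves, one chunk per core per wave.
    waves = math.ceil(n_chunks / n_cores)
    # Each wave costs one per-chunk overhead plus the work of one full chunk.
    return waves * (overhead + grainsize * work_per_term)

# 50 terms on 35 cores with made-up costs (overhead equal to one term's work).
times = {g: toy_wall_time(50, g, 35, overhead=1.0, work_per_term=1.0)
         for g in (1, 2, 5)}
print(times)
```

With these made-up numbers, grainsize=2 comes out ahead of grainsize=1 (one wave instead of two), while grainsize=5 is worse than both. The ordering depends entirely on the assumed overhead and per-term cost, so it is something to benchmark rather than read off the sketch.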

wds15 · March 11, 2021, 12:54pm

I would not say that reduce_sum is pretty inefficient (that’s just the usual overhead due to parallelisation). First, make sure you use 2.26.1 which contains improvements to it. It can still make sense to have more cores than chunks to work on in case you nest reduce_sum calls, for example (I would not recommend that, but it can happen).

For the overhead story, have a look here:

https://cran.r-project.org/web/packages/brms/vignettes/brms_threading.html

rok_cesnovar · March 11, 2021, 1:10pm

I think what @saudiwin was trying to say is that in that case the use of reduce_sum is inefficient. Like doing a vector addition on 20 elements with 16 threads/tasks.
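The 20-element example can be sketched in Python with a thread pool standing in for the tasks. This only illustrates the partial-sum idea; `chunked_sum` is a hypothetical helper, not Stan's or TBB's API, and the thread pool is an assumption made for the illustration. The total is the same for every grainsize; what changes is how many tiny tasks get created to compute it.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(values, grainsize, max_workers):
    # Cut the input into slices of at most `grainsize` elements.
    slices = [values[i:i + grainsize] for i in range(0, len(values), grainsize)]
    # One task per slice computes a partial sum.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_sums = list(pool.map(sum, slices))
    # Combine the partial sums; also report how many tasks were created.
    return sum(partial_sums), len(slices)

values = list(range(1, 21))  # 20 elements
total_fine, tasks_fine = chunked_sum(values, grainsize=1, max_workers=16)
total_coarse, tasks_coarse = chunked_sum(values, grainsize=5, max_workers=16)
print(total_fine, tasks_fine, total_coarse, tasks_coarse)
```

With grainsize=1 each of the 20 tasks "sums" a single number, so nearly all of the time goes into task handling rather than arithmetic.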

saudiwin · March 11, 2021, 1:34pm

Yes exactly! My apologies @wds15 if you thought I was criticizing the code. I’m just trying to wrap my mind around how the scheduler handles varying levels of overhead given more or less efficient parameterizations of models.

jsocolar · March 11, 2021, 3:22pm

Relatedly, with grainsize = 1 does the scheduler learn the grainsize as it goes, or does it come up with some heuristic grainsize at the outset and stick with it?

I ask because if it takes the scheduler a little while to settle into an optimal grainsize, then I worry that many of my attempts to check the efficiency of reduce_sum with grainsize = 1 have suffered from examining runs that are too short to allow the scheduler to find a good grainsize.

wds15 · March 11, 2021, 7:27pm

The reduce_sum call does not store any scheduling information between repeated calls. As long as things are well repeatable (fixed seed), short runs should be fine.
