Hi @rok_cesnovar (or whoever wrote the parallelization section in the manual for reduce_sum),
when using reduce_sum_static with a manually set grainsize, does it make any sense to increase the grainsize when N terms / M cores < 2? That is, suppose there are 50 chunks and 35 cores. Is grainsize=1 necessarily the upper limit for efficiency, given that there are fewer than 2 chunks per core? Or is it still worth experimenting with larger grainsizes?
It's not guaranteed that all 35 cores will be used; with STAN_NUM_THREADS=35 we merely say that 35 is the maximum number of cores we wish to occupy. The actual scheduling is left to the TBB scheduler.
Larger chunks can be better in the case where grainsize=1 splits the work into pieces so small that the overhead of creating/starting/copying to threads/tasks is non-negligible compared to the actual work.
So I would say it's unlikely that anything above 1 will be useful, but it could happen. I understand this isn't really useful advice though, but this st
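To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not how TBB actually schedules work: it assumes equal-sized chunks of exactly `grainsize` terms, that all cores are used, and that chunks run in synchronized "waves". The names `work_per_term` and `overhead_per_chunk` are made up for illustration.

```python
import math


def estimated_wall_time(n_terms, n_cores, grainsize,
                        work_per_term=1.0, overhead_per_chunk=0.0):
    """Toy estimate of wall time for a chunked parallel sum.

    Assumptions (not TBB's real behaviour): chunks have exactly
    `grainsize` terms, every core is available, and chunks are
    processed in synchronized waves of at most `n_cores` chunks.
    """
    n_chunks = math.ceil(n_terms / grainsize)
    waves = math.ceil(n_chunks / n_cores)
    chunk_time = overhead_per_chunk + grainsize * work_per_term
    return n_chunks, waves, waves * chunk_time


# 50 terms on 35 cores, with and without a per-chunk overhead.
for overhead in (0.0, 0.5):
    for grainsize in (1, 2, 5):
        chunks, waves, t = estimated_wall_time(
            50, 35, grainsize, overhead_per_chunk=overhead)
        print(f"overhead={overhead} grainsize={grainsize}: "
              f"{chunks} chunks, {waves} wave(s), time={t}")
```

Under these assumptions, grainsize=1 needs two waves of one-term chunks while grainsize=2 needs one wave of two-term chunks, so they tie when overhead is zero and grainsize=2 comes out slightly ahead once each chunk carries a fixed cost. A real run may well behave differently, so this only shows why a value above 1 is not ruled out.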
I think @wds15 wrote that section, I merely reorganized it a bit recently.
OK, so if the computational overhead were really high, such that using all 35 cores would actually be slower than using 15 cores with more chunks per core, it could be better to have a grainsize > 1. But that implies that the reduce_sum is pretty inefficient.
I would not say that reduce_sum is pretty inefficient (that’s just the usual overhead due to parallelisation). First, make sure you use 2.26.1, which contains improvements to it. It can still make sense to have more cores than chunks to work on in case you nest reduce_sum calls, for example (I would not recommend that, but it can happen).
For the overhead story, have a look here:
I think what @saudiwin was trying to say is that in that case the use of reduce_sum is inefficient. Like doing a vector addition on 20 elements with 16 threads/tasks.
Yes exactly! My apologies @wds15 if you thought I was criticizing the code. I’m just trying to wrap my mind around how the scheduler handles varying levels of overhead given more or less efficient parameterizations of models.
With grainsize = 1, does the scheduler learn the grainsize as it goes, or does it come up with some heuristic grainsize at the outset and stick with it?
I ask because if it takes the scheduler a little while to settle into an optimal grainsize, then I worry that many of my attempts to check the efficiency of grainsize = 1 have suffered from examining runs that are too short to allow the scheduler to find a good grainsize.
The reduce_sum call does not store any scheduling information between repeated calls. As long as things are well repeatable (fixed seed), short runs should be fine.
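A short comparison along those lines could be scripted roughly as below. This is only a sketch: `run_model` is a hypothetical callable that you would replace with whatever launches your Stan model for a given grainsize and seed, and `dummy_run` is a stand-in workload so the snippet runs on its own.

```python
import time


def time_grainsizes(run_model, grainsizes, seed=1234):
    """Time one short run per grainsize, reusing the same seed each time.

    `run_model` is a hypothetical callable taking `grainsize` and `seed`
    keyword arguments; it is not part of any Stan interface.
    """
    timings = {}
    for grainsize in grainsizes:
        start = time.perf_counter()
        run_model(grainsize=grainsize, seed=seed)
        timings[grainsize] = time.perf_counter() - start
    return timings


def dummy_run(grainsize, seed):
    # Stand-in for a real model run; ignores its arguments.
    return sum(i * i for i in range(10_000))


timings = time_grainsizes(dummy_run, [1, 2, 5])
for grainsize, seconds in timings.items():
    print(f"grainsize={grainsize}: {seconds:.4f} s")
```

Keeping the seed fixed across grainsizes means each run does the same work, so differences in timing are more likely to come from the chunking than from the sampler taking a different path.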