Hi, I understand that when you use reduce_sum with grainsize = 1, an internal scheduler decides on the slice size.
But what exactly happens when you use a different grainsize with reduce_sum?
Also, what exactly happens when you use reduce_sum_static?
In my experience, grainsize = 1 never gives the quickest run. I also can't find a consistent pattern when I follow the suggestion in the manual. Could someone give a better explanation? Much appreciated! Thanks.