Grainsize, reduce_sum, reduce_sum_static

Hi, I understand that when using reduce_sum, setting grainsize = 1 lets the internal scheduler decide on the slice size.

But what exactly happens when you use a different grainsize with reduce_sum?

Also, what exactly happens when you use reduce_sum_static?

In my experience, grainsize = 1 never gives the quickest run. I also can't find a consistent pattern when I follow the suggestion in the manual. Could someone give a better explanation? Much appreciated! Thanks.