Reduce_sum: choosing how to split data across shards

I’m using reduce_sum for within-chain parallelization of a hierarchical model where one factor (with many levels) is the grouping term for multiple random effects. It occurs to me that if I split the data (approximately or exactly) along the breaks in the grouping level, then each shard would only need to work with a small subset of the random-effect parameters.

  1. Is splitting the data in this way (as opposed to passing out arbitrary chunks of data that slice across numerous random effect levels) likely to yield improved performance?
  2. If so, do I need to do anything in my code to inform reduce_sum which levels of a random effect are relevant for a given shard? Because there are so many random-effect levels and I have access to a large number of cores, I’m concerned not only about the speed of the differentiation but also about the overhead of passing all of those parameters to each worker.

reduce_sum is really built to avoid having to deal with sharding. Have you run your model already? If you think you need to do better, then just play with the grainsize. Only if that does not help should you think about changing how you code it.
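
For orientation, here is a generic random-intercept sketch showing where the grainsize goes; it is just the third argument of reduce_sum. The names (y, group, x, alpha, beta) are made up for illustration and are not from your model:

```stan
functions {
  // partial log-likelihood over the contiguous slice y[start:end]
  real partial_sum(array[] int y_slice, int start, int end,
                   array[] int group, vector alpha,
                   real beta, vector x) {
    return bernoulli_logit_lpmf(y_slice | alpha[group[start:end]]
                                          + beta * x[start:end]);
  }
}
data {
  int<lower=1> N;
  int<lower=1> J;                         // number of grouping levels
  array[N] int<lower=0, upper=1> y;
  array[N] int<lower=1, upper=J> group;
  vector[N] x;
}
parameters {
  vector[J] alpha;                        // random intercepts
  real beta;
  real<lower=0> sigma_alpha;
}
model {
  int grainsize = 1;                      // 1 = let the scheduler choose
  alpha ~ normal(0, sigma_alpha);
  beta ~ normal(0, 1);
  sigma_alpha ~ normal(0, 1);
  target += reduce_sum(partial_sum, y, grainsize, group, alpha, beta, x);
}
```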

Splitting hierarchical models along the groupings does make sense… but it is often harder to code, which is why I very much encourage everyone to start simple (you should still get nice speed gains).

EDIT: I really designed reduce_sum so that people can forget about sharding… and it’s again the first question I get… sigh…

Thanks so much for the quick reply. I’m currently playing around with various toy versions of the model that I will ultimately fit, and as I scale up I’m looking to milk every bit of efficiency that I can out of the implementation.

I wasn’t aware that I should use some terminology other than ‘sharding’ to describe what reduce_sum does; thanks for that clarification.

I’ll note that if I use reduce_sum, I can still sort the data so that each thread deals with a minimal number of the random-effect levels, but it’s unclear to me whether that matters: whether it would matter for map_rect, and if so, whether reduce_sum is built to take advantage of whatever speedup is available.
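
Concretely (with throwaway names y, group, x, N, not my real data), I mean something as simple as reordering in transformed data so that the contiguous slices reduce_sum forms mostly touch contiguous random-effect levels:

```stan
transformed data {
  // reorder observations by group so that any contiguous slice that
  // reduce_sum forms touches as few random-effect levels as possible;
  // this could just as well be done before the data reach Stan
  array[N] int ord = sort_indices_asc(group);
  array[N] int y_sorted = y[ord];
  array[N] int group_sorted = group[ord];
  vector[N] x_sorted = x[ord];
}
```

and then passing y_sorted, group_sorted, and x_sorted to reduce_sum instead of the raw versions.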

EDIT: I think I’m at the spot in my learning about Stan where I had just about managed to grasp map_rect, and then reduce_sum came out (it’s fantastic, by the way), so I’m slowly updating my thinking about the problem from a map_rect mindset to a reduce_sum mindset.

No worries.

My advice:

  • get it working correctly first
  • play a bit with the grainsize in case each unit of computation is very small (for example, a single bernoulli_logit likelihood term)… otherwise the automatic choice is usually just fine in my experience
  • if you use reduce_sum_static, then the grainsize must be chosen sensibly
  • using the hierarchical grouping structure does make sense (a sketch of what that can look like follows this list)
  • pack as many parameters as you can into the first slicing argument (but I would not do that at the price of the readability of your program)… reduce_sum was built to avoid packing/unpacking
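
Here is what the grouping-based version can look like in outline. This is only a generic sketch with assumed names, none of which come from your model: the data are presumed sorted by group, and g_start[j]/g_end[j] are assumed to hold the first and last observation index of group j. The point is that the random intercepts themselves are the sliced argument, so each partial sum only ever receives the alpha values for its own block of groups:

```stan
functions {
  // alpha_slice = alpha[start:end]: only the random intercepts this
  // partial sum is responsible for; the data are assumed sorted by group,
  // with g_start[j]/g_end[j] giving the first/last observation of group j
  real partial_sum_groups(array[] real alpha_slice, int start, int end,
                          array[] int y, vector x,
                          array[] int g_start, array[] int g_end,
                          real beta) {
    real lp = 0;
    for (j in start:end) {
      int a = g_start[j];
      int b = g_end[j];
      lp += bernoulli_logit_lpmf(y[a:b] | alpha_slice[j - start + 1]
                                          + beta * x[a:b]);
    }
    return lp;
  }
}
data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=0, upper=1> y;       // sorted by group
  vector[N] x;                            // sorted by group
  array[J] int<lower=1, upper=N> g_start;
  array[J] int<lower=1, upper=N> g_end;
}
parameters {
  vector[J] alpha;                        // random intercepts
  real beta;
  real<lower=0> sigma_alpha;
}
model {
  int grainsize = 1;                      // 1 = let the scheduler choose
  alpha ~ normal(0, sigma_alpha);
  beta ~ normal(0, 1);
  sigma_alpha ~ normal(0, 1);
  target += reduce_sum(partial_sum_groups, to_array_1d(alpha), grainsize,
                       y, x, g_start, g_end, beta);
}
```

Whether slicing over the groups like this actually beats the simple observation-sliced version depends on how much parameter copying you save relative to the extra bookkeeping, so I would benchmark it against the simple version first.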

When you used to work with map_rect, you had to put thought into sharding… but really, you can forget about the sharding business now! The sizes of the partial sums are chosen fully automatically by the Intel TBB library.

I am glad you like it… I would be curious what your speedup ends up being.