Understanding reduce_sum efficiency

Hi all,
I’m working with a large model (~200K parameters) that is breaking some of my intuitions about reduce_sum. I can share the model and data if necessary, but I think it might be beside the point.

The behavior that I find counterintuitive is the combination of the following:

  • Increasing the number of cores provides very modest speedup beyond two cores per chain. If I reduce_sum with grainsize = 1, I actually see slowdown beyond two cores. If I reduce_sum_static with any of a range of reasonable grainsizes, I get speedups beyond two cores, but they’re small, and only through very judicious choice of grainsize can I do (barely) better on three cores than what I get on two cores with regular reduce_sum and grainsize = 1 (the calls I’m comparing are sketched just after this list). This doesn’t necessarily surprise me: the parameter vector is humongous, so it makes sense that the overhead for parallelization would be large. However…
  • When run on a single core, reduce_sum_static does not get appreciably slower as I decrease grainsize (even by two orders of magnitude!). I was under the impression that testing different grainsizes on a single core was a good way to assess the overhead (see here: Running brms models with within-chain parallelization), but in my case it doesn’t seem to reveal a big overhead burden.
  • At the same time, profiling indicates that the call to reduce_sum is responsible for essentially all of the wall time, so the lack of speed gains appears to be an overhead issue rather than an Amdahl’s Law issue.
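
For reference, the two calls I’m comparing have roughly this shape (a minimal sketch with placeholder names, not my actual model; the real likelihood is far more involved):

```stan
functions {
  real partial_sum(int[] y_slice, int start, int end, vector eta) {
    // log-likelihood contribution of rows start:end; eta is a shared argument
    return bernoulli_logit_lpmf(y_slice | eta[start:end]);
  }
}
data {
  int<lower=1> N;
  int<lower=0, upper=1> y[N];
  vector[N] x;
  int<lower=1> grainsize;
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] eta = alpha + beta * x;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  // dynamic scheduling: with grainsize = 1 the scheduler chooses the chunk sizes
  target += reduce_sum(partial_sum, y, 1, eta);
  // deterministic chunking with an explicit grainsize (what I vary in the grainsize scan):
  // target += reduce_sum_static(partial_sum, y, grainsize, eta);
}
```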

Does anyone have any insights as to what might be limiting my speedups, despite the latter two bullets above?


@wds15


Which Stan version are you using?

The bottleneck for reduce_sum is the shared parameters. If you can move the per-block parameters into the first argument, which gets sliced, then that's better.

Other than that… showing the model in as simple a form as possible usually makes it easier to say more.

EDIT: With Stan 2.26.0, reduce_sum was refactored to be less sensitive to the shared/sliced parameter issue, and as far as I can tell from brms programs, it is. So an easy solution for you could be to move to 2.26.1 if you have not done so already.


Thanks for the look!

I’m already on Stan 2.26.1, and I’ve noticed myself that reduce_sum is quite insensitive to shared vs. sliced data (maybe this was always true), but I haven’t looked too closely at the parameters. My response variable is an integer and I was slicing that (so it’s awkward to pass the integer response and the real-valued parameters in a single container for slicing; I think it would require a routine to coerce reals to integers). An alternative might be to slice my longest parameter (a random intercept with ~200K levels; the next-largest parameter has ~50K levels) and then use some clever indexing to chunk up the data appropriately. I imagine I could do better with clever sharding in map_rect, but I don’t want to go there.

I don’t expect anyone to dig through my model itself (the Stan file is ~600 lines; I built the model incrementally, but I didn’t notice at what point in development reduce_sum stopped yielding big gains). I will try to tinker a bit and see if I can get a manageable reprex, but I’m not necessarily optimistic. If anybody wants to see it, the most human-readable version of the model that I have (I have another version that is slightly faster but much harder to follow) is here: colombiaBeta/occupancy_v7.stan at master · jsocolar/colombiaBeta · GitHub. Data are shareable, but only via private(ish) channels. If anybody does dig into the model, one thing to note is that although the likelihood is cheap (Bernoulli), it is not vectorizable. The data are roughly 600K rows, and by far the longest parameter is b0_spCl, a random intercept term with about 200K levels.

In the meantime, I’m curious whether you have any abstract insights about what else, beyond Amdahl’s law and overhead detectable by decreasing the grainsize on a single core, can stand in the way of speedups in general.

Lastly, I’ll mention that the model is just barely “fast enough” as is, so it’s not a huge deal to me if I don’t get to the bottom of this.


Where you put the data won't really matter.

… but if you have 200k intercepts then slice over them in the first argument of reduce_sum!
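
Roughly like this (a bare-bones sketch with made-up names; it assumes your rows are sorted so that each intercept level's observations are contiguous, and your real likelihood obviously has more structure than a plain Bernoulli regression):

```stan
functions {
  real partial_groups(real[] b0_slice, int start, int end,
                      data int[] y, data vector x,
                      data int[] grp_start, data int[] grp_end,
                      real beta) {
    real lp = 0;
    for (g in start:end) {
      int j = g - start + 1;                 // position of group g within the slice
      for (n in grp_start[g]:grp_end[g]) {   // rows belonging to group g
        lp += bernoulli_logit_lpmf(y[n] | b0_slice[j] + beta * x[n]);
      }
    }
    return lp;
  }
}
data {
  int<lower=1> N;                       // ~600K rows, grouped by intercept level
  int<lower=1> G;                       // ~200K intercept levels
  int<lower=0, upper=1> y[N];
  vector[N] x;
  int<lower=1, upper=N> grp_start[G];   // first row of each group
  int<lower=1, upper=N> grp_end[G];     // last row of each group
  int<lower=1> grainsize;
}
parameters {
  real beta;
  real b0[G];                           // the big random intercept, now the sliced argument
}
model {
  beta ~ normal(0, 1);
  b0 ~ normal(0, 1);
  target += reduce_sum(partial_groups, b0, grainsize,
                       y, x, grp_start, grp_end, beta);
}
```

This way the huge parameter is partitioned across the parallel tasks instead of being copied to every one of them as a shared argument.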


If I may, I was wondering about this same question, except in the context of aggregated models, such as continuous models with weights or count models with trials or exposures. In theory these allow for more efficient evaluation by telling the sampler that many observations share identical characteristics, so there is no need to consider them one by one. On the other hand, I could envision a situation where, if the scheduler is not sensitive to this, a few threads could end up with highly imbalanced numbers of underlying observations even if similar numbers of rows were assigned to each thread.

Generally speaking, could the use of aggregation inside a model affect (perhaps even in a good way) the efficiency with which reduce_sum tackles these tasks or divides them among various threads? And if so, are there any general best practices with grainsize, dataset organization / row randomization, or anything similar that you recommend we keep in mind as a result?
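
To make the question concrete, I have in mind something like the following (a rough sketch with invented names), where each row is an aggregate that carries its own trials count:

```stan
functions {
  real partial_sum(int[] y_slice, int start, int end,
                   data int[] trials, vector eta) {
    // each sliced element is one aggregated row: y successes out of trials[n] tries
    return binomial_logit_lpmf(y_slice | trials[start:end], eta[start:end]);
  }
}
data {
  int<lower=1> N;              // number of aggregated rows, not raw observations
  int<lower=0> trials[N];      // trials (or exposure) per aggregated row
  int<lower=0> y[N];           // successes per aggregated row
  vector[N] x;
  int<lower=1> grainsize;
}
parameters {
  real alpha;
  real beta;
}
transformed parameters {
  vector[N] eta = alpha + beta * x;
}
model {
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  target += reduce_sum(partial_sum, y, grainsize, trials, eta);
}
```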

Have a look at the last bullet here:

https://cran.r-project.org/web/packages/brms/vignettes/brms_threading.html#quick-summary

Do not sort your data!

Bad example: slicing over groups of things and sorting the data set by the number of observations per group.
Good example: slicing over the groups of a hierarchical model, with the data set sorted group by group and the groups occurring in random order.

Could you illustrate what you mean with an example? I understand the advice about slicing over groups, but, focusing purely on how the rows are arranged within the dataset, I don’t understand how the groups would then be sorted “group-by-group” “in random order” in the dataset to benefit from properly slicing over groups of things. I can certainly randomly reshuffle the rows (and tend to do so anyway in an abundance of caution), but this suggests to me something more purposeful.

Let’s say I have a hierarchical model, with every row having some number of observations of the various combinations of:

  • Group A (200 members)
  • Group B (750 members)
  • Group C (1250 members)
  • Group D (2000 members)
  • Group E (5000 members)

Are you saying that there is a thoughtful way to reshuffle the rows, preferably with some well-known function, in an otherwise pseudo-random way that is mindful of this group structure, given the vastly different numbers of members in each group? Apologies in advance if I am just missing the point. Thanks!

Letters are rows:

Bad: ABBCCCDDDDEEEEE (blocks sorted in increasing size)
Good: AEEEEECCCBBDDDD (blocks in random order)

Thanks, this is helpful.

So the block sorting is “random” in the sense that there is no particular reason for the way the blocks end up being sequenced, but they nonetheless end up in some overall sequence. For dataset construction purposes then, you can just pick an order that is a mixture of group sizes and probably end up with something reasonably optimal, if I am reading you correctly. In terms of exactly which group order ends up being truly optimal within that general approach, that’s something the user has to test and discover for themselves. All true?

Correct.

Don’t waste your time fiddling with the “optimal” sorting. Concentrate on modeling your problem (unless you are tackling a real-time, high-performance problem, that is). It really should not matter all that much as long as you are getting the basics right.