Grainsize estimation for multi-threaded chains (reduce_sum): a discussion for heterogeneous computing

By heterogeneous computing, I mean chips that mix performance and efficiency cores; the most common examples are Apple Silicon chips and, more recently, other processors with the same design. Let’s come back to that.

For choosing grainsize generally, the Parallelization section of the Stan User’s Guide suggests two approaches: 1) set grainsize = 1 and let the scheduler figure it out, or 2) start with observations / threads per chain, time a short run of the model, divide by two, and repeat until satisfied.

I’ve never actually been sure what the first option precisely means (and I think the implementation has evolved) … unless it simply means that each observation is literally sliced individually and the OS scheduler pushes each observation separately to whatever core is available. If so, that seems close to optimal only for a subset of observation/parameter setups. If something else is happening (for example, if optimal slice sizes are being estimated), it would be great to understand the algorithm or method.

I normally just choose the second option. When I can assume that the compute time within each slice is roughly the same and that the time to increment the log likelihood dominates the memory copy time, I typically leave it at that, which should minimize time spent copying while maximizing use of each core.
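The halving procedure from the User’s Guide can be sketched as a simple search. This is only an illustration: `time_model` stands in for actually timing a Stan run, and the toy timing curve is invented for the example.

```python
# Sketch of the "divide by two and re-test" heuristic (illustration only).
# `time_model` is a hypothetical stand-in for timing a real Stan run.

def halving_search(n_obs, threads_per_chain, time_model, max_steps=6):
    """Start at n_obs / threads_per_chain and halve until timing stops improving."""
    grainsize = max(1, n_obs // threads_per_chain)
    best_gs, best_t = grainsize, time_model(grainsize)
    for _ in range(max_steps):
        grainsize = max(1, grainsize // 2)
        t = time_model(grainsize)
        if t >= best_t:      # no further improvement: stop
            break
        best_gs, best_t = grainsize, t
    return best_gs

# Toy timing curve with a minimum near grainsize 64 (invented for illustration):
toy = lambda g: abs(g - 64) + 64
print(halving_search(4096, 32, toy))  # → 64
```

In practice each call to `time_model` is a short real run of the model, which is why this loop can be tedious to repeat.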

It may be useful for the documentation to offer a few examples of when we can make these assumptions and, for that matter, a few examples of when it may really make sense to set grainsize = 1.

But once we have made the assumption to start with observations / threads per chain, it seems we can further optimize to make use of both performance and efficiency cores.

The Heterogeneous Computing Challenge

Apple Silicon processors like the M3 Ultra have heterogeneous cores: 24 high-performance cores and 8 efficiency cores. Based on architectural analysis by Howard Oakley of The Eclectic Light Company[1], the efficiency cores are estimated to deliver roughly 70% of the computational performance of the performance cores for mathematical workloads, though this awaits empirical verification with computational benchmarks.

The standard approach of setting

\text{grainsize} = \frac{\text{total observations}}{\text{threads per chain}}

creates equal-sized chunks that lead to suboptimal performance. The issue is straightforward: if we have 32 threads processing equal chunks, the 24 fast performance cores will finish their work and then wait idle while the 8 slower efficiency cores are still processing. (We could, of course, ignore the efficiency cores).
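The idle-time cost of equal chunks is easy to quantify under the assumed ~70% relative speed of the efficiency cores (all numbers below are assumptions from the text, not measurements):

```python
# Rough illustration: with equal chunks, the slower efficiency cores set the
# finish time and the performance cores wait. Speeds are assumed estimates.

P_CORES, E_CORES = 24, 8   # M3 Ultra core counts
E_SPEED = 0.70             # assumed E-core speed relative to a P core

chunk = 1.0                # equal work per core (arbitrary units)
t_p = chunk / 1.0          # P-core finish time
t_e = chunk / E_SPEED      # E-core finish time -> the bottleneck
idle_fraction = (t_e - t_p) / t_e
print(f"P cores idle for {idle_fraction:.0%} of the run")  # → ~30%
```

So under these assumptions the 24 performance cores are idle for roughly the last 30% of each `reduce_sum` evaluation.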

A Mathematical Framework for Optimization

Rather than simply accepting the naive chunking approach, we can formulate this as an optimization problem:

\text{grainsize}^* = \underset{\text{grainsize}}{\arg\min} \left[ T_{\text{total}}(\text{grainsize}) \right]

where:

T_{\text{total}} = T_{\text{computation}} + T_{\text{memory copies}} + T_{\text{idle time}}

As grainsize decreases (creating more, smaller chunks):

  • T_{\text{memory copies}} increases (linearly?) with number of chunks
  • T_{\text{idle time}} decreases (better load balancing)
  • T_{\text{computation}} remains roughly constant
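A minimal numeric sketch of this cost model, with invented constants chosen only so the trade-off is visible (the linear copy cost and the idle term shrinking with chunk count are the assumptions stated above):

```python
# Minimal sketch of T_total = T_computation + T_memory_copies + T_idle_time.
# All constants are invented for illustration; only the shapes matter.

def total_time(n_chunks, compute=100.0, copy_per_chunk=0.01, idle_base=900.0):
    t_copy = copy_per_chunk * n_chunks   # assumed linear in chunk count
    t_idle = idle_base / n_chunks        # assumed to shrink as chunks multiply
    return compute + t_copy + t_idle

# More chunks trade idle time for copy overhead; the minimum sits in between:
for n in (32, 296, 1024):
    print(n, round(total_time(n), 2))
```

With these made-up constants the minimum lands near a few hundred chunks, which is the regime the rest of the post explores.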

Finding the Zero-Idle-Time Boundary

It seems we can calculate the theoretical boundary where T_{\text{idle time}} = 0 by finding integer chunk ratios that match the performance differences.

Ideally, we would want the slower efficiency cores to handle as few chunks as possible (optimally just 1 chunk each) since they create the bottleneck that determines total execution time. However, the constraint that all cores must finish simultaneously limits our options to integer ratios that maintain load balance.

For the M3 Ultra’s performance ratio of approximately 1.43×, the optimal integer relationship is:

  • Performance cores complete 10 chunks each
  • Efficiency cores complete 7 chunks each
  • In roughly the same time (since \frac{10}{7} \approx 1.43)

This gives us:

\text{Total chunks} = (24 \times 10) + (8 \times 7) = 296 \text{ chunks}
\text{grainsize}_{\text{zero-idle}} = \frac{\text{total observations}}{296}
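The arithmetic above can be reproduced directly; `n_obs` below is a hypothetical data size, and the 10/7 ratio is the integer approximation of the assumed ~1.43× speed estimate:

```python
from fractions import Fraction

# Zero-idle chunk counts for the M3 Ultra (speed ratio is an assumed estimate).
P_CORES, E_CORES = 24, 8
ratio = Fraction(10, 7)    # integer approximation of the ~1.43x P/E speed ratio

k_p, k_e = ratio.numerator, ratio.denominator  # chunks per P core, per E core
total_chunks = P_CORES * k_p + E_CORES * k_e
print(total_chunks)        # → 24*10 + 8*7 = 296

n_obs = 100_000            # hypothetical data size
grainsize = n_obs // total_chunks
print(grainsize)           # → 337
```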

The Memory Copy Trade-off

This optimization comes with significant memory overhead:

Standard approach:

  • grainsize = N / 32, creating 32 total chunks
  • Number of memory copy operations = 32

Zero-idle-time approach (once the cores’ relative speeds are known):

  • grainsize = N / 296, creating 296 total chunks
  • Number of memory copy operations = 296
  • Memory copy increase: \frac{296}{32} = 9.25× more memory operations

This 9× increase in memory copies emphasizes why this optimization only provides net benefits when the computational work per chunk significantly exceeds the memory management overhead.

Theoretical Performance Gain and Discrete Constraints

At the zero-idle-time boundary, the theoretical speedup is:

\text{Speedup} = \frac{P \times k_p + E \times k_e}{k_e \times (P + E)} = \frac{24 \times 10 + 8 \times 7}{7 \times 32} = \frac{296}{224} = 1.32

However, the optimization space is constrained to discrete, feasible solutions. For efficient work distribution, the total number of chunks should ideally be a multiple of 32 (total cores) for the example chip. Our theoretical solution of 296 chunks equals 9.25 chunks per core, suggesting practical candidates of:

  • 288 chunks (9 \times 32): grainsize = N / 288
  • 320 chunks (10 \times 32): grainsize = N / 320
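Plugging the numbers into the speedup formula, and rounding the 296-chunk solution to the nearby multiples of 32 (core counts from the text; the underlying speed ratio is still the assumed estimate):

```python
# Theoretical speedup at the zero-idle boundary, and practical chunk counts.
P, E = 24, 8               # performance / efficiency core counts
k_p, k_e = 10, 7           # chunks per P core / per E core at zero idle

speedup = (P * k_p + E * k_e) / (k_e * (P + E))
print(round(speedup, 2))   # → 296 / 224 = 1.32

# Candidates near 296 that divide evenly across all 32 cores:
candidates = [m * (P + E) for m in (9, 10)]
print(candidates)          # → [288, 320]
```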

When This Optimization Applies

This approach provides benefits when computation time dominates memory copy overhead. For Stan models, this typically applies to likelihood evaluations involving substantial mathematical operations per observation. The optimization may not be beneficial for models where memory bandwidth is the primary bottleneck, or where per-observation computation is extremely lightweight.

Questions for Empirical Testing

My logic or math may very well be flawed or incomplete in some significant way, so please chime in.

Has anyone experimented with, or set up, empirical studies of the above ideas? Anecdotally, I’ve recently applied this logic on both an M2 Ultra and an M3 Ultra for very large models, to good effect.

Or does anyone have additional math that could help with practical guidance, especially guidance that avoids having to divide by X and re-test each time (which would be feasible only for certain types of problems)?

[1]: Howard Oakley, “Apple silicon: 1 Cores, clusters and performance,” The Eclectic Light Company, February 19, 2024.


