Any intuitions/results on when GPU>reduce_sum?

This topic arose in advising a user seeking to accelerate their sampling using cloud services. At first I suggested they look into reduce_sum(), but then remembered that the GPU crew implemented accelerators for the likelihood computation for GLMs. Do we have any intuitions or results yet on when it would be advantageous to use one over the other (presumably in the context of GLMs)?

I see from the GPU paper that they seem to max out at 10x speedups, so would the advice be as simple as: if the GPU is more expensive to rent than 10 cores, go with reduce_sum()? Or does reduce_sum() similarly have an intercept-and-diminishing-returns curve that needs to be taken into account? (I’m thinking surely this is the case, as we rarely get parallelism for free in computing.)
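
One rough way to picture the diminishing-returns curve in question is Amdahl's law. This is a generic textbook model, not a measurement of reduce_sum(); the function name and the 90% parallel fraction below are assumptions chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the runtime scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


if __name__ == "__main__":
    p = 0.9  # assumed fraction of per-iteration work that parallelizes
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:>3} cores: {amdahl_speedup(p, n):5.2f}x")
    # Under this model the speedup can never exceed 1 / (1 - p),
    # no matter how many cores are added.
    print(f"upper bound: {1.0 / (1.0 - p):.2f}x")
```

Under this model, ten cores give well under a 10x speedup unless nearly all of the work parallelizes, so comparing rental prices core for core is likely optimistic for the CPU side.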


It’s really hard to guess which of them would be faster. I kind of doubt anyone has intuition on this at this point, since both of these features are fairly new. It is an interesting question though! You definitely got me wondering :)

The GLM is very memory-bound so I am curious how reduce_sum, which is known to improve caching, affects this.

It’s actually a max of 10x for k = 10; it maxes out at 30x for k = 2000 and n = 10000 (see bottom right corner of Figure 6, the bottom one in the pasted image). Also note this is 4 chains on a CPU vs 4 chains on a GPU, so multiply the cores by 4.
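
To make the "multiply the cores by 4" arithmetic concrete, here is a small sketch that converts a reported GPU speedup into the number of CPU cores needed to match it. It assumes the baseline is one core per chain and that within-chain scaling follows Amdahl's law, speedup = 1 / ((1 - p) + p / n); the function name and the parallel fractions are hypothetical, not measured values for reduce_sum().

```python
from typing import Optional


def cores_to_match(gpu_speedup: float, chains: int = 4,
                   parallel_fraction: float = 1.0) -> Optional[float]:
    """Total CPU cores needed to match `gpu_speedup` across `chains` chains.

    Returns None when the target speedup is unreachable at any core count.
    """
    serial = 1.0 - parallel_fraction
    headroom = 1.0 / gpu_speedup - serial
    if headroom <= 0.0:
        return None  # target is at or above the 1 / serial ceiling
    cores_per_chain = parallel_fraction / headroom
    return chains * cores_per_chain


if __name__ == "__main__":
    print(cores_to_match(10))                          # perfect scaling
    print(cores_to_match(10, parallel_fraction=0.95))  # assumed 5% serial work
    print(cores_to_match(30, parallel_fraction=0.95))  # assumed 5% serial work
```

With perfect scaling a 10x figure corresponds to 40 cores; with even a small serial fraction the count rises quickly, and a 30x figure may be out of reach for CPU parallelism altogether.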

reduce_sum() definitely does have a diminishing-returns curve as well.

This is really a tough one to call. The parameters that matter here are:

  • input size (n and k)
  • number of cores on the CPU
  • size and speed of the CPU caches
  • how fast the GPU is

My rule of thumb would be that if X and Y are data, a GLM is your bottleneck and you have a decent GPU, the GPU will be faster for most cases. Plus you don’t have to rewrite your model to try it out.

For other cases, I have no idea ATM, unfortunately.

p.s.: code to replicate for anyone interested:
I do still have to check the cmdstanr scripts; there has been a lot of activity in cmdstanr, so hopefully this didn’t break.


The other thing to note here is that those plots are from a Titan Xp and an AMD Radeon, which are consumer GPUs. If you are using something like an Nvidia V100 on the cloud, those speedups can be much bigger. @rok_cesnovar I remember we did that at some point in time, didn’t we? I thought the speedups were like 4x over the Titan Xp but can’t remember if that’s correct.


Indeed, somewhere along those lines. We used it for the GP example. But V100 is not for everyone’s wallet :)