Any intuitions/results on when GPU>reduce_sum?

This topic arose while advising a user who was seeking to accelerate their sampling using cloud services. At first I suggested they look into reduce_sum(), but then I remembered that the GPU crew implemented accelerators for the likelihood computation for GLMs. Do we have any intuitions or results yet on when it would be advantageous to use one over the other (presumably in the context of GLMs)?

I see from the GPU paper that they seem to max out at 10x speedups, so would the advice be as simple as: if the GPU is more expensive to rent than 10 cores, go with reduce_sum()? Or does reduce_sum() similarly have an intercept-and-diminishing-returns curve that needs to be taken into account (I’m thinking surely this is the case, as rarely do we get parallelism for free in computing).
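To make the comparison concrete, here is a minimal sketch of the two options for a logistic regression likelihood, using 2.25-era syntax; the partial_sum name and the priors are my own, and only one of the two likelihood statements should be active at a time:

```stan
functions {
  // hypothetical slice function for reduce_sum (the name is my own)
  real partial_sum(int[] y_slice, int start, int end,
                   matrix X, real alpha, vector beta) {
    return bernoulli_logit_lpmf(y_slice | alpha + X[start:end] * beta);
  }
}
data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  // Option A: a single GLM call; the OpenCL backend can offload this to the GPU
  y ~ bernoulli_logit_glm(X, alpha, beta);
  // Option B: slice the sum over CPU threads instead (comment out A first)
  // target += reduce_sum(partial_sum, y, 1, X, alpha, beta);
}
```

Option A needs a build with STAN_OPENCL=true to reach the GPU; Option B needs STAN_THREADS=true plus the STAN_NUM_THREADS environment variable set at runtime.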


It’s really hard to guess which of them would be faster. I kind of doubt anyone has intuition on this at this point; both of these features are fairly new. It is an interesting question though! You definitely got me wondering :)

The GLM is very memory-bound, so I am curious how reduce_sum, which is known to improve caching, affects this.

It’s actually a max of 10x for k = 10; it maxes out at 30x for k = 2000 and n = 10000 (see the bottom-right corner of Figure 6, the bottom plot in the pasted image). Also note this is 4 chains on a CPU vs. 4 chains on a GPU, so multiply the cores by 4.

It definitely does.

This is really a tough one to call. The relevant parameters here are:

  • input size (n and k)
  • num of cores on the CPU
  • size and speed of CPU caches
  • speed of the GPU

My rule of thumb would be that if X and Y are data, a GLM is your bottleneck and you have a decent GPU, the GPU will be faster for most cases. Plus you don’t have to rewrite your model to try it out.

For other cases, I have no idea ATM, unfortunately.
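For what it’s worth, trying out the GPU path really does require no model changes; here is a sketch of the CmdStan make/local settings I believe are needed (the platform/device IDs are placeholders for your setup):

```make
# in cmdstan's make/local: enable the OpenCL backend; no model changes needed
STAN_OPENCL=true
# optional: pick the platform/device if the machine has more than one
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0
```

After setting these, rebuild the model binary so it is compiled against the OpenCL backend.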

p.s.: code to replicate for anyone interested:
I do have to look at the cmdstanr scripts; there was a lot of recent activity in cmdstanr, so hopefully this didn’t break.


The other thing to note here is that those plots are from a Titan Xp and an AMD Radeon, which are consumer GPUs. If you’re using something like an Nvidia V100 on the cloud, those speedups can be much bigger. @rok_cesnovar I remember we did that at some point, didn’t we? I thought the speedups were something like 4x over the Titan Xp, but I can’t remember if that’s correct.


Indeed, something along those lines. We used it for the GP example. But a V100 is not for everyone’s wallet :)

What happens if you combine reduce_sum with gpu support? Will it offload matrix calculations to the GPU within threads? Or does that just get turned off?

Yes, it would create N separate calls to the GPU, which could improve the GPU’s throughput.
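If I understand correctly, the combination would look something like the sketch below: each reduce_sum slice evaluates a GLM lpmf on its rows, and with the OpenCL backend enabled each of those calls can be offloaded to the GPU. The partial_glm name is my own, and this assumes a build with both STAN_THREADS and STAN_OPENCL enabled:

```stan
functions {
  // each slice evaluates the GLM likelihood on its rows; with OpenCL
  // enabled, each of these calls may be offloaded to the GPU
  real partial_glm(int[] y_slice, int start, int end,
                   matrix X, real alpha, vector beta) {
    return bernoulli_logit_glm_lpmf(y_slice | X[start:end], alpha, beta);
  }
}
data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  // grainsize 1 leaves the slice sizes to the scheduler
  target += reduce_sum(partial_glm, y, 1, X, alpha, beta);
}
```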

The list of supported functions for 2.25 is unfortunately quite limited (GLMs plus cholesky_decompose, mdivide_left_tri, and multiply, essentially).

The next release will have quite a few more (scroll down to Using the OpenCL backend here), plus the backend is used much more efficiently now.


Fascinating. Though this would only work in shared-memory parallelism, right? With MPI there would have to be a GPU present on the worker.

It’s very exciting indeed. Finally, after 2.5+ years, we should have an exciting release for those looking at GPU support.

Yes, with MPI the worker would have to have a GPU present.

What we are also seeing is that running multiple chains on a single GPU increases the speedup compared to the CPU, i.e. time(4 chains CPU)/time(4 chains GPU) >> time(1 chain CPU)/time(1 chain GPU). More on that is to come in a docs section/chapter, together with the code, for the 2.26 release in late January.


This is pretty cool. I have an excellent use case for GPU + reduce_sum, so I look forward to 2.26. In fact, I’m going to start working on purchasing a GPU so I can make use of it.


If by any chance you have the model/data that you can share I would be interested in trying that out on the development version.