Any intuitions/results on when GPU>reduce_sum?

This topic arose while advising a user who was seeking to accelerate their sampling using cloud services. At first I suggested they look into reduce_sum(), but then I remembered that the GPU crew implemented accelerators for the likelihood computation for GLMs. Do we have any intuitions or results yet on when it would be advantageous to use one over the other (presumably in the context of GLMs)?

I see from the GPU paper that they seem to max out at 10x speedups, so would the advice be as simple as: if the GPU is more expensive to rent than 10 cores, go with reduce_sum()? Or does reduce_sum() similarly have an intercept-and-diminishing-returns curve that needs to be taken into account (I’m thinking surely this is the case, as rarely do we get parallelism for free in computing).
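To make the comparison concrete, here is a minimal sketch of the two options for a logistic regression likelihood, using 2.25-era syntax; the partial_sum name and the priors are my own, and only one of the two likelihood statements should be active at a time:

```stan
functions {
  // hypothetical slice function for reduce_sum (the name is my own)
  real partial_sum(int[] y_slice, int start, int end,
                   matrix X, real alpha, vector beta) {
    return bernoulli_logit_lpmf(y_slice | alpha + X[start:end] * beta);
  }
}
data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  // Option A: a single GLM call; the OpenCL backend can offload this to the GPU
  y ~ bernoulli_logit_glm(X, alpha, beta);
  // Option B: slice the sum over CPU threads instead (comment out A first)
  // target += reduce_sum(partial_sum, y, 1, X, alpha, beta);
}
```

Option A needs a build with STAN_OPENCL=true to reach the GPU; Option B needs STAN_THREADS=true plus the STAN_NUM_THREADS environment variable set at runtime.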


It’s really hard to guess which of them would be faster. I kind of doubt anyone has intuition on this at this point; both of these features are fairly new. It is an interesting question though! You definitely got me wondering :)

The GLM is very memory-bound, so I am curious how reduce_sum, which is known to improve caching, affects this.

It’s actually a max of 10x for k = 10; it maxes out at 30x for k = 2000 and n = 10000 (see the bottom-right corner of Figure 6, the bottom plot in the pasted image). Also note this is 4 chains on a CPU vs. 4 chains on a GPU, so multiply the cores by 4.

It definitely does.

This is really a tough one to call. The relevant parameters here are:

  • input size (n and k)
  • num of cores on the CPU
  • size and speed of CPU caches
  • speed of the GPU

My rule of thumb would be that if X and Y are data, a GLM is your bottleneck and you have a decent GPU, the GPU will be faster for most cases. Plus you don’t have to rewrite your model to try it out.

For other cases, I have no idea ATM, unfortunately.
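For what it’s worth, trying out the GPU path really does require no model changes; here is a sketch of the CmdStan make/local settings I believe are needed (the platform/device IDs are placeholders for your setup):

```make
# in cmdstan's make/local: enable the OpenCL backend; no model changes needed
STAN_OPENCL=true
# optional: pick the platform/device if the machine has more than one
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0
```

After setting these, rebuild the model binary so it is compiled against the OpenCL backend.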

p.s.: code to replicate for anyone interested:
I do have to look at the cmdstanr scripts; there was a lot of recent activity in cmdstanr, so hopefully this didn’t break.


The other thing to note here is that those plots are from a Titan Xp and an AMD Radeon, which are consumer GPUs. If you’re using something like an Nvidia V100 on the cloud, those speedups can be much bigger. @rok_cesnovar I remember we did that at some point, didn’t we? I thought the speedups were something like 4x over the Titan Xp, but I can’t remember if that’s correct.


Indeed, something along those lines. We used it for the GP example. But a V100 is not for everyone’s wallet :)

What happens if you combine reduce_sum with gpu support? Will it offload matrix calculations to the GPU within threads? Or does that just get turned off?

Yes, it would create N separate calls to the GPU, which could improve the GPU’s throughput.
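If I understand correctly, the combination would look something like the sketch below: each reduce_sum slice evaluates a GLM lpmf on its rows, and with the OpenCL backend enabled each of those calls can be offloaded to the GPU. The partial_glm name is my own, and this assumes a build with both STAN_THREADS and STAN_OPENCL enabled:

```stan
functions {
  // each slice evaluates the GLM likelihood on its rows; with OpenCL
  // enabled, each of these calls may be offloaded to the GPU
  real partial_glm(int[] y_slice, int start, int end,
                   matrix X, real alpha, vector beta) {
    return bernoulli_logit_glm_lpmf(y_slice | X[start:end], alpha, beta);
  }
}
data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;
  int<lower=0, upper=1> y[N];
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  // grainsize 1 leaves the slice sizes to the scheduler
  target += reduce_sum(partial_glm, y, 1, X, alpha, beta);
}
```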

The list of supported functions for 2.25 is unfortunately quite limited (GLMs plus cholesky_decompose, mdivide_left_tri, and multiply, essentially).

The next release will have quite a few more (scroll down to Using the OpenCL backend here), plus the backend is used much more efficiently now.


Fascinating. Though this would only work in shared-memory parallelism, right? With MPI there would have to be a GPU present on the worker.

It’s very exciting indeed. Finally, after 2.5+ years, we should have an exciting release for those looking at GPU support.

Yes, with MPI the worker would have to have a GPU present.

What we are also seeing is that running multiple chains on a single GPU increases the speedup compared to the CPU, i.e. time(4 chains CPU)/time(4 chains GPU) >> time(1 chain CPU)/time(1 chain GPU). More on that is to come in a docs section/chapter, together with the code, for the 2.26 release in late January.


This is pretty cool. I have an excellent use case for GPU + reduce_sum, so I look forward to 2.26. In fact, I’m going to start working on purchasing a GPU so I can make use of it.


If by any chance you have the model/data that you can share I would be interested in trying that out on the development version.