GPU Speedup Experiences & Hardware

Hi all,

I’m just wondering about others’ experiences with Stan on the GPU. I run most of my models threaded via reduce_sum & find truly substantial slowdowns with more complicated models when I enable OpenCL on model trials with smaller data: the 500-1000 observation subsets I use for model development.

I can see the GPU working at around 50-60% utilisation, but the model runs maybe 5x slower when I enable OpenCL, with or without threading, & whether I leave the reduce_sum in place or flatten it. It’s mainly hierarchical normal_lpdf stuff, with the odd latent GP.

Having seen the write-ups of GPU performance I’m thinking it’s down to my setup: an old 32GB AMD FirePro W9100, rated at ~5200 GFlops for single precision, though it has an uncapped ~2600 GFlops for double precision & still gets solid double precision performance in benchmarks.

Does Stan really use the double precision? Theoretically this card should handily beat all the (double-precision-capped) consumer NVIDIA cards in FP64 unless they do some software emulation using single precision that I don’t know about.

Perhaps I’ll only see speedups with 50k-100k datasets? Perhaps I’m expecting too much? Just thought I’d ask for others’ experiences with GPU speedups.




I found that using the --O1 flag and reduce_sum was the best way to accelerate my big model. My guess is that indexing was actually the bottleneck and Stan was trying to do all of it on the CPU before sending matrices to the GPU for math.


Stan exclusively uses double-precision, yes. This is why AMD GPUs are typically much better in terms of performance as opposed to NVIDIA, which targets problems that require single-precision or even half-precision these days.

Yeah, I would not expect much speedup for 500-1000 observations, but that depends on where most of the time is spent.

At the moment (version 2.29), you should expect to see GPU speedup mainly if most of the gradient evaluation time is spent in lpdf functions. The biggest gains currently come from using a GLM lpdf/lpmf function (which obviously is not applicable everywhere). I should also note that as of now, the Stan model will use OpenCL only for lpdf/lpmfs that are not inside user-defined functions (that is, only for calls made directly in the transformed parameters and model blocks). This is a bit annoying, I understand, but that is how we were able to do it most reliably at this time.
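To make that concrete, here is a minimal sketch of a linear regression whose likelihood is a GLM lpdf called directly in the model block, which is the case described above as benefiting most from OpenCL. The data names (`N`, `K`, `X`, `y`) and the priors are placeholders for illustration, not taken from any model in this thread.

```stan
data {
  int<lower=1> N;          // number of observations (placeholder name)
  int<lower=1> K;          // number of predictors (placeholder name)
  matrix[N, K] X;          // design matrix
  vector[N] y;             // outcome
}
parameters {
  real alpha;              // intercept
  vector[K] beta;          // regression coefficients
  real<lower=0> sigma;     // residual scale
}
model {
  // Illustrative priors only.
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);

  // GLM density called directly in the model block, not inside a
  // user-defined function, so it is a candidate for the OpenCL backend.
  target += normal_id_glm_lpdf(y | X, alpha, beta, sigma);
}
```

The model still has to be compiled with OpenCL enabled (for example `STAN_OPENCL=true` in CmdStan's `make/local`) for any of this to run on the GPU.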

This should change in the next releases. The work in the backend was finished a while ago, but not everything has made it upstream to the Stan level just yet, as that requires a bit more careful consideration and testing. I am hoping to get to that now.

The biggest questions when it comes to GPU use are:

  • Is the model taking a lot of time because each gradient evaluation is slow, or because we are doing a ton of gradient evaluations (the tree depth numbers are big)? If you are doing a ton of iterations with gradient evaluations in the range of a few milliseconds, the GPU will not be of much use. reduce_sum might help there, but that is also not a sure thing.
  • Where are we spending most of the time in gradient evaluation (use profiling to find that out)? If it’s an lpdf/lpmf function, try to call it in the model block and the GPU should help there.
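As a rough illustration of the profiling step, the sketch below wraps separate parts of a small hierarchical model in `profile` blocks (available since Stan 2.26) so the time spent on indexing can be compared with the time spent in the lpdf. The model, the variable names, and the profile labels are all made up for the example.

```stan
data {
  int<lower=1> N;                          // observations (placeholder)
  int<lower=1> J;                          // groups (placeholder)
  array[N] int<lower=1, upper=J> group;    // group index per observation
  vector[N] y;
}
parameters {
  real mu0;
  real<lower=0> tau;
  real<lower=0> sigma;
  vector[J] z;                             // non-centred group effects
}
model {
  vector[N] mu;

  profile("priors") {
    mu0 ~ normal(0, 5);
    tau ~ exponential(1);
    sigma ~ exponential(1);
    z ~ std_normal();
  }

  profile("indexing") {
    // Expanding group effects to observation level; this part is
    // plain indexing and arithmetic, not an lpdf call.
    mu = mu0 + tau * z[group];
  }

  profile("likelihood") {
    target += normal_lpdf(y | mu, sigma);
  }
}
```

CmdStan writes the per-block timings to a profile CSV file after sampling, and CmdStanR exposes them through `fit$profiles()`. If the "likelihood" block dominates, the GPU is more likely to pay off; if "indexing" dominates, it probably will not.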

If the vast majority of time is not spent in the lpdf/lpmf functions, I would second what @darby suggests, using reduce_sum with the --O1 flag.
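For reference, a minimal reduce_sum sketch looks something like the following. The function name `partial_loglik`, the simple regression, and the `grainsize` data input are assumptions for the example rather than anything from the models discussed here.

```stan
functions {
  // Hypothetical partial-sum function: log density of one slice of y.
  // start and end index the slice's position within the full data.
  real partial_loglik(array[] real y_slice, int start, int end,
                      vector mu, real sigma) {
    return normal_lpdf(y_slice | mu[start:end], sigma);
  }
}
data {
  int<lower=1> N;
  array[N] real y;
  vector[N] x;
  int<lower=1> grainsize;   // slice size hint; 1 lets the scheduler choose
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu = alpha + beta * x;

  // Illustrative priors only.
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);

  // Slices of y are evaluated in parallel across threads.
  target += reduce_sum(partial_loglik, y, grainsize, mu, sigma);
}
```

With CmdStan this would be built with `STAN_THREADS=true` and `STANCFLAGS += --O1` in `make/local`, with the number of threads chosen when the model is run.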


I’m aware I didn’t post my code, so I really appreciate the open-minded & sensible answers here!

Thanks to @darby - I’d forgotten to switch on optimisation for my threaded example, so this may very well make my OpenCL performance look even worse!

It’s really helpful to see detailed information like this, thanks @rok_cesnovar. I’m running v2.29.1. I’d understood it needed double-precision, but wondered if I’d fallen behind on the news. I’d also been wondering whether I’m just getting unlucky with my models, or whether I’m writing code that performs well when threaded, but less well on the GPU, or whether it’s my GPU. At least I can tick off the hardware for now.

Treedepth numbers are OKish, but stable - no chains hitting 10, but 99.4% of iterations took exactly 8 steps, so this parameterisation will always be slow. It’s definitely a complex & flexible parameter space. No divergences or E-BFMI warnings, but I’d like to work out how to do mixed centering on correlated ‘random’ effect levels - this may really help speed. I’ll open a separate topic on that. Everything is vectorised, but there is still some iteration across a matrix in one step. I could probably change that to improve speed/make it more GPU friendly too, and make sure I’m using the GLMs where I can.

I’ll continue to see how this model performs & profiles when I start playing with something closer to the full dataset (easily 80k, & potentially up to many millions of rows if I was brave/rich enough), as I expect even threading to really hurt there. I’ll also check for the OpenCL gotcha of lpdf functions within UDFs - that could be a potential issue I could address.

Thanks again for the input, and for the update on the current state of OpenCL support. I just need to work out how to write models to get the most out of it!



For posterity, I am now strongly suspicious that the slowdowns I observed when enabling GPU support were caused by the consistent 2^7 to 2^8 (128 to 256) leapfrog steps per iteration, as @rok_cesnovar mentioned. Dumping that on the GPU wasn’t very fair.

Perhaps we should view the ‘failure’ to use OpenCL for UDFs as a feature: perhaps it means we can control which parts of the model run on a GPU vs CPU - e.g. push the Gaussian process code to the GPU while running the faster stuff on the CPU by wrapping it in a UDF? I’ll have to play with that a lot as I scale up my test data.
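A very rough, untested sketch of that idea is below. It relies entirely on the behaviour described earlier in the thread for 2.29 (lpdf calls inside user-defined functions are not dispatched to OpenCL), which is expected to change in later releases. The wrapper name `cpu_normal_loglik` and all data names are hypothetical, and a GLM stands in for the GP code to keep the example short.

```stan
functions {
  // Hypothetical wrapper: per the 2.29 behaviour described above, an
  // lpdf called inside a user-defined function stays on the CPU.
  real cpu_normal_loglik(vector y, vector mu, real sigma) {
    return normal_lpdf(y | mu, sigma);
  }
}
data {
  int<lower=1> N;          // large block of observations (placeholder)
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y_big;
  int<lower=1> M;          // small block of observations (placeholder)
  vector[M] y_small;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  // Illustrative priors only.
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);

  // Called directly in the model block: eligible for OpenCL.
  target += normal_id_glm_lpdf(y_big | X, alpha, beta, sigma);

  // Wrapped in a UDF: assumed to be evaluated on the CPU in 2.29.
  target += cpu_normal_loglik(y_small, rep_vector(alpha, M), sigma);
}
```

Profiling both versions (with and without the wrapper) would be the way to check whether the split actually helps on a given model and GPU.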