GPU Update: what's up and where we are going



You mean you have something like:

vector[30] lp[10000];

and for each 10,000 elements of the array, you want to do a log-sum-exp? That’s certainly enough to be worth parallelizing, though as I said, I have no idea how easy it would be to do on the GPU (@seantalts, @rok_cesnovar, or @Stevo15025 would know).

This isn’t just for overflow. It also gives you better leading digit accuracy because the most significant term is pulled out unmodified, though that’s relatively minor.


Looking at that, yes we can do strides for the max and sum, the exp and log are embarrassingly parallel at that point.

How is vector[30] lp[10000] stored in the math library?


std::vector<Eigen::Matrix<T, -1, 1>>


vector[30] lp[10000] is due to the current implementation in Stan.
From the data-generation process a Matrix [10000,30] would be fine.


That has different memory locality, because matrices are column major. So you’d probably want to do the transpose here to keep each of the vectors together in memory.


We have a PR in review for Cholesky decomposition(Thanks @rok_cesnovar and @Stevo15025!). There’s one thing that probably needs to be broadcast here - we’re thinking of overloading the existing stan::math::cholesky_decompose function such that it will run on the GPU if you have the CXXFLAG for GPU set. This happens here:

Here are some things to consider for this API:

  • User interface - we think having a single control point for GPU (adding CXXFLAGS+=-DSTAN_CL` to make/local) is probably the best place to start, though eventually we’d like to add an OpenCL annotation or type to the Stan language itself so that the user has fine-grained oversight what exactly is executed on the GPU.
  • Performance - Another PR adds a runtime pointer to Eigen matrices to the cl::Buffer on the GPU when the OpenCL compile flag is on. This means we can avoid extra copies for data matrices (e.g. currently once per leapfrog step). The opencl version of choleksy_decompose has a rough heuristic for matrix size below which it will just execute the CPU version.

The other favorite was to add a cholesky_decompose_gpu, but we think this has one significant disadvantage. The majority of time in GPU computation is often spent sending data to and from the GPU, so if you only allow explicit GPU functions, we won’t have the opportunity to use GPU versions when available. With the style of overloading above, we can stay in GPU computation land as long as possible before copying back to the CPU. With both versions, the result is copied back to CPU memory, but with this overloaded version we also keep a pointer to the buffer on the GPU.

We’ve discussed this a few times but I’m not sure if we ever explicitly aired it out on the discourse and had a conclusion, so I wanted to bring it up here. Any additional last-minute considerations welcome.


Okay nice! I think we could write a gpu kernel to do row-wise log_sum_exp in this case. We have a couple other things in the pipeline but can add this to our to-do list


Thanks for adding an update Sean!

@rok_cesnovar also posted a performance update for the GPU code on the Cholesky primitive PR that I’ll copy paste below

I finally managed to finish the performance tests. I played a bit with optimizing the block sizes.
N is the size of one dimension of the square input matrix.

The bottom line is that the Titan XP is up to 20 times faster than an i7-4790 CPU @ 3.60GHz.
The V100 is 70 times faster for N=20000 and from the figure it looks like this is not the upper limit yet.

The GPU outperforms the CPU at around N~800-900. For N=2000 the speedup is 3+. If we will be doing any smart logic to decide when to use the GPU or not, I would suggest 1500 or 2000 as the “border”.


@tadej has already volunteered to do this,I forgot to write it here.


One other thing we need to change. Currently users select the device and OpenCL platform at compile time, but we should allow these to be passed at runtime