and for each 10,000 elements of the array, you want to do a log-sum-exp? That’s certainly enough to be worth parallelizing, though as I said, I have no idea how easy it would be to do on the GPU (@seantalts, @rok_cesnovar, or @stevebronder would know).
This isn’t just for overflow. It also gives you better leading-digit accuracy because the most significant term is pulled out unmodified, though that’s relatively minor.
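To make the trick concrete, here’s a minimal C++ sketch (not Stan Math’s actual implementation) of log-sum-exp with the maximum factored out:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Numerically stable log-sum-exp: factor out the maximum m, so the
// largest term contributes exp(0) = 1 and nothing overflows. The
// leading digits of the result come from m itself, unmodified.
double log_sum_exp(const std::vector<double>& x) {
  double m = *std::max_element(x.begin(), x.end());
  double sum = 0.0;
  for (double v : x) sum += std::exp(v - m);
  return m + std::log(sum);
}
```

For example, `log_sum_exp({1000, 1000})` returns `1000 + log(2)` rather than overflowing, since `exp(1000)` is never evaluated directly.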
That has different memory locality, because matrices are column major. So you’d probably want to do the transpose here to keep each of the vectors together in memory.
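A sketch of the locality issue, assuming column-major storage like Eigen’s default: reading across a row strides through memory, so transposing first makes each row contiguous before a row-wise reduction.

```cpp
#include <cstddef>
#include <vector>

// Column-major access: element (r, c) of an n_rows x n_cols matrix.
// Consecutive r values are adjacent in memory; consecutive c values
// are n_rows apart, so row-wise traversal has poor locality.
inline double at(const std::vector<double>& m, std::size_t n_rows,
                 std::size_t r, std::size_t c) {
  return m[c * n_rows + r];
}

// Transpose (also stored column-major) so that what were rows of the
// original matrix become contiguous runs in memory.
std::vector<double> transpose(const std::vector<double>& m,
                              std::size_t n_rows, std::size_t n_cols) {
  std::vector<double> t(m.size());
  for (std::size_t c = 0; c < n_cols; ++c)
    for (std::size_t r = 0; r < n_rows; ++r)
      t[r * n_cols + c] = m[c * n_rows + r];
  return t;
}
```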
User interface - we think having a single control point for GPU (adding `CXXFLAGS += -DSTAN_CL` to make/local) is probably the best place to start, though eventually we’d like to add an OpenCL annotation or type to the Stan language itself so that the user has fine-grained control over what exactly is executed on the GPU.
Performance - Another PR attaches a runtime pointer to the cl::Buffer on the GPU to Eigen matrices when the OpenCL compile flag is on. This means we can avoid extra copies for data matrices (e.g. currently once per leapfrog step). The OpenCL version of cholesky_decompose has a rough heuristic for the matrix size below which it will just execute the CPU version.
The other candidate was to add an explicit cholesky_decompose_gpu, but we think this has one significant disadvantage. The majority of time in GPU computation is often spent sending data to and from the GPU, so if we only allow explicit GPU functions, we won’t have the opportunity to use GPU versions when available. With the style of overloading above, we can stay in GPU computation land as long as possible before copying back to the CPU. With both versions, the result is copied back to CPU memory, but with the overloaded version we also keep a pointer to the buffer on the GPU.
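A rough sketch of the caching idea (all names here are hypothetical, not the PR’s actual API): the matrix keeps a handle to its device buffer, so repeated GPU operations upload the data only once.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for a cl::Buffer on the device.
struct device_buffer {
  std::vector<double> data;  // pretend this lives in GPU memory
};

struct gpu_cached_matrix {
  std::vector<double> host;               // CPU copy of the data
  std::shared_ptr<device_buffer> device;  // lazily created GPU copy
  int uploads = 0;                        // count host-to-device copies

  // Return the device buffer, uploading only on first use; later GPU
  // operations on the same matrix reuse the cached buffer.
  std::shared_ptr<device_buffer> to_device() {
    if (!device) {
      device = std::make_shared<device_buffer>(device_buffer{host});
      ++uploads;
    }
    return device;
  }
};
```

With an explicit-function-only design, every call site would have to manage its own transfers; with this overloaded style, back-to-back GPU operations pay the transfer cost once.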
We’ve discussed this a few times but I’m not sure if we ever explicitly aired it out on the discourse and had a conclusion, so I wanted to bring it up here. Any additional last-minute considerations welcome.
Okay, nice! I think we could write a GPU kernel to do row-wise log_sum_exp in this case. We have a couple of other things in the pipeline but can add this to our to-do list.
@rok_cesnovar also posted a performance update for the GPU code on the Cholesky primitive PR that I’ll copy-paste below:
I finally managed to finish the performance tests. I played a bit with optimizing the block sizes.
N is the size of one dimension of the square input matrix.
The bottom line is that the Titan XP is up to 20 times faster than an i7-4790 CPU @ 3.60GHz.
The V100 is 70 times faster for N=20000 and from the figure it looks like this is not the upper limit yet.
The GPU outperforms the CPU at around N = 800-900. For N = 2000 the speedup is over 3x. If we do add smart logic to decide when to use the GPU, I would suggest N = 1500 or 2000 as the cutoff.
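The size-based dispatch could be as simple as the following sketch; the 1500 cutoff comes from the benchmarks above, and the names are illustrative, not the actual implementation:

```cpp
#include <cstddef>

// Below this dimension the CPU wins, because the host-device transfer
// cost dominates; above it the GPU's parallelism pays off. The value
// is an empirical cutoff from benchmarking, not a hard rule.
constexpr std::size_t kGpuCutoffN = 1500;

// Decide the backend from N, the dimension of the square input matrix.
bool use_gpu(std::size_t n) { return n >= kGpuCutoffN; }
```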
One other thing we need to change: currently users select the OpenCL platform and device at compile time, but we should allow these to be passed at runtime as well.