Hi everyone! It’s been a minute, so I thought some people would want a status update on the GPU stuff and what’s going on.
We are finishing up the PR for the inverse on the GPU (we’re down to naming things and adding some docs). Once that’s done we have one more PR to bring in the kernel for the GPU Cholesky, its primitive, and the derivative. We may break the Cholesky into two separate PRs, one for the kernel code / primitive and another for the derivative, which could make for an easier code review.
After that, it’s a matter of making sure cmdstan and the downstream projects can access the current GPU implementation.
As for the future, below is a list of projects I’ve compiled and shared with Rok. I’ve tried sorting these by ‘value’, where value is crudely hand-waved as usefulness / effort. I’m also probably biased towards doing low-effort things first.
- Making matrix_cl structs for arguments in the kernel signatures
(medium-high value / low effort)
See here:
This means kernel signatures will go from something like
void transpose(__global double *B, __global double *A,
               unsigned int rows, unsigned int cols)
to something like
void transpose(__global matrix_cl *B, __global matrix_cl *A)
The struct would hold the usual stuff we already pass, clean up the signatures, and let us remove the gross macros we use now.
The struct can be pretty light and hold
- indexing via an overloaded () operator
- row and column dims
- ??? !stuff! ???
The only downside is that the struct’s definition would need to exist on both the host and the device, so we either duplicate the code or do something clever.
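For illustration, here’s a minimal sketch of a shared header, assuming we stub out __global on the host (the names are made up, not what the PR will actually use):

// matrix_cl.h -- hypothetical header compiled by both the host (C++)
// and device (OpenCL C) compilers. __global only exists in OpenCL C,
// so we define it away host-side; one flavor of "something clever".
#ifndef __OPENCL_VERSION__
#define __global
#endif

typedef struct {
  __global double *data;  /* device buffer */
  unsigned int rows;
  unsigned int cols;
} matrix_cl;

/* OpenCL C has no operator overloading, so device-side a plain helper
   stands in for operator(); column-major storage assumed here, like Eigen. */
inline double matrix_cl_at(__global const matrix_cl *m,
                           unsigned int i, unsigned int j) {
  return m->data[i + j * m->rows];
}

On the host side the same struct can get a real operator() wrapper, since the C++ compiler only ever sees the plain-pointer version.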
- Using CL_MEM_ALLOC_HOST_PTR and other currently available features to speed up host-to-device memory transfer
(medium-low value / low effort)
There are a few OpenCL 1.2 features we are not using that could help us out with memory transfer speeds. For instance, CL_MEM_ALLOC_HOST_PTR typically gives us pinned host memory, which can’t be paged out to virtual memory, so the driver can use fast DMA transfers.
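Roughly, the usage would look like this (a sketch; ctx, queue, and n are assumed to already exist, and error checks are skipped):

cl_int err;
/* ask the driver for a pinned, host-accessible allocation */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               n * sizeof(double), NULL, &err);
/* map the pinned region, fill it on the host, then unmap */
double *host_ptr = (double *)clEnqueueMapBuffer(
    queue, pinned, CL_TRUE, CL_MAP_WRITE, 0, n * sizeof(double),
    0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
  host_ptr[i] = 0.0;  /* stand-in for real data */
clEnqueueUnmapMemObject(queue, pinned, host_ptr, 0, NULL, NULL);
/* subsequent transfers against this buffer avoid the extra
   pageable-memory hop the driver would otherwise make */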
- Adding derivatives for multiplication / addition / subtraction / covariance functions, and some other GPU functions
(medium value / lowish effort)
Looking at the derivative for multiplication, I don’t think we would even need any new kernels? For C = A * B the adjoints are just adjA += adjC * B^T and adjB += A^T * adjC, which are multiplies and transposes we already have (sketch below). Some of these, like the GP kernels, may require a bit of effort.
The less plug-and-play ones may be good for master’s students to work on as class projects. I can reach out to the professor who taught my GPU course and throw this out there.
The GLM kernel methods Erik mentioned also go here (the GLMs should probably be higher up on this list).
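Here’s that multiplication-derivative sketch, with hypothetical multiply / transpose / add helpers (the real matrix_gpu API may differ):

/* Reverse mode for C = A * B needs only ops we already have:
     adjA += adjC * B^T
     adjB += A^T * adjC  */
matrix_gpu adjA_inc = multiply(adjC, transpose(B));
matrix_gpu adjB_inc = multiply(transpose(A), adjC);
adjA = add(adjA, adjA_inc);
adjB = add(adjB, adjB_inc);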
- Playing nicely with MPI through out-of-order command queues and multiple GPUs
(medium value / low-medium effort)
Adding CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE to a device’s command queue means that we now [have to / get to] use event management when we execute kernels. Right now the commands in a command queue run to completion one after the other; instead we can have
a. events per matrix_gpu
b. events per kernel
Then feed those events into the other kernels (sketch below the link). I don’t think this would matter much for a single thread, but it would be good if we had multiple threads, since GPUs can do reads/writes while computing.
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html
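A minimal sketch of the event dance, assuming ctx, device, kernel_a, kernel_b, and a global size gsize already exist (placeholder names, error handling omitted):

cl_int err;
cl_command_queue q = clCreateCommandQueue(
    ctx, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

cl_event a_done;
clEnqueueNDRangeKernel(q, kernel_a, 1, NULL, &gsize, NULL, 0, NULL, &a_done);
/* kernel_b waits only on kernel_a; anything else enqueued on q
   (say, a transfer for another matrix) is free to run concurrently */
clEnqueueNDRangeKernel(q, kernel_b, 1, NULL, &gsize, NULL, 1, &a_done, NULL);
clReleaseEvent(a_done);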
- Making a cache for data on the GPU
(high value / medium-high effort)
If we know the memory addresses of data during the life of the program, then when we transfer data over to the GPU we can keep a map with the host address as the key and a pointer to the device buffer as the value. The next time this data is transferred we just look up the already-allocated buffer and return it from the cache instead of making the actual copy. This would be fast, but we would probably need to let users fine-tune the cache size for their device and problem. We also have to be careful about in-place operations; maybe we could hold a copy of the original buffer in the dictionary.
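Something like this, as a toy sketch (the names are invented, and a real version would need size checks, invalidation for in-place ops, and an eviction policy):

#include <cstddef>
#include <map>
#include <CL/cl.h>

/* host address -> device buffer already holding that data */
std::map<const double *, cl_mem> gpu_cache;

cl_mem to_gpu_cached(cl_context ctx, cl_command_queue q,
                     const double *host_data, size_t bytes) {
  auto it = gpu_cache.find(host_data);
  if (it != gpu_cache.end())
    return it->second;  /* cache hit: skip the host-to-device copy */
  cl_int err;
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
  clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host_data, 0, NULL, NULL);
  gpu_cache[host_data] = buf;
  return buf;
}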
- Using OpenCL 2.0 features
(medium value / medium effort)
Reading the article below (ctrl+f for ‘Experimental OpenCL 2.0’), it says the newest Nvidia drivers have a good bit of OpenCL 2.0 available. Some older devices may not have these drivers available and the features are still in beta, but OpenCL 2.0 has Shared Virtual Memory (SVM), which in short means that
- The host and device can share complex pointer types.
- I’m pretty sure this means we could pass whole chunks of the expression tree over to the GPU
- SVM can utilize Intel’s accelerated memory hardware for very fast transfers of data between host and device (sketch below).
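As a rough sketch of what coarse-grained SVM usage could look like (assuming an OpenCL 2.0 context ctx, queue q, kernel, and size n; placeholder names, error checks skipped):

/* host and device share the exact same pointer */
double *shared = (double *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                      n * sizeof(double), 0);
/* coarse-grained SVM still needs a map before the host touches it */
clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, shared, n * sizeof(double),
                0, NULL, NULL);
for (size_t i = 0; i < n; ++i)
  shared[i] = 0.0;
clEnqueueSVMUnmap(q, shared, 0, NULL, NULL);
/* no clEnqueueWriteBuffer: the kernel takes the pointer directly */
clSetKernelArgSVMPointer(kernel, 0, shared);
/* free with clSVMFree(ctx, shared) when done */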
- Expression Templates for the GPU functions
(very high value / very high effort)
I read the proto docs and some other expression template material online, but I’m not totally sure I could implement this in a reasonable amount of time. It feels like doing this in a way that isn’t gross is going to be a substantial effort.
I’m not sure if my ranking is correct or if I missed anything important, so if you can think of how to tackle some of these, or think one is more important than another, let me know! My personal favorites are the struct, the cache, and multiple GPUs w/ MPI.