GPU Update: what's up and where we are going

Hi everyone! It’s been a minute, so I thought some people would want a status update on the GPU stuff and what’s going on.

We are finishing up the PR for the inverse on the GPU (to the point of naming things and adding some docs). Once that’s done, we have one more PR to bring in the GPU Cholesky kernel, its primitive, and the derivative. We may break the Cholesky into two separate PRs, one for the kernel code / primitive and another for the derivative. Two PRs could be nice in terms of making for an easier code review.

After that, it’s making sure CmdStan and the downstream projects can access the current GPU implementation.

As for the future, below is a list of projects I’ve compiled and shared with Rok. I’ve tried sorting these by ‘value’, where value is crudely hand-waved as usefulness / effort. I’m also probably biased towards doing low-effort things first.


  1. Making matrix_cl structs for arguments in the kernel signatures
    (medium-high value / low effort)


This means kernel signatures will go from something like

void transpose(__global double *B, __global double *A,
               unsigned int rows, unsigned int cols)

to something like

void transpose(__global matrix_cl *B, __global matrix_cl *A)

The struct would hold the normal stuff we already pass, clean up the signatures, and remove the gross macros we use now.

The struct can be pretty light and hold

  1. indexing by overloading the () operator.
  2. Row and column dims
  3. Whatever else turns out to be useful

The only downside is that the struct’s definition would need to exist on both the host and the device, so we would either have to duplicate the code or do something else clever.
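As a very rough sketch (names and layout here are made up, not actual Stan Math code; device-side OpenCL C is plain C, so the () overload would live on the host and a small helper could stand in for it on the device, and passing the dims struct by value sidesteps the restrictions OpenCL 1.2 puts on pointers inside kernel-argument structs):

typedef struct {
  unsigned int rows;
  unsigned int cols;
} matrix_cl_dims;

// Device-side indexing helper standing in for the host-side operator(),
// assuming column-major storage.
inline unsigned int mat_idx(matrix_cl_dims m, unsigned int i, unsigned int j) {
  return j * m.rows + i;
}

__kernel void transpose(__global double *B, __global const double *A,
                        matrix_cl_dims B_dims, matrix_cl_dims A_dims) {
  size_t i = get_global_id(0);
  size_t j = get_global_id(1);
  if (i < A_dims.rows && j < A_dims.cols)
    B[mat_idx(B_dims, j, i)] = A[mat_idx(A_dims, i, j)];
}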


  2. Using CL_MEM_ALLOC_HOST_PTR and other currently available features to speed up host-to-device memory transfer
    (medium-low value / low effort)

There are a few OpenCL 1.2 features we are not using that could help us out with memory transfer speeds. For instance, CL_MEM_ALLOC_HOST_PTR pins memory in host RAM and keeps it from being paged out to virtual memory, which can give us some speed on transfers.
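A minimal sketch of how that could look with the raw OpenCL C API (illustrative only, not the Stan Math wrappers; the helper name is made up):

#include <CL/cl.h>

// Ask the driver for host-accessible (typically pinned) memory backing the
// buffer, then map it so the host can fill it directly.
double* make_pinned_staging(cl_context ctx, cl_command_queue queue,
                            size_t bytes, cl_mem* out_buf) {
  cl_int err;
  *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            bytes, NULL, &err);
  // Unmap with clEnqueueUnmapMemObject before enqueueing kernels that read it.
  return (double*) clEnqueueMapBuffer(queue, *out_buf, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, NULL, NULL, &err);
}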


  3. Adding multiplication / addition / subtraction / covariance function derivatives and some other GPU functions
    (medium value / lowish effort)

Looking at the derivative for multiplication, I don’t think we would even need any new kernels. Some of these, like the GP kernels, may require a bit of effort.
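For reference, the reverse-mode updates for C = A * B are themselves just matrix products over the existing multiply and transpose kernels, which is why no new kernel code should be needed:

adj(A) += adj(C) * B'
adj(B) += A' * adj(C)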

The less plug and play ones may be good for master’s students to work on for class projects. I can reach out to the professor who taught my GPU course and throw this out there for people.

The GLM kernel methods Erik mentioned also go here (the GLMs should probably be higher up on this list).


  4. Playing nicely with MPI through out-of-order command queues and multiple GPUs
    (medium value / low-medium effort)

Adding the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE to a device’s command queue means that we now [have to / get to] use event management when we execute kernels.

Right now we wait on all the commands in a command queue to finish one after the other. Instead we could have
a. events per matrix_gpu
b. events per kernel

and then feed those events into the other kernels. I don’t think this would matter much for a single thread, but it would be good if we had multiple threads, since GPUs are able to do a read/write while computing other things.

https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html
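Roughly, the event chaining could look like this with the raw API (a sketch only; it assumes context, device, the buffers, sizes, and kernels already exist, and the kernel names are placeholders):

cl_int err;
cl_command_queue queue = clCreateCommandQueue(
    context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

cl_event write_done, multiply_done;
clEnqueueWriteBuffer(queue, d_A, CL_FALSE, 0, bytes, h_A, 0, NULL, &write_done);
// The multiply only waits on the transfer it actually depends on...
clEnqueueNDRangeKernel(queue, multiply_kernel, 2, NULL, global, NULL,
                       1, &write_done, &multiply_done);
// ...so unrelated transfers or kernels for other matrices can overlap with it.
clEnqueueNDRangeKernel(queue, add_kernel, 2, NULL, global, NULL,
                       1, &multiply_done, NULL);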


  5. Making a cache for data on the GPU
    (high value / medium-high effort)

If we know the memory addresses of data during the life of the program, then when we transfer data over to the GPU we can keep a map with the host address as the key and a pointer to the device buffer as the value. The next time this data is transferred, we just look up the already-allocated buffer’s pointer and return it from the cache instead of making the actual copy. This would be fast, but we would probably need to let users fine-tune the cache size for their device and problem. We also have to be careful about in-place operations; maybe we could hold a copy of the original buffer in the dictionary.
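A hypothetical sketch of the lookup (all names here are made up, not the actual Stan Math API):

#include <unordered_map>
#include <CL/cl.h>

// Key the device buffer on the host address of data that stays constant for
// the life of the program; reuse it instead of re-transferring.
static std::unordered_map<const double*, cl_mem> gpu_cache;

cl_mem to_gpu_cached(cl_context ctx, cl_command_queue queue,
                     const double* host_data, size_t bytes) {
  auto it = gpu_cache.find(host_data);
  if (it != gpu_cache.end())
    return it->second;  // already on the device, skip the copy
  cl_int err;
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
  clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_data, 0, NULL, NULL);
  gpu_cache[host_data] = buf;
  return buf;
}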


  6. Using OpenCL 2.0 features
    (medium value / medium effort)

Reading the article below (ctrl+f for ‘Experimental OpenCL 2.0’), it says the newest Nvidia drivers have a good bit of OpenCL 2.0 available. Some older devices may not have these drivers available and the features are still in beta, but OpenCL 2.0 has Shared Virtual Memory (SVM), which in short means that

  1. The host and device can share complex pointer types.
    • I’m pretty sure this means we could pass whole chunks of the expression tree over to the GPU
  2. SVM can utilize Intel’s accelerated memory hardware for very fast transfers of data between host and device.
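A rough idea of the coarse-grained SVM API (OpenCL 2.0; we don’t use this yet, and the variable names are placeholders):

// One allocation, visible to both host and device through the same pointer.
double *shared = (double*) clSVMAlloc(context, CL_MEM_READ_WRITE, bytes, 0);
// Coarse-grained SVM still wants a map/unmap around host access...
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, shared, bytes, 0, NULL, NULL);
/* ... fill `shared` on the host; embedded pointers stay valid on the device ... */
clEnqueueSVMUnmap(queue, shared, 0, NULL, NULL);
// The same pointer is handed straight to the kernel, no separate cl_mem needed.
clSetKernelArgSVMPointer(kernel, 0, shared);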

  7. Expression templates for the GPU functions
    (very high value / very high effort)

I read the proto docs and some other expression template stuff online, but I’m not totally sure I could implement this in a reasonable amount of time. It feels like doing this in a not gross way is going to be a substantial effort.


I’m not sure if my ranking is correct or if I missed anything important, so if you can think of how to tackle some of these, or whether one is more important than another, let me know! My personal favorites are the struct, the cache, and multiple GPUs w/ MPI.


@Erik_Strumbelj is there a writeup for the GLM things we can use as a starting proposal?

@avehtari I feel like you are the main GP guy, so I’m happy to hear any special requests!

Great breakdown Steve.

Davor and I will probably focus mostly on 3. (getting new function derivatives), as most of these will not require new kernels (apart from gp_cov_exp_quad and the like).

As these PRs usually take some time, the next goal I started working on is data transfers, but not the ones mentioned in 5. I am interested in reducing the data transfers between functions. For example:

A*D + B*C currently requires:

  • transferring input data for A*D and transferring the result back
  • transferring input data for B*C and transferring the result back
  • transferring input data for + and transferring the final result back

I would like to get to this:

  • transferring input data for A*D
  • transferring input data for B*C
  • transferring the final result back
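With the raw API, the target looks something like this (a sketch only; it assumes d_A through d_D are cl_mem buffers already on the device and that the multiply/add kernels exist):

cl_int err;
cl_mem d_AD = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);
cl_mem d_BC = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);
cl_mem d_sum = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);
// enqueue multiply(d_A, d_D) -> d_AD, multiply(d_B, d_C) -> d_BC,
// then add(d_AD, d_BC) -> d_sum, with no host round trips in between,
// and only the final result comes back:
clEnqueueReadBuffer(queue, d_sum, CL_TRUE, 0, bytes, h_result, 0, NULL, NULL);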

Of the ones you mentioned, 5. is the most crucial, I think. I would also love to upgrade to OpenCL 2.0.


For many GPs, making Cholesky, matrix products, and matrix sums faster is already helpful. For some bigger GPs the bottleneck is the autodiff, and it will take some time before we have something fast and flexible enough (details elsewhere) that it would make sense to add more GPU things beyond everything else you already have on your list.


Thanks @stevebronder!

There really isn’t that much to say about the GLMs. We would have completed them already, but we stopped working on them until the unnecessary data transfers are sorted out.

I have a list (somewhere) with the GLM and other models that we want to parallelize, and also some matrix algebra primitives. We can do all this work within our group at the University of Ljubljana; we are just waiting for the data transfers. From my perspective, that is what we should focus our effort on. At least not transferring the things we know are constant (data, transformed data).

@tadej has also been working on a GPU implementation of the QR decomposition.

This is all great news. Coupling GPU with MPI sounds like an amazing thing - if you have a problem which is amenable to this, then this approach will be a massive jump in performance. For this to work well we need to think about data caching. Are you planning to use MPI via map_rect, or something even more tightly integrated?

… most importantly: at which milestone does it make sense to make all this nice work appear in the wild to be used by Stan users? Is the Cholesky the operation after which you would suggest a release? … and any idea when this happens? I know things take longer than anticipated…

The current plan is:

  1. get the lower_triangular_inverse PR through (close; as Steve said, we are finishing up docs and such)
  2. /prim cholesky_decompose
  3. /rev cholesky_decompose
  4. PRs to Stan/Cmdstan to expose to the Stan user

A large majority of the required GPU code (kernels) is already in Stan Math; the inverse had by far the most new GPU code of all the GPU PRs, so it took some additional time to review and polish. (2.) has a small GPU kernel, while the rest are just calls to the code already in Stan Math. The PR for (2.) will be small; (3.) will be a bit larger, but there will be no additional GPU kernel code, so it will be easier to review.

We still haven’t discussed how we are going to expose the GPU versions of the functions: whether they are going to have a _gpu suffix or whether we are going to call the GPU code inside the regular Stan Math functions. If we go with the former, we will need to add new functions in (4.).

In terms of a timeframe, the code is finished (apart from merging the cleanup we did in (1.)), so it all depends on the length of the review process. My guess would be about a month, but I am hoping for something more like 2-3 weeks.

This should probably be discussed at a Stan meeting. My favourite would be to make it work similarly to map_rect. That would mean that compiler flags control whether it is enabled or not. In this case it is a bit more tricky, as small matrices should probably still go to the CPU instead of the GPU, so there needs to be more “wisdom” built into the functions.

I am really looking forward to this… we are about to get some of the Nvidia V100 cards in our cluster; so great timing.

Agree!

This would be my favourite also.

Nice! The V100s are real beasts! I will present benchmarks for the V100 in the PRs for (2.) and (3.), like we did for the lower triangular inverse here.

Somewhat unrelated but since this thread is discussing updates on the GPU:

Re. writing our own framework to implement/call OpenCL kernels, was Boost.Compute evaluated as a possible option? It already provides a very clean interface for using existing OpenCL kernels and writing/calling our own. At this stage, though, adding it would mean making significant changes to the framework(?)

Yes, we had a discussion about this in a PR. I found it a while ago and was excited at first, but the master branch on GitHub is failing, and Dan and I were not big fans of the syntax.

I’m pretty happy with the current implementation, but if there’s something we missed that Boost.Compute would give us relative to our current model, then I’d be happy to discuss it.

Yes, I agree; I’m planning to come to either the meeting this week or next week. At first I was a fan of having the _gpu suffix, but if we can be clever about when we turn the GPU stuff on and off, then I don’t think it’s necessary.

I think we can get away with users using map_rect for now. Each thread will be sending a bunch of reads/writes and routines to execute, and they will keep the GPU very busy since it can read/write and run routines at the same time.

I think soonish, i.e. in the next two weeks, I need to come to a Stan meeting and chat about how users should touch this stuff.

I agree!

Thanks for the updates everyone!

Is it possible to use any of these new methods in Stan now? I am particularly keen to try the GPU matrix multiply. Happy to use cmdstan, and also an experimental branch. I followed the steps here but I can’t seem to use multiply_matrix_gpu in a model. It fails saying “No matches for:” multiply_matrix_gpu(matrix, matrix). Is there something I need to do?

The functions have not been exposed to the Stan language yet, so for now you can only use them in Stan Math. The first one in line to be exposed to the Stan language is the cholesky_decompose one. Others should then follow soon, as the GPU code is in Stan Math already.

Isn’t the plan just to have an environment variable that turns on or off GPU calculations? Or is the plan now to go to specific GPU functions?

If the plan is the second, will the specific GPU functions fall back to using CPUs if GPUs aren’t available? I’m worried about people having to write multiple versions of their model depending on where they run it.

Yeah, that was my thinking/idea also, but as we have not opened up the issue/PR/discussion yet, I didn’t want to presume anything without input from the guys in charge.

EDIT: The Cholesky prim PR that is coming up today/tomorrow will have things prepared in a way that the user turns GPU calculations on/off with environment variables.

We were going to have our own cholesky_decompose_gpu, but for the reasons you mention it’s probably best to have users just turn on the compile flag.

I think it comes down to how fine-grained you envision the control being. For instance, could some decompositions be done on the GPU and others stay on the CPU? Then we need a way to specify for each call.

It’s always easier on the users to add new functions later than take them away.

If I had one wish, it would be an array-based, vectorized log_sum_exp function on the GPU. This would help with many multinomial models.


How big are the vectors you’re passing in?

The arithmetically stable form is:

log_sum_exp(alpha) = max(alpha) + log(sum(exp(alpha - max(alpha))))

But I don’t see great opportunities for parallelization. We could maybe do the exp() and differencing in parallel, and maybe (not sure how restrictive GPUs are) the max(). But the max has to be computed before the log-sum-exp part for stability. Is there an efficient sum?

The vector size is 30, and the array has 10K elements. I think the switches between the max calculation and the floating point arithmetic are costly for the processing unit. It would be better to calculate all the maxes first and then do the math. Or be optimistic: do the math first and, in case of overflow, do the max tricks and recalculate.
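For what it’s worth, a minimal sketch of what an array-based version could look like (one work-item per length-30 vector, so the 10K vectors run in parallel while the max / exp / sum inside each vector stays sequential; this is illustrative only, not existing Stan Math code):

__kernel void log_sum_exp_rows(__global const double *alpha,  // n_rows x n_cols, row-major
                               __global double *out,
                               unsigned int n_rows, unsigned int n_cols) {
  size_t row = get_global_id(0);
  if (row >= n_rows) return;
  __global const double *a = alpha + row * n_cols;
  double m = a[0];
  for (unsigned int j = 1; j < n_cols; ++j)   // max first, for stability
    m = fmax(m, a[j]);
  double s = 0.0;
  for (unsigned int j = 0; j < n_cols; ++j)
    s += exp(a[j] - m);
  out[row] = m + log(s);
}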