OpenCL 2.0

Hey all,

@stevebronder was talking about switching from OpenCL 1.2 to 2.0 (cc @rok_cesnovar) and I wanted to capture this in an open discussion. He wrote this:

We can switch to 2.0, but that means some of our users (those whose GPUs are not OpenCL 2.0 compatible) will not be able to use the 2.0 features. But most GPUs that came out in the last two years or so should be fine. AMD and Intel support it, and Nvidia has released beta support for OpenCL 2.0 in its latest drivers.

OpenCL 2.0 is a pretty big leap: it was a massive change and has a lot of new features. That’s good and bad. It’s taking forever for people to adopt because, to be fair, the spec is both difficult to implement and unclear in some important places. (For example, a lot of people think CL_MEM_ALLOC_HOST_PTR gives you pinned memory, but where that allocation actually ends up is implementation defined.)
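
For concreteness, the usage pattern in question looks roughly like this. This is a minimal sketch, not code from our tree; `ctx` and `queue` are assumed to already exist:

```cpp
// Minimal sketch (OpenCL 1.2 C API). CL_MEM_ALLOC_HOST_PTR asks the driver to
// allocate host-accessible memory, but whether that memory is actually pinned
// is implementation defined.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstddef>

void* map_host_alloc_buffer(cl_context ctx, cl_command_queue queue,
                            size_t bytes, cl_mem* out_buf) {
  cl_int err = CL_SUCCESS;
  *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            bytes, NULL, &err);
  if (err != CL_SUCCESS) return NULL;
  // Mapping exposes the buffer to the host; on some drivers this is a
  // zero-copy view, on others it still triggers a copy.
  void* host_ptr = clEnqueueMapBuffer(queue, *out_buf, CL_TRUE, CL_MAP_WRITE,
                                      0, bytes, 0, NULL, NULL, &err);
  return err == CL_SUCCESS ? host_ptr : NULL;
}
```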

I think the optimal thing to do with respect to 1.2 and 2.0 is to have the default be 2.0 and allow the user to pass a USE_DEPRECATED_OPENCL_1_2 flag to use the 1.2 code path if need be.
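
A rough sketch of how that flag could map onto the Khronos C++ bindings’ version macros. The flag name comes from this thread; the wiring below is just an illustration, not a worked-out proposal for the Stan Math build:

```cpp
// Hypothetical wiring for the proposed flag; not the actual Stan Math build setup.
// Default the headers to OpenCL 2.0, fall back to 1.2 if the user asks for it.
#ifdef USE_DEPRECATED_OPENCL_1_2
  #define CL_HPP_TARGET_OPENCL_VERSION 120
#else
  #define CL_HPP_TARGET_OPENCL_VERSION 200
#endif
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#include <CL/cl2.hpp>  // Khronos OpenCL C++ bindings
```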

Before we made that leap, we would probably want to check a bunch of the cloud GPU instances and make sure they support 2.0.

My response:
When you say people aren’t adopting it, do you mean device manufacturers? Is there some chart showing device support for 1.2 vs 2.0? And which features are we most excited about from 2.0?

I don’t think adoption is the problem; it’s more that OpenCL 2.0 has a lot of new features that are taking time to implement correctly.

Not really, but I can summarize some of the notes I’ve found.

So, yeah I mean it’s actually just Nvidia that’s behind.

For me it’s Shared Virtual Memory (SVM), which lets the host and device operate on the same virtual memory. You can read about it in some detail here. The main benefits are (a minimal sketch follows after this list):

  1. If you have an integrated CPU+GPU combo on your computer, you don’t pay memory transfer costs.
  2. You can pass abstract types over to the device. Right now we have to pass over linear contiguous memory, but with SVM we can actually pass over something like a whole chunk of the expression tree. So if we had kernels for functions and their derivatives for a whole section of the expression tree, then we could move that whole piece over and do everything on the GPU.**
  3. The host and device can use atomic operations on the SVM without a transfer if fine-grained SVM is supported on the device. The extra good thing about this is that if we can figure out a way to coerce SVM and Eigen to play nicely, then we can make the SVM allocation the data backing the Eigen matrices. So we can do GPU stuff, ‘pass’ that data back to Eigen, and then Eigen can go about doing its normal operations. When we want to go do stuff on the GPU it will already know about the changes, so we pay far lower transfer costs.

** This is actually better than CUDA, where the host needs to keep queuing new functions. With OpenCL we can actually have the device call new kernels itself.
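
For a concrete picture of the host side, here is a minimal coarse-grained SVM sketch against the OpenCL 2.0 C API. `ctx`, `queue`, and `kernel` are assumed to exist, and the kernel is assumed to take a single `double*` argument; this is illustrative, not code from the math library:

```cpp
// Coarse-grained SVM sketch (OpenCL 2.0 C API).
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <cstddef>

bool run_on_svm(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
                size_t n) {
  const size_t bytes = n * sizeof(double);
  // One allocation that both the host and the device address directly.
  double* data = static_cast<double*>(
      clSVMAlloc(ctx, CL_MEM_READ_WRITE, bytes, 0));
  if (data == NULL) return false;

  // Coarse-grained SVM still needs map/unmap around host access;
  // fine-grained SVM (if the device reports it) would not.
  clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, bytes, 0, NULL, NULL);
  for (size_t i = 0; i < n; ++i) data[i] = 0.0;
  clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

  // The kernel sees the same pointer; no clEnqueueWriteBuffer is needed.
  clSetKernelArgSVMPointer(kernel, 0, data);
  size_t global = n;
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
  clFinish(queue);

  clSVMFree(ctx, data);
  return true;
}
```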

Overall 2.0 is very, very good, though to be honest Rok and I are a bit skeptical because this all sounds too good to be true. I believe Rok is planning on taking the SVM stuff for a test drive soon.

I am going to give the shared virtual memory a try over the weekend to see if it is as good as promised.

Thanks for adding this. I’d be reluctant to adopt a GPU standard that Nvidia isn’t supporting yet. Not that I want to encourage companies like Nvidia to tank their opposition by not implementing it, but this seems like a big deal.

The free memory sharing does sound too good to be true. I can believe it’s coded that way, but I don’t see how it could perform that way.

That sounds like a big maintenance burden. We’ve tried not to have to support multiple versions of the same interface. Given how long it’s taken to get the basic GPU stuff ready to go, I’d be reluctant to try to rewrite it all at this point.

But, if there are huge performance gains, it’s probably worth doing at some point, even if we don’t roll it out immediately.

We won’t be able to make much (any?) use of this without rewriting our Math library to use MatrixBase instead of concrete Matrix types everywhere. [edit] This is because we’d need to use an Eigen::Map around a buffer we get from clSVMAlloc.
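
Roughly what I have in mind, purely as a sketch (the helper name and the lack of error/ownership handling are made up for illustration):

```cpp
// Sketch: wrap an SVM allocation in an Eigen::Map so Eigen operates directly
// on memory the device can also see. Assumes downstream functions accept
// Eigen::MatrixBase / Map arguments rather than concrete Eigen::MatrixXd.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>
#include <Eigen/Dense>

Eigen::Map<Eigen::MatrixXd> svm_matrix(cl_context ctx, int rows, int cols) {
  double* buf = static_cast<double*>(clSVMAlloc(
      ctx, CL_MEM_READ_WRITE, sizeof(double) * rows * cols, 0));
  // The Map does not own the memory: the caller has to clSVMFree(ctx, buf)
  // once the matrix is no longer needed (and handle buf == NULL).
  return Eigen::Map<Eigen::MatrixXd>(buf, rows, cols);
}
```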

Are there many of those? My impression was that most powerful GPUs are not integrated.

I think that’s the plan going forward in any case.

This weekend I’d like to reach out to the email address at the end of that presentation and ask what their status is for moving 2.0 support out of beta.

I’m slowly reading through the OpenCL spec and code to see how they do this, or at least how they specify it. Hand-waving a guess: if one piece of virtual memory can point to two different pieces of physical memory, then I think what they are claiming is not impossible.

We are 100% on the same page: moving to 2.0 is not on our current plate. We are talking about investigating this for the future.

Here is a link to Dan, in the thread about input checks, talking about how Stan had issues with MatrixBase in the past. In the linked Google discussion, Ben made it sound like we could get around the coefficient-access issue Dan mentioned, though.

Intel has this “Visual Compute Accelerator”, which is an integrated CPU+GPU with 3 Xeon processors and a P580 GPU. The GPU is not crazy powerful like a Volta, but since they are integrated you don’t pay a transfer cost, which is rad.

The GPUs are not that powerful, but the last time we did a big profiling pass on the GPU code I think Rok found that about 70% of the time was spent doing memory transfers. If the CPU+GPU combo can dramatically reduce those transfer costs, then I think we would end up with a pretty decent speedup. Back-of-the-envelope math: if the GPU process took 100 ms and we reduce memory transfer costs by 95% while increasing the algorithm cost by 15%, then I think we would end up with about a 2.6x speedup relative to the current version. The current GPU code is about 8x faster, so the GPU version with SVM for 5K matrices would be roughly 21x faster than the CPU version.
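
Spelling that arithmetic out under those assumed numbers: a 100 ms GPU call with a 70/30 transfer/compute split is 70 ms of transfer and 30 ms of compute. Cutting transfers by 95% leaves 3.5 ms, and adding 15% to the compute gives 34.5 ms, so roughly 38 ms total, i.e. 100/38 ≈ 2.6x over the current GPU path and 8 × 2.6 ≈ 21x over the CPU.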

All those numbers are made up, though, so we would need to get our hands on one of these things to see what sort of speedups we would get.

This does seem better, and I think it’s definitely feasible (the main insight is that you have to call .eval() before doing coefficient-wise access in all the math functions); it’s just a lot of work and I’m not sure who will undertake it, or when. I would not bet on it happening in the next year just due to the amount of work and lack of people, but who knows, someone could pop up and be inspired enough to take it on.
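
As a sketch of that pattern (illustrative only, not actual Stan Math code):

```cpp
// A math function templated on MatrixBase instead of a concrete Eigen::Matrix.
// .eval() forces any lazy expression (or a Map over an SVM buffer) into a
// concrete matrix before coefficient-wise access.
#include <Eigen/Dense>

template <typename Derived>
double sum_of_squares(const Eigen::MatrixBase<Derived>& x) {
  const auto& m = x.eval();  // lifetime of the temporary is extended
  double total = 0.0;
  for (Eigen::Index j = 0; j < m.cols(); ++j) {
    for (Eigen::Index i = 0; i < m.rows(); ++i) {
      total += m(i, j) * m(i, j);
    }
  }
  return total;
}
```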

If the SVM stuff does end up being very good, then I’m not 100% against putting up some of my own time for that.

But how do you avoid transfer costs? Or is the point that it’s scheduling them cleverly for you?

If you did something that changed the entire matrix (like multiplying all cells by 10 or something), then yeah, I don’t think there’s a way to hand-wave away the transfer cost.

I didn’t have time to read deeper into the spec this weekend, but it could be a clever transfer. For instance, if we are computing on the CPU and can push those changes over to the GPU while the computation is running, then the transfer cost would effectively be hidden.

I’m working on the cache stuff this week (getting some nice results I’ll post soon!), so after that I’ll have time to look at the SVM stuff more seriously.