Overall design of GPU work?

I started reading through the implementation of the OpenCL work in the math library. I mostly expected some CL sources and interop code, but then saw that some kernel fusion is happening to amortize memory access over multiple simple operations, which is pretty cool.
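
To make the memory-access point concrete, here is a minimal sketch in plain C++ on the CPU. It is not the Stan Math OpenCL API, and all function names are made up for illustration. The unfused version makes two passes over memory and materializes a temporary; the fused version reads each input once and writes the result once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one elementwise operation, one full pass over memory.
std::vector<double> add(const std::vector<double>& a,
                        const std::vector<double>& b) {
  assert(a.size() == b.size());
  std::vector<double> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
  return out;
}

// Hypothetical helper: a second elementwise operation, a second full pass.
std::vector<double> scale(const std::vector<double>& a, double c) {
  std::vector<double> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] * c;
  return out;
}

// Unfused: two passes, and the intermediate sum is written to and read
// back from memory.
std::vector<double> add_then_scale_unfused(const std::vector<double>& a,
                                           const std::vector<double>& b,
                                           double c) {
  return scale(add(a, b), c);
}

// Fused: one pass, the intermediate sum lives only in a register.
std::vector<double> add_then_scale_fused(const std::vector<double>& a,
                                         const std::vector<double>& b,
                                         double c) {
  assert(a.size() == b.size());
  std::vector<double> out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = (a[i] + b[i]) * c;
  return out;
}
```

Both versions compute the same values; on a GPU the fused form would additionally save a kernel launch, since the two operations become one kernel.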

It occurred to me that this is the sort of thing that would be easier to do in the compiler (the new OCaml one is the only one I’ve read through a bit): the MIR or backend seems like the place where one has a lot of high-level information and can fuse lots of arithmetic into a single fat kernel. It would also be the place where one could speculatively generate forward gradient kernels for some functions.
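
As a rough illustration of what compiler-side fusion could look like, here is a toy sketch that walks an expression tree and emits the body of a single elementwise kernel. It is written in C++ rather than the compiler's OCaml, it is not how the Stan compiler or the math library actually does it, and the `Expr`, `Arg`, `BinOp`, and `fused_kernel_body` names are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical expression node: each node knows how to emit its own
// per-element source text.
struct Expr {
  virtual ~Expr() = default;
  virtual std::string emit() const = 0;
};

// A kernel argument, read at the current work-item index i.
struct Arg : Expr {
  std::string name;
  explicit Arg(std::string n) : name(std::move(n)) {}
  std::string emit() const override { return name + "[i]"; }
};

// A binary elementwise operation such as + or *.
struct BinOp : Expr {
  char op;
  std::shared_ptr<Expr> lhs, rhs;
  BinOp(char o, std::shared_ptr<Expr> l, std::shared_ptr<Expr> r)
      : op(o), lhs(std::move(l)), rhs(std::move(r)) {
    assert(lhs && rhs);
  }
  std::string emit() const override {
    return "(" + lhs->emit() + " " + op + " " + rhs->emit() + ")";
  }
};

// The whole tree becomes one assignment, so every intermediate value
// stays inside a single kernel instead of going through global memory.
std::string fused_kernel_body(const Expr& e) {
  return "out[i] = " + e.emit() + ";";
}
```

For example, building the tree for `(a + b) * c` and passing it to `fused_kernel_body` yields one statement covering both operations, which is the "single fat kernel" idea in miniature.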

I guess all of this has already been thought about, but I’m left wondering if there’s a roadmap, since ultimately I’d like to be able to contribute. I didn’t find anything, though I may not have looked in the right places. Thanks in advance!

Hey, we would love to have you! There are two pieces of documentation:

  1. The Stan Math OpenCL Paper
  2. The Kernel Fusion Docs as well as the other docs under the OpenCL module and “OpenCL for Parallel Computing” under the Parallelism tab.

I’m not sure we are looking at doing the kernel fusion directly in the compiler. Our current plan is to use the new var_value<matrix_cl<double>> type to handle GPU computation in reverse mode.