Catching GPU errors, making new error codes or (?)

math

#1

We currently have check_ocl_error() in the GPU PR, which takes an OpenCL error code and throws a domain error saying which error code you received. For example, -4 is CL_MEM_OBJECT_ALLOCATION_FAILURE.

  1. We want to throw domain_error only when the error is recoverable (that’s what I remember being told at some point)
  2. C++11 now has <system_error>, where you can make your own error codes and system errors, à la the blog below

Does the team think (2) is best for catching the OpenCL system errors? Thoughts and opinions?
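
To make option (2) concrete, here is a minimal sketch of a custom error category built on <system_error>. The names opencl_errc, opencl_category and throw_on_opencl_error are hypothetical (they are not in the GPU PR or in OpenCL), and only a handful of codes are listed; the numeric values are assumed to mirror the CL_* constants in the OpenCL headers.

```cpp
#include <string>
#include <system_error>
#include <type_traits>

// Hypothetical enum: a small subset of OpenCL status codes, with values
// assumed to match the CL_* constants from the OpenCL headers.
enum class opencl_errc {
  success = 0,
  device_not_found = -1,
  mem_object_allocation_failure = -4,
  out_of_resources = -5,
  out_of_host_memory = -6
};

// Opt the enum in to implicit conversion to std::error_code.
namespace std {
template <>
struct is_error_code_enum<opencl_errc> : true_type {};
}  // namespace std

// Category that turns a raw status code into a readable name.
class opencl_category_impl : public std::error_category {
 public:
  const char* name() const noexcept override { return "opencl"; }
  std::string message(int ev) const override {
    switch (static_cast<opencl_errc>(ev)) {
      case opencl_errc::success:
        return "CL_SUCCESS";
      case opencl_errc::device_not_found:
        return "CL_DEVICE_NOT_FOUND";
      case opencl_errc::mem_object_allocation_failure:
        return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
      case opencl_errc::out_of_resources:
        return "CL_OUT_OF_RESOURCES";
      case opencl_errc::out_of_host_memory:
        return "CL_OUT_OF_HOST_MEMORY";
      default:
        return "unknown OpenCL error";
    }
  }
};

// Single shared category instance, as std::error_code compares by address.
inline const std::error_category& opencl_category() {
  static opencl_category_impl instance;
  return instance;
}

// Found by argument-dependent lookup when an opencl_errc is converted.
inline std::error_code make_error_code(opencl_errc e) {
  return {static_cast<int>(e), opencl_category()};
}

// Hypothetical replacement for a check that throws on any nonzero status.
inline void throw_on_opencl_error(int cl_status, const char* where) {
  if (cl_status != 0) {
    throw std::system_error(cl_status, opencl_category(), where);
  }
}
```

A caller could then catch std::system_error and inspect e.code().value() or e.code().message() instead of parsing the text of a domain error.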


#2

That’s right.

Stan operates primarily through exceptions. What we usually do with error codes when we find them is turn them into exceptions. Which exception to throw is determined by who can catch it. In the math library, if our algorithms can’t recover from the error, throw invalid_argument or some other exception besides domain_error. If the error might be due to randomization or numerical issues, throw domain_error: the current execution will be halted and it’ll try again with new random numbers.
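
As a rough illustration of that policy (not the actual Stan Math code), the sketch below turns a status code into one of the two exception types and shows a caller that retries only on domain_error. The names failure_kind, throw_for_status and attempt are hypothetical, and deciding which codes count as recoverable is left to the caller.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical tag: the caller decides whether a failure is something the
// algorithm can recover from by rejecting the current attempt.
enum class failure_kind { recoverable, fatal };

// Turn a nonzero status code into an exception. Recoverable failures become
// std::domain_error; everything else becomes std::invalid_argument.
inline void throw_for_status(int status, failure_kind kind,
                             const std::string& function) {
  if (status == 0) {
    return;
  }
  const std::string msg = function + ": error code " + std::to_string(status);
  if (kind == failure_kind::recoverable) {
    throw std::domain_error(msg);
  }
  throw std::invalid_argument(msg);
}

// Sketch of the catching side: returns true if the step succeeded and false
// if it threw domain_error (so the caller can try again). Any other
// exception, including invalid_argument, propagates to the caller.
template <typename F>
bool attempt(F&& step) {
  try {
    step();
    return true;
  } catch (const std::domain_error&) {
    return false;
  }
}
```

Because std::domain_error and std::invalid_argument are sibling classes under std::logic_error, the catch clause in attempt handles only the recoverable case.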


#3

Out of curiosity, will Stan also run on CUDA? This seems preferable to OpenCL for a few reasons. For one, deep learning libraries mostly use CUDA rather than OpenCL.


#4

No, but it should run on NVIDIA hardware. We didn’t want to go with a proprietary solution, but could always add support later.


#5

This seems to be a good writeup of the tradeoffs between CUDA and OpenCL: https://wiki.tiker.net/CudaVsOpenCL


#6

CUDA’s ecosystem is (subjectively) bigger than OpenCL’s, with a lot already implemented that could be quickly brought into the project. I would like to gauge interest in writing CUDA versions of what’s already written in OpenCL.


#7

I don’t have the manpower to do this myself, but we have discussed it in the past. I’m open to it, though I’m not really sure how big of a benefit it would be. About 60%-70% of the time right now is spent transferring data back and forth between host and device, which we would also have under CUDA. Their algorithms are a bit faster, but that’s not our bottleneck at the moment.


#8

Weren’t @rok_cesnovar and @Erik_Strumbelj going to do this at some point?


#9

With our current resources, CUDA support is not a priority. However, we are trying to raise funding for it (in particular, from NVIDIA). If we’re successful, we can also find the manpower to do it.