Catching GPU errors, making new error codes or (?)



We currently have check_ocl_error() in the GPU PR, which takes an OpenCL error code and throws a domain error naming the code you received. For example, -4 is CL_MEM_OBJECT_ALLOCATION_FAILURE.

  1. We want domain_error only when the error is recoverable (that’s what I remember being told at some point)
  2. C++11 now has <system_error>, where you can define your own error codes and system errors, à la the blog below

Does the team think (2) is best for catching the OpenCL system errors? Thoughts and opinions?
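
For concreteness, here is a minimal sketch of what option (2) could look like. It is not the code in the GPU PR: the names ocl_errc, ocl_category and throw_ocl_system_error are hypothetical, and only three status codes are mapped. The numeric values match the OpenCL headers, but they are hard-coded so the snippet compiles without OpenCL installed.

```cpp
#include <string>
#include <system_error>

// Hypothetical enum mirroring a few OpenCL status codes (values as in CL/cl.h).
enum class ocl_errc {
  success = 0,                         // CL_SUCCESS
  device_not_found = -1,               // CL_DEVICE_NOT_FOUND
  mem_object_allocation_failure = -4   // CL_MEM_OBJECT_ALLOCATION_FAILURE
};

// Error category that turns the integer status into a readable name.
class ocl_category_impl : public std::error_category {
 public:
  const char* name() const noexcept override { return "opencl"; }
  std::string message(int ev) const override {
    switch (static_cast<ocl_errc>(ev)) {
      case ocl_errc::success:
        return "CL_SUCCESS";
      case ocl_errc::device_not_found:
        return "CL_DEVICE_NOT_FOUND";
      case ocl_errc::mem_object_allocation_failure:
        return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
      default:
        return "unknown OpenCL error";
    }
  }
};

// Single shared category instance, as <system_error> expects.
inline const std::error_category& ocl_category() {
  static ocl_category_impl instance;
  return instance;
}

// Found by argument-dependent lookup when an ocl_errc is converted
// to a std::error_code.
inline std::error_code make_error_code(ocl_errc e) {
  return std::error_code(static_cast<int>(e), ocl_category());
}

namespace std {
// Opt in to implicit conversion from ocl_errc to std::error_code.
template <>
struct is_error_code_enum<ocl_errc> : true_type {};
}  // namespace std

// Hypothetical helper: throws std::system_error for any non-zero status.
inline void throw_ocl_system_error(int status, const char* function) {
  if (status != 0) {
    throw std::system_error(status, ocl_category(), function);
  }
}
```

A caller could then catch std::system_error and inspect e.code().value() or e.code().message() instead of parsing a string out of a domain_error.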


That’s right.

Stan operates primarily through exceptions. When we find error codes, we usually turn them into exceptions. Which exception to throw is determined by who can catch it. In the math library, throw invalid_argument (or some other exception besides domain_error) if our algorithms cannot recover from the error. If the error might be due to randomization or numerical issues, throw domain_error: the current execution will be halted and the algorithm will try again with new random numbers.
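
To illustrate the split, here is a rough sketch rather than Stan's actual code. The function names check_finite_value, check_size_match and try_with_retries are made up for this example. The point is only that the caller catches domain_error and retries, while invalid_argument propagates out.

```cpp
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical check: a non-finite value can come from an unlucky draw or a
// numerical issue, so it is reported as a recoverable domain_error.
inline void check_finite_value(const char* function, double x) {
  if (!std::isfinite(x)) {
    throw std::domain_error(std::string(function) + ": value is not finite");
  }
}

// Hypothetical check: a size mismatch will not go away on a retry, so it is
// reported as invalid_argument.
inline void check_size_match(const char* function, int expected, int actual) {
  if (expected != actual) {
    throw std::invalid_argument(std::string(function) + ": size mismatch");
  }
}

// Hypothetical caller: retries on domain_error and lets every other
// exception propagate. Returns false if all attempts failed.
template <typename F>
bool try_with_retries(F&& attempt, int max_tries) {
  for (int i = 0; i < max_tries; ++i) {
    try {
      attempt(i);
      return true;
    } catch (const std::domain_error&) {
      // Recoverable: fall through and try again.
    }
  }
  return false;
}
```

Both exception types derive from std::logic_error, but catching domain_error by type does not catch invalid_argument, so the non-recoverable error still reaches whoever is further up the stack.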


Out of curiosity, will Stan also run on CUDA? It seems preferable to OpenCL for a few reasons; for one, deep learning libraries widely use CUDA rather than OpenCL.


No, but it should run on NVIDIA hardware. We didn’t want to go with a proprietary solution, but could always add support later.


This seems to be a good writeup of the tradeoffs between CUDA and OpenCL: