Catching GPU errors, making new error codes or (?)

stevebronder · February 23, 2018, 5:18am

We currently have check_ocl_error() in the GPU PR, which takes an OpenCL error code and throws a domain error naming the code you received. For example, -4 is CL_MEM_OBJECT_ALLOCATION_FAILURE.
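For readers who have not seen the PR, here is a rough sketch of what a check like this could look like. It is not the code from the GPU PR: `ocl_error_name` and `check_ocl_error_sketch` are hypothetical names, and the numeric values are copied from the OpenCL headers so the example compiles without OpenCL installed.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: maps a few OpenCL status codes to their names.
// The values match the constants defined in the OpenCL headers.
inline const char* ocl_error_name(int code) {
  switch (code) {
    case 0:  return "CL_SUCCESS";
    case -1: return "CL_DEVICE_NOT_FOUND";
    case -2: return "CL_DEVICE_NOT_AVAILABLE";
    case -3: return "CL_COMPILER_NOT_AVAILABLE";
    case -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5: return "CL_OUT_OF_RESOURCES";
    case -6: return "CL_OUT_OF_HOST_MEMORY";
    default: return "unknown OpenCL error";
  }
}

// Hypothetical stand-in for check_ocl_error(): does nothing on success,
// otherwise throws std::domain_error with the code's name and number.
inline void check_ocl_error_sketch(const char* function, int code) {
  if (code == 0) {
    return;
  }
  throw std::domain_error(std::string(function) + ": "
                          + ocl_error_name(code) + " ("
                          + std::to_string(code) + ")");
}
```

With this sketch, `check_ocl_error_sketch("copy", -4)` throws a `std::domain_error` whose message contains `CL_MEM_OBJECT_ALLOCATION_FAILURE`.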

1. We want domain_error only when the error is recoverable (that’s what I remember being told at some point).
2. C++11 now has <system_error>, where you can make your own error codes and system errors, à la the blog below.

Does the team think (2) is best for catching the OpenCL system errors? Thoughts and opinions?
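To make the <system_error> option concrete, here is a minimal sketch of a custom error category for OpenCL status codes. Everything named here (`ocl_errc`, `ocl_category`, `throw_if_ocl_error`) is hypothetical and only illustrates the standard C++11 mechanism; it is not code from the PR, and only a handful of codes are listed.

```cpp
#include <string>
#include <system_error>

// Hypothetical enum; the values mirror the OpenCL header constants.
enum class ocl_errc {
  success = 0,
  device_not_found = -1,
  mem_object_allocation_failure = -4,
  out_of_resources = -5,
  out_of_host_memory = -6
};

// Custom category that turns the integer code into a readable message.
class ocl_category_impl : public std::error_category {
 public:
  const char* name() const noexcept override { return "opencl"; }
  std::string message(int code) const override {
    switch (static_cast<ocl_errc>(code)) {
      case ocl_errc::success: return "CL_SUCCESS";
      case ocl_errc::device_not_found: return "CL_DEVICE_NOT_FOUND";
      case ocl_errc::mem_object_allocation_failure:
        return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
      case ocl_errc::out_of_resources: return "CL_OUT_OF_RESOURCES";
      case ocl_errc::out_of_host_memory: return "CL_OUT_OF_HOST_MEMORY";
      default: return "unknown OpenCL error";
    }
  }
};

// Single shared instance of the category.
inline const std::error_category& ocl_category() {
  static ocl_category_impl instance;
  return instance;
}

// Found by argument-dependent lookup when converting ocl_errc to error_code.
inline std::error_code make_error_code(ocl_errc e) {
  return {static_cast<int>(e), ocl_category()};
}

namespace std {
// Opt the enum in to implicit conversion to std::error_code.
template <>
struct is_error_code_enum<ocl_errc> : true_type {};
}  // namespace std

// Hypothetical check: throws std::system_error carrying the OpenCL code.
inline void throw_if_ocl_error(int code, const char* what) {
  if (code != 0) {
    throw std::system_error(code, ocl_category(), what);
  }
}
```

A caller could then catch `std::system_error` and inspect `e.code().value()` and `e.code().category()` rather than parsing the message string. Whether that is better than mapping codes onto the existing exception types is the question being asked here.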

Bob_Carpenter · February 23, 2018, 10:16pm

That’s right.

Stan operates primarily through exceptions. What we usually do with error codes when we find them is turn them into exceptions. What that exception will be is determined by who can catch it. For the math library, the throws should be invalid_argument or some other error besides domain_error if the exception is not going to be recoverable by our algorithms; if the exception is something that might be due to randomization and numerical issues, throw domain_error and the current execution will be halted and it’ll try again with new random numbers.
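As a toy illustration of that split (this is not Stan's actual algorithm code, and `run_with_retries` is a made-up name), a caller can treat `std::domain_error` as "try again" and let everything else, including `std::invalid_argument`, propagate:

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical driver loop. Calls step(attempt) until it succeeds,
// retrying only when it throws std::domain_error. Any other exception
// (for example std::invalid_argument) is not caught here and propagates.
// Returns the attempt number that succeeded.
inline int run_with_retries(const std::function<void(int)>& step,
                            int max_attempts) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    try {
      step(attempt);
      return attempt;
    } catch (const std::domain_error&) {
      // Treated as recoverable: fall through and try the next attempt.
    }
  }
  throw std::runtime_error("all attempts failed with domain_error");
}
```

Both `std::domain_error` and `std::invalid_argument` derive from `std::logic_error`, but neither derives from the other, so the `catch` clause above only retries on the former.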

adam-erickson · February 24, 2018, 4:36pm

Out of curiosity, will Stan also run on CUDA? This seems preferable over OpenCL for a few reasons. CUDA is widely used by deep learning libraries rather than OpenCL.

Bob_Carpenter · February 27, 2018, 7:15am

No, but it should run on NVIDIA hardware. We didn’t want to go with a proprietary solution, but could always add support later.

seantalts · February 27, 2018, 1:07pm

This seems to be a good writeup of the tradeoffs between CUDA and OpenCL: https://wiki.tiker.net/CudaVsOpenCL

salmanulhaq · October 15, 2018, 10:39am

CUDA’s ecosystem is (subjectively) bigger than OpenCL’s, with a lot already implemented that could be quickly brought into the project. I would like to gauge interest in writing CUDA versions of what’s already written in OpenCL.

stevebronder · October 15, 2018, 8:00pm

I don’t have the manpower to do this myself, but we have discussed it in the past. I’m open to it, though I’m not really sure how big of a benefit it would be. About 60%-70% of the time right now is spent transferring data back and forth between host and device, which we would also have under CUDA. CUDA’s algorithms are a bit faster, but that’s not our bottleneck at the moment.

Bob_Carpenter · October 22, 2018, 1:25am

Weren’t @rok_cesnovar and @Erik_Strumbelj going to do this at some point?

Erik_Strumbelj · October 22, 2018, 1:30am

With our current resources, CUDA support is not a priority. However, we are trying to raise funding for it (in particular, from NVIDIA). If we’re successful, we can also find the manpower to do it.
