Cannot get GPU to work on macOS with Intel GPU

The example model bernoulli.stan compiles with CmdStan 2.23.0 but does not sample when built with STAN_OPENCL=true on macOS 10.14.6 with an Intel processor.

I’ve seen statements on a number of pages dealing with the GPU that it will only work with “suitable hardware (e.g. Nvidia or AMD gpu) that supports OpenCL 1.2”. Obviously my GPU is not Nvidia or AMD, but it supports OpenCL. If the GPU routines simply can’t be run with an Intel processor, don’t bother reading on - but could someone explain to me why that is?

Otherwise… I’m not sure whether my compilers are messed up or if there’s a bug or what. I installed CmdStan as explained on this page: https://github.com/stan-dev/cmdstan/wiki/Getting-Started-with-CmdStan. Before running “make build”, I added a file ‘./make/local’ with the lines:

STAN_OPENCL=true
OPENCL_DEVICE_ID=1
OPENCL_PLATFORM_ID=0
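To double-check which platform and device those indices actually select, something like this minimal C++ sketch should list what the OpenCL runtime exposes (the file name and fixed-size buffers are just illustrative choices; it assumes Apple’s OpenCL framework):

// list_cl_devices.cpp - enumerate OpenCL platforms/devices to verify
// which indices OPENCL_PLATFORM_ID and OPENCL_DEVICE_ID refer to.
// Build on macOS: clang++ list_cl_devices.cpp -framework OpenCL -o list_cl_devices
#include <OpenCL/opencl.h>
#include <cstdio>

int main() {
  cl_platform_id platforms[8];  // 8 is an arbitrary upper bound
  cl_uint n_platforms = 0;
  clGetPlatformIDs(8, platforms, &n_platforms);
  for (cl_uint p = 0; p < n_platforms; ++p) {
    char pname[256];
    clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);
    std::printf("Platform #%u: %s\n", p, pname);
    cl_device_id devices[8];
    cl_uint n_devices = 0;
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &n_devices);
    for (cl_uint d = 0; d < n_devices; ++d) {
      char dname[256];
      clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
      std::printf("  Device #%u: %s\n", d, dname);
    }
  }
  return 0;
}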

“make build” then seems to work fine (though I did notice the line ld: warning: directory not found for option ‘-L/Users/~/cmdstan/stan/lib/stan_math/lib/tbb’), and “make examples/bernoulli/bernoulli” also works fine. However, when attempting to sample from the model with “examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R”, I get the following output:

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: compile_kernel: bernoulli_logit_glm <program source>:1:904: error: call to '__cl_isfinite' is ambiguous
__kernel void bernoulli_logit_glm( __global double* logp_global, __global double* theta_derivative_global, __global double* theta_derivative_sum, const __global int* y_global, const __global double* x, const __global double* alpha, const __global double* beta, const int N, const int M, const int is_y_vector, const int is_alpha_vector, const int need_theta_derivative, const int need_theta_derivative_sum) { const int gid = get_global_id(0); const int lid = get_local_id(0); const int lsize = get_local_size(0); const int wg_id = get_group_id(0); __local double local_storage[LOCAL_SIZE_]; double logp = 0; double theta_derivative = 0; if (gid < N) { double ytheta = 0; for (int i = 0, j = 0; i < M; i++, j += N) { ytheta += x[j + gid] * beta[i]; } const int y = y_global[gid * is_y_vector]; const double sign = 2 * y - 1; ytheta += alpha[gid * is_alpha_vector]; ytheta *= sign; if (y > 1 || y < 0 || !isfinite(ytheta)) { logp = NAN; } const double exp_m_ytheta = exp(-ytheta); const double cutoff = 20.0; if (ytheta > cutoff) { logp -= exp_m_ytheta; theta_derivative = -exp_m_ytheta; } else if (ytheta < -cutoff) { logp += ytheta; theta_derivative = sign; } else { logp += -log1p(exp_m_ytheta); theta_derivative = sign * exp_m_ytheta / (exp_m_ytheta + 1); } if (need_theta_derivative) { theta_derivative_global[gid] = theta_derivative; } } local_storage[lid] = logp; barrier(CLK_LOCAL_MEM_FENCE); for (int step = lsize / REDUCTION_STEP_SIZE; step > 0; step /= REDUCTION_STEP_SIZE) { if (lid < step) { for (int i = 1; i < REDUCTION_STEP_SIZE; i++) { local_storage[lid] += local_storage[lid + step * i]; } } barrier(CLK_LOCAL_MEM_FENCE); } if (lid == 0) { logp_global[wg_id] = local_storage[0]; } if (need_theta_derivative_sum) { barrier(CLK_LOCAL_MEM_FENCE); local_storage[lid] = theta_derivative; barrier(CLK_LOCAL_MEM_FENCE); for (int step = lsize / REDUCTION_STEP_SIZE; step > 0; step /= REDUCTION_STEP_SIZE) { if (lid < step) { for (int i = 1; i < REDUCTION_STEP_SIZE; i++) { local_storage[lid] += local_storage[lid + step * i]; } } barrier(CLK_LOCAL_MEM_FENCE); } if (lid == 0) { theta_derivative_sum[wg_id] = local_storage[0]; } } }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^~~~~~~~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5080:27: note: expanded from macro 'isfinite'
    #define isfinite(__x) __cl_isfinite(__x)
                          ^~~~~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5077:20: note: candidate function
__CLFN_IL_1FD_MODD(__cl_isfinite);
                   ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:598:51: note: expanded from macro '__CLFN_IL_1FD_MODD'
#define __CLFN_IL_1FD_MODD(name) int __OVERLOAD__ name(float x); \
                                                  ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5077:20: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:599:19: note: expanded from macro '__CLFN_IL_1FD_MODD'
int2 __OVERLOAD__ name(float2 x); \
                  ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5077:20: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:600:19: note: expanded from macro '__CLFN_IL_1FD_MODD'
int3 __OVERLOAD__ name(float3 x); \
                  ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5077:20: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:601:19: note: expanded from macro '__CLFN_IL_1FD_MODD'
int4 __OVERLOAD__ name(float4 x); \
                  ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5077:20: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:602:19: note: expanded from macro '__CLFN_IL_1FD_MODD'
int8 __OVERLOAD__ name(float8 x); \
                  ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:5077:20: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:603:20: note: expanded from macro '__CLFN_IL_1FD_MODD'
int16 __OVERLOAD__ name(float16 x);
                   ^
<program source>:1:966: error: call to '__fast_relax_exp' is ambiguous
[kernel source repeated verbatim; caret points at the exp(-ytheta) call]
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4544:22: note: expanded from macro 'exp'
    #define exp(__x) __fast_relax_exp(__x)
                     ^~~~~~~~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4542:30: note: candidate function
    __CLFN_FD_1FD_FAST_RELAX(__fast_relax_exp, native_exp, __cl_exp);
                             ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:417:27: note: expanded from macro '__CLFN_FD_1FD_FAST_RELAX'
inline float __OVERLOAD__ _name(float x) { return _default_name(x); } \
                          ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4542:30: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:418:28: note: expanded from macro '__CLFN_FD_1FD_FAST_RELAX'
inline float2 __OVERLOAD__ _name(float2 x) { return _default_name(x); } \
                           ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4542:30: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:419:28: note: expanded from macro '__CLFN_FD_1FD_FAST_RELAX'
inline float3 __OVERLOAD__ _name(float3 x) { return _default_name(x); } \
                           ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4542:30: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:420:28: note: expanded from macro '__CLFN_FD_1FD_FAST_RELAX'
inline float4 __OVERLOAD__ _name(float4 x) { return _default_name(x); } \
                           ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4542:30: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:421:28: note: expanded from macro '__CLFN_FD_1FD_FAST_RELAX'
inline float8 __OVERLOAD__ _name(float8 x) { return _default_name(x); } \
                           ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4542:30: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:422:29: note: expanded from macro '__CLFN_FD_1FD_FAST_RELAX'
inline float16 __OVERLOAD__ _name(float16 x){ return _default_name(x); }
                            ^
<program source>:1:1177: error: call to '__cl_log1p' is ambiguous
[kernel source repeated verbatim; caret points at the log1p(exp_m_ytheta) call]
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4668:24: note: expanded from macro 'log1p'
    #define log1p(__x) __cl_log1p(__x)
                       ^~~~~~~~~~
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4665:15: note: candidate function
__CLFN_FD_1FD(__cl_log1p);
              ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:367:48: note: expanded from macro '__CLFN_FD_1FD'
#define __CLFN_FD_1FD(name) float __OVERLOAD__ name(float x); \
                                               ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4665:15: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:368:21: note: expanded from macro '__CLFN_FD_1FD'
float2 __OVERLOAD__ name(float2 x); \
                    ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4665:15: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:369:21: note: expanded from macro '__CLFN_FD_1FD'
float3 __OVERLOAD__ name(float3 x); \
                    ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4665:15: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:370:21: note: expanded from macro '__CLFN_FD_1FD'
float4 __OVERLOAD__ name(float4 x); \
                    ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4665:15: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:371:21: note: expanded from macro '__CLFN_FD_1FD'
float8 __OVERLOAD__ name(float8 x); \
                    ^
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:4665:15: note: candidate function
/System/Library/PrivateFrameworks/GPUCompiler.framework/Versions/A/lib/clang/3.5/include/cl_kernel.h:372:22: note: expanded from macro '__CLFN_FD_1FD'
float16 __OVERLOAD__ name(float16 x);
                     ^
: Unknown error -11
Abort trap: 6

_____
For reference, my compiler info:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin18.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

@rok_cesnovar do we need to check if the compiler vendor has log1p etc. macros defined?

Oh sorry, I think I’ve run into this before. Does your Intel GPU support operations on doubles? It’s trying to find the closest type to cast to, but float, float2, etc. are all candidates.

If you run clinfo at the terminal, your device info should have a section titled Double-precision Floating-point support. Does that have an (n/a) next to it?
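If clinfo gives you trouble, you can ask the same question programmatically: CL_DEVICE_DOUBLE_FP_CONFIG comes back as 0 on devices with no double support. A minimal sketch, assuming Apple’s OpenCL framework (the file name is made up):

// check_fp64.cpp - report whether each device on the first platform
// supports double precision, which Stan's OpenCL kernels rely on.
// Build on macOS: clang++ check_fp64.cpp -framework OpenCL -o check_fp64
#include <OpenCL/opencl.h>
#include <cstdio>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);
  cl_device_id devices[8];
  cl_uint n = 0;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &n);
  for (cl_uint d = 0; d < n; ++d) {
    char name[256];
    clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
    // Zero means the device offers no double-precision support at all.
    cl_device_fp_config fp64 = 0;
    clGetDeviceInfo(devices[d], CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(fp64), &fp64, NULL);
    std::printf("Device #%u (%s): fp64 %s\n", d, name, fp64 ? "yes" : "no");
  }
  return 0;
}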


Well this is interesting…

clinfo
Platform #0
Name: Apple
Version: OpenCL 1.2 (Jan 8 2020 15:35:01)

Device #0
Name: Intel® Core™ i7-8569U CPU @ 2.80GHz
Type: CPU
Version: OpenCL 1.2
Global memory size: 16 GB
Local memory size: 32 kB
Max work group size: 1024
Max work item sizes: (1024, 1, 1)

Device #1
Name: Intel® Iris™ Plus Graphics 655
Type: GPU
Version: OpenCL 1.2
Global memory size: 1 GB 512 MB
Local memory size: 64 kB
Max work group size: 256
Segmentation fault: 11

Hm, that is weird, but I also think I had the same MacBook Pro at my old job and hit this compiler error. I can’t find the spec, but I’m pretty sure the Iris 655 does not support operations on doubles (which is what Stan uses as its numeric type). For my problem I actually found a speedup using OpenCL on the CPU, which was nice, but YMMV.


Ahh, that is a major bummer. I’ll play with it on the CPU though and see if that speeds anything up. Thanks for your help!!

Is it likely (even a little) that OpenCL could speed things up even if all the CPU cores are busy (e.g. I have 4 cores and run 4 chains at a time)?

The example bernoulli model works for me when I specify that OpenCL use my CPU, but it’s actually running into a different error with my own model. I can start a new thread or report a bug for that…

Unrecoverable error evaluating the log probability at the initial value.
Exception: multiply: clEnqueueNDRangeKernel CL_INVALID_WORK_ITEM_SIZE: Unknown error -55

Where the relevant line in my model is a less-clean version of this:

W_norm[1:M[d],]
  = W[1:M[d],]
      - to_matrix(segment(modelMat, 1, MMplus[d]), M[d], Mplus[d] - M[d])
        * W_norm[(M[d] + 1):Mplus[d],];

This runs fine when the model is compiled without OpenCL.

If you are already using 4 chains and 4 cores, then your model is already going to be using 100% of your CPU power. I don’t think you would get a lot unless you had more cores.


That could be a bug, would you be able to file an issue with a small reproducible example?

I’ll see if I can reproduce it with something other than my large messy model…

Thanks again!

Max work item sizes: (1024, 1, 1)

@rmcminds Your CPU reports a weirdly low max work item size. AFAIK this is not an inherent property of the CPU, so it might be fixed by installing a newer OpenCL driver for the CPU (or possibly an older one, as the date reported seems pretty recent). We currently need at least (32, 4, 1) for matrix multiplication, which AFAIK should be possible on virtually any device. I think we may be able to lower that requirement so it would work even on your setup.
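If anyone wants to check their own device against that (32, 4, 1) requirement, here’s a minimal sketch (again assuming Apple’s OpenCL framework and the usual three work-item dimensions; the file name is made up):

// max_work_items.cpp - print CL_DEVICE_MAX_WORK_ITEM_SIZES per device,
// for comparison against the (32, 4, 1) needed for matrix multiplication.
// Build on macOS: clang++ max_work_items.cpp -framework OpenCL -o max_work_items
#include <OpenCL/opencl.h>
#include <cstdio>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, NULL);
  cl_device_id devices[8];
  cl_uint n = 0;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 8, devices, &n);
  for (cl_uint d = 0; d < n; ++d) {
    char name[256];
    clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
    // Assumes the common case of 3 work-item dimensions
    // (query CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS to be thorough).
    size_t sizes[3] = {0, 0, 0};
    clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_ITEM_SIZES, sizeof(sizes), sizes, NULL);
    std::printf("Device #%u (%s): (%zu, %zu, %zu)\n", d, name, sizes[0], sizes[1], sizes[2]);
  }
  return 0;
}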


CPU implementations of OpenCL vary in quality. Apple’s doesn’t support work group sizes greater than one if there’s a thread barrier. Intel’s is the best-performing I’ve seen (an 8-core Haswell Xeon matching a GTX 970 on massively parallel kernels), but it’s not so easy to install. PortableCL (pocl) is an option and should work everywhere if you can compile it.