I tried searching for what this means, without much luck, but it might be related to a version problem.
What is the required version of OpenCL? It would be nice to mention that on the wiki page as well.
Are there any other minimum requirements?
The instructions say "In cmdstan, an example model is provided in examples/GP/, which uses OpenCL Cholesky decomposition. You can check if your OpenCL configuration works by trying to build it.", but I couldn’t find that example in cmdstan develop or any obvious branch. Where can I find it?
The example link needs to be fixed. It was copied from the time when we had an experimental branch with a GP example, way back when. We should instead make GPU GLM and GP examples. I will get on that in the following days. For now I removed the text that talks about the example that isn’t there.
But seeing as you already have a model you want to try this with, that maybe isn’t needed at the moment.
Can you give me some info on the device and OS you are using?
And I presume your make/local has the following, right?
STAN_OPENCL=true
OPENCL_DEVICE_ID=0
OPENCL_PLATFORM_ID=0
Can you run clinfo to check whether you have any other OpenCL-enabled devices on your system, in which case the 0-0 index might point to some other device? This is unlikely, but just to make sure.
Regarding the version, OpenCL 1.2 is required. I think that is mentioned somewhere, but not on the wiki (I added it now, thanks for the suggestion). That should not be a problem, as I have not seen any devices limited to 1.0 or 1.1 for at least 8 years.
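To make the version check concrete, here is a minimal Python sketch that scans clinfo's text output for each device's name and OpenCL version and flags anything below 1.2. The function names (`device_versions`, `too_old`) are made up for illustration, and the parsing assumes the "Device Name" / "Device Version" line layout seen in the clinfo output posted in this thread; other clinfo builds may format things differently.

```python
import re

MIN_VERSION = (1, 2)  # minimum OpenCL version stated in this thread


def device_versions(clinfo_text):
    """Return (device name, (major, minor)) pairs found in clinfo text."""
    devices = []
    name = None
    for line in clinfo_text.splitlines():
        m = re.match(r"\s*Device Name\s+(.+)", line)
        if m:
            name = m.group(1).strip()
            continue
        m = re.match(r"\s*Device Version\s+OpenCL (\d+)\.(\d+)", line)
        if m and name is not None:
            devices.append((name, (int(m.group(1)), int(m.group(2)))))
            name = None
    return devices


def too_old(devices, minimum=MIN_VERSION):
    """Names of devices whose OpenCL version is below the minimum."""
    return [name for name, version in devices if version < minimum]


# Three lines taken from the clinfo output posted in this thread.
sample = """\
Device Name GRID P40-2Q
Device Vendor NVIDIA Corporation
Device Version OpenCL 1.2 CUDA
"""
devices = device_versions(sample)
print(devices)
print(too_old(devices))
```

To use it on a real machine, the captured output of clinfo would be passed in instead of `sample`.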
Yes. Btw, in the context of Stan, it would be great to always mention which of the different make directories is the one where this local file should be edited (cmdstan/make/local, cmdstan/stan/make/local, cmdstan/stan/lib/stan_math/local???)
It took some time to figure out how to install it without admin rights…
Number of platforms 1
Platform Name NVIDIA CUDA
Platform Vendor NVIDIA Corporation
Platform Version OpenCL 1.2 CUDA 10.0.246
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer
Platform Extensions function suffix NV
Platform Name NVIDIA CUDA
Number of devices 1
Device Name GRID P40-2Q
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 410.92
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 02:00.1
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 30
Max clock frequency 1531MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 536870912 (512MiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 491520 (480KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
Built-in kernels
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform
clCreateContext(NULL, ...) [default] No platform
clCreateContext(NULL, ...) [other] Success [NV]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) Invalid device type for platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform
It’s always the make/local of cmdstan, as that propagates down to Stan Math. I never had to edit any lower-level make/local files whenever I worked with cmdstan (OpenCL related or not). I do realize that this is maybe not obvious to everyone. I added that note (thanks again for the suggestion).
Ok, so it’s the only device, which means 0 and 0 is fine. OpenCL is fine, double precision is also fine… Argh. I apologize that this is wasting your time. I have never seen this error before :/. Your system most definitely does not have resource problems.
The next thing to try would be to rule out that it’s a bug specific to the exact model you are trying this on. So compile the model above and I will post some fake data to go with it.
./runTests.py test/unit -f opencl
...
make: 'test/unit/math/opencl/rev/triangular_transpose_test' is up to date.
make: 'test/unit/math/opencl/rev/zeros_test' is up to date.
------------------------------------------------------------
test/unit/math/opencl/assign_event_test --gtest_output="xml:test/unit/math/opencl/assign_event_test.xml"
terminate called after throwing an instance of 'std::system_error'
what(): neg_binomial_2_log_glm: clBuildProgram CL_OUT_OF_HOST_MEMORY: Unknown error -6
Aborted
test/unit/math/opencl/assign_event_test --gtest_output="xml:test/unit/math/opencl/assign_event_test.xml" failed
exit now (01/28/20 10:48:34 EET)
test/interface/opencl_test --gtest_output="xml:test/interface/opencl_test.xml"
terminate called after throwing an instance of 'std::system_error'
what(): neg_binomial_2_log_glm: clBuildProgram CL_OUT_OF_HOST_MEMORY: Unknown error -6
Aborted
Yes, it’s a virtual machine. The purpose of that instance is to make it easy to test GPU computing. It’s maintained by my university IT, so I can ask for changes to get things to work, but I first need to know what to ask for.
CL_OUT_OF_HOST_MEMORY would mean that there aren’t enough resources to compile the OpenCL kernels (OpenCL kernels are compiled just-in-time at the start of any OpenCL-enabled Stan/Stan Math program).
4GB should be enough in general, but I have to admit that I never observed how much RAM is used during the tests. Given that non-OpenCL Stan alone uses a few GB of RAM to compile, this might be it. Is there an easy option to request a GB or two of additional RAM?
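For anyone puzzled by the "Unknown error -6" part of the message: the number is the raw OpenCL status code. The sketch below maps a few of those codes to their names as defined in the Khronos CL/cl.h header. The helper name `decode_opencl_error` is made up for illustration and is not part of Stan or OpenCL.

```python
import re

# Status codes as defined in the Khronos CL/cl.h header (a subset).
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
}


def decode_opencl_error(message):
    """Return (code, name) for the integer at the end of an error message,
    or None if the message does not end in an integer."""
    m = re.search(r"(-?\d+)\s*$", message)
    if m is None:
        return None
    code = int(m.group(1))
    return code, OPENCL_ERRORS.get(code, "not in this table")


# The error line from the failing test run posted in this thread.
line = ("what(): neg_binomial_2_log_glm: clBuildProgram "
        "CL_OUT_OF_HOST_MEMORY: Unknown error -6")
print(decode_opencl_error(line))
```

Note that -6 (host memory) is distinct from -5 (CL_OUT_OF_RESOURCES, device side), which fits a shortage of RAM on the host rather than on the GPU.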
Ok, it’s a memory problem. I closed emacs and R, and the test passes!
Running main() from stan/lib/stan_math/lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from StanUiCommand
[ RUN ] StanUiCommand.opencl_ready
[ OK ] StanUiCommand.opencl_ready (0 ms)
[----------] 1 test from StanUiCommand (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
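Since closing other programs was enough to make the test pass, it may help to check how much memory is free before starting an OpenCL-enabled program. Below is a small Python sketch that parses the Linux /proc/meminfo format. It assumes a Linux host with a MemAvailable field, the function names are made up for illustration, and the sample numbers are invented, not measurements from the machine in this thread.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value
    (reported in kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def available_gib(fields):
    """Convert the MemAvailable value from kB to GiB."""
    return fields["MemAvailable"] / (1024 * 1024)


# Invented sample values; on a Linux machine, pass
# open("/proc/meminfo").read() instead.
sample = "MemTotal:        4039428 kB\nMemAvailable:     524288 kB\n"
fields = parse_meminfo(sample)
print(f"{available_gib(fields):.2f} GiB available")
```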
Wow, I have never seen kernel compilation take that much RAM. Maybe there are some issues in the compiler. You can try requesting an update of the GPU driver and hope for the best.
I tried running the exact same test while monitoring the RAM usage, and it barely makes a dent: under 100MB added to the baseline usage. Upgrading the NVIDIA driver might be the best way forward, yes.
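One way to make this kind of RAM comparison repeatable on both machines is to record the peak resident memory of the test process. The Python sketch below uses the standard `resource` module (Unix only); the helper name `peak_child_rss_kib` is made up for illustration. Two caveats: `ru_maxrss` is reported in KiB on Linux (other systems may use different units), and it is the high-water mark of the largest child process waited for so far, not a sum over all processes.

```python
import resource
import subprocess
import sys


def peak_child_rss_kib(cmd):
    """Run cmd to completion and return the largest peak resident set size
    among the child processes waited for so far (KiB on Linux)."""
    subprocess.run(cmd, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


# A trivial placeholder command so the sketch runs anywhere. To measure the
# failing test from this thread, one could instead pass
# ["test/unit/math/opencl/assign_event_test"] from the Stan Math directory.
cmd = [sys.executable, "-c", "pass"]
peak = peak_child_rss_kib(cmd)
print(f"peak RSS of child processes: {peak / 1024:.1f} MiB")
```

Running this once on the virtual machine and once on a machine where the test passes comfortably would show whether the kernel compilation step really differs in memory use between the two driver versions.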