Running Stan on the GPU with OpenCL on WSL: Seeking Assistance

I am attempting to follow the instructions from the cmdstanr article “Running Stan on the GPU with OpenCL” on Windows Subsystem for Linux 2 (WSL2). The code runs, but model fitting on the GPU takes about as long as on the CPU. I need to determine whether this is due to a configuration error on my part or a limitation of my hardware.

The guide suggests setting the path to OpenCL.lib as path_to_opencl_lib <- "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.5/lib/x64" for Windows. However, in my WSL2 environment, I have installed CUDA 12.1, but there is no OpenCL.lib. Instead, there is a file named libOpenCL.so located in /usr/local/cuda-12.1/targets/x86_64-linux/lib, so I set the path as path_to_opencl_lib <- "/usr/local/cuda-12.1/targets/x86_64-linux/lib". With this path setting, both rebuilding cmdstanr (cmdstanr::rebuild_cmdstan()) and model fitting were successful.

I am using an NVIDIA RTX3060 GPU with 12GB of dedicated memory. However, when fitting the model on the GPU, the Mean chain execution time is 294.2 seconds, and the Total execution time is 298.0 seconds. In comparison, fitting the model on the CPUs results in a Mean chain execution time of 304.4 seconds and a Total execution time of 334.3 seconds. Therefore, there is not a significant speed improvement when using the GPU.

Could this be due to an incorrect path setting or a limitation of my GPU hardware? Any hints or guidance on how to diagnose and resolve this issue would be greatly appreciated.

MWE

# Set the path to OpenCL lib
path_to_opencl_lib <- "/usr/local/cuda-12.1/targets/x86_64-linux/lib" # This path contains libOpenCL.so
cpp_options = list(
  paste0("LDFLAGS+= -L\"",path_to_opencl_lib,"\" -lOpenCL")
)

cmdstanr::cmdstan_make_local(cpp_options = cpp_options)
cmdstanr::rebuild_cmdstan() # Even if the above path could be incorrect, this line did execute successfully.

library(cmdstanr)

# Simulate data
n <- 250000
k <- 20
X <- matrix(rnorm(n * k), ncol = k)
y <- rbinom(n, size = 1, prob = plogis(3 * X[,1] - 2 * X[,2] + 1))
mdata <- list(k = k, n = n, y = y, X = X)

# Create a temporary Stan file
temp_stan_file <- tempfile(fileext = ".stan")

stan_url <- "https://raw.githubusercontent.com/stan-dev/cmdstanr/master/vignettes/articles-online-only/opencl-files/bernoulli_logit_glm.stan"

stan_code <- httr::GET(url = stan_url) |>
  httr::content(as = "text")

writeLines(stan_code, temp_stan_file)

# Compile a model using a GPU
mod_cl  <- cmdstan_model(
  temp_stan_file,
  cpp_options = list(stan_opencl = TRUE)
)
# Fit the model using the GPU
fit_cl <- mod_cl$sample(
  data = mdata,
  chains = 4,
  parallel_chains = 4,
  opencl_ids = c(0, 0),
  refresh = 0
)
# Compiling Stan program...
# Running MCMC with 4 parallel chains...
# 
# Chain 1 finished in 292.6 seconds.
# Chain 3 finished in 294.0 seconds.
# Chain 2 finished in 294.4 seconds.
# Chain 4 finished in 295.8 seconds.
# 
# All 4 chains finished successfully.
# Mean chain execution time: 294.2 seconds.
# Total execution time: 298.0 seconds.

# Compile a model using CPUs
mod <- cmdstan_model(
  temp_stan_file,
  force_recompile = TRUE
)
# Fit the model using CPUs
fit_cpu <- mod$sample(
  data = mdata,
  chains = 4,
  parallel_chains = 4,
  refresh = 0
)
# Compiling Stan program...
# Running MCMC with 4 parallel chains...
# 
# Chain 3 finished in 277.9 seconds.
# Chain 2 finished in 290.3 seconds.
# Chain 1 finished in 317.3 seconds.
# Chain 4 finished in 332.2 seconds.
# 
# All 4 chains finished successfully.
# Mean chain execution time: 304.4 seconds.
# Total execution time: 334.3 seconds.

# Compare exec times
fit_cpu$time()$total / fit_cl$time()$total
# 1.121795

Working environments

  • Operating System: Ubuntu 22.04 LTS on Windows Subsystem for Linux 2 (WSL2)
    • All of the following software was installed and runs inside WSL2, NOT on native Windows.
  • CmdStan Version: CmdStan v2.35.0
  • Compiler/Toolkit:
    • CUDA 12.1
    • GPU: NVIDIA RTX 3060 (12GB of the dedicated memory)
    • CPU: Intel Core i9-10980XE (18 cores, 36 threads)
    • R 4.4.1
      • cmdstanr: 0.8.1

How did you install CUDA for WSL? NVIDIA has a guide which contains some steps that differ from “normal” Linux: CUDA on WSL

Have a look at how high the load is on your graphics card, nvtop would be my recommendation. Not sure whether the Windows task manager might also show the resource usage on WSL.

Did you set the STAN_OPENCL=true flag in the make file?

It may also very well be that copying data to and from the GPU just takes too much time for any noticeable speedup, but if so, there should be at least a little bit of load on the GPU.
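One way to put a number on the GPU load from inside WSL2 is to sample nvidia-smi while the chains are running. The sketch below is only an illustration: the function names sample_gpu_util and peak_gpu_util are made up for this example, and it assumes nvidia-smi is on the PATH and reports utilisation under WSL2, which may not hold for every driver.

```shell
# Hypothetical helper: print <n> one-second samples of
# "utilization.gpu, memory.used" (percent, MiB) as CSV lines.
sample_gpu_util() {
  n=${1:-30}
  i=0
  while [ "$i" -lt "$n" ]; do
    nvidia-smi --query-gpu=utilization.gpu,memory.used \
               --format=csv,noheader,nounits
    sleep 1
    i=$((i + 1))
  done
}

# Hypothetical helper: read those CSV lines on stdin and print the highest
# utilisation seen (0 if there were no samples).
peak_gpu_util() {
  awk -F',' '{ u = $1 + 0; if (u > max) max = u } END { print max + 0 }'
}
```

Running `sample_gpu_util 30 | peak_gpu_util` in a second terminal while mod_cl$sample() is fitting gives the peak over 30 seconds. A peak that stays near 0 would suggest the work is not reaching the GPU at all, rather than being slowed down by data transfer.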

@WardBrian Thank you for your assistance.

After your reply, I installed Ubuntu 24.04 on WSL2 and performed the following steps, including installing CUDA for WSL2. In my previous environment, Ubuntu 22.04, I had been experimenting with running large language models (LLMs) locally, which resulted in multiple CUDA installations and a convoluted PATH. Therefore, I installed Ubuntu 24.04 afresh.

I managed to get cmdstanr running on Ubuntu 24.04 as well. However, the performance remains slower than I expected. The steps I took were:

  1. Moved the folder from home/<user_name> on Ubuntu 22.04 to /mnt/c/Users/<user_name>/wsl-init.
  2. Installed Ubuntu 24.04 on WSL2.
  3. Moved the folder from /mnt/c/Users/<user_name>/wsl-init to home/<user_name> on Ubuntu 24.04.
  4. Installed the latest NVIDIA Driver from NVIDIA’s website.
  5. Verified that libcuda.so is only located in /usr/lib/wsl/lib/libcuda.so using find /usr/ -name libcuda.so.
  6. Followed the CUDA on WSL guide:
    1. Removed the existing key: sudo apt-key del 7fa2af80.
    2. Executed the following commands as per the CUDA 12.1.1 installation guide:
      wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
      sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
      wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
      sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
      sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
      sudo apt-get update
      sudo apt-get -y install cuda
      
  7. Set the PATH:
    echo 'export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}' >> ~/.bashrc
    echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
    
    source ~/.bashrc
    
  8. Configured WSL not to inherit the Windows PATH:
    1. Edited /etc/wsl.conf to add:
      [interop]
      appendWindowsPath = false
      
    2. Executed the following in Windows PowerShell:
      wsl.exe --shutdown
      
    3. Rebooted Ubuntu 24.04.
  9. Verified that libcuda.so exists in /usr/lib/wsl/lib/libcuda.so and /usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so.
  10. Installed essential build tools:
    sudo apt -y install build-essential gcc g++ make libtool texinfo dpkg-dev pkg-config gfortran
    
  11. Installed OpenBLAS following the OpenBLAS Wiki because I also wanted to install it:
    sudo apt update
    sudo apt install libopenblas-dev
    
  12. Attempted GPU model fitting in R, but encountered a clGetPlatformIDs CL_PLATFORM_NOT_FOUND_KHR error. Fixed it by installing libpocl-dev:
    sudo apt install libpocl-dev
    
  13. Tried GPU model fitting in R again. All 4 chains finished successfully with a Mean chain execution time of 293.8 seconds and a Total execution time of 297.9 seconds. However, the performance is still comparable to using CPUs.
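Since installing libpocl-dev made the CL_PLATFORM_NOT_FOUND_KHR error go away, it may be worth checking which OpenCL platforms are actually registered with the ICD loader. A minimal sketch, assuming the ocl-icd loader and its conventional registry directory /etc/OpenCL/vendors; list_icds is a hypothetical helper name:

```shell
# Hypothetical helper: list the OpenCL ICD registration files in a vendors
# directory together with the library each one points to.
# Defaults to the conventional system location.
list_icds() {
  dir=${1:-/etc/OpenCL/vendors}
  for f in "$dir"/*.icd; do
    [ -e "$f" ] || continue   # directory missing or no .icd files
    printf '%s -> %s\n' "$(basename "$f")" "$(head -n 1 "$f")"
  done
  return 0
}

list_icds
```

If the only entry listed comes from PoCL, then the platform that Stan found is PoCL's rather than NVIDIA's, and it may expose only a CPU device. That would be one possible explanation for timings close to the CPU run, though it is a guess that needs confirming with clinfo.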

@mippeqf Thank you for your reply too.

I have not used nvtop, but I can monitor GPU utilisation through Windows Task Manager. When I executed the code to fit the model using the GPU (as shown in the code of the initial post; mod_cl and fit_cl), only 5.3GB out of 12GB of dedicated GPU memory was utilised. The GPU temperature remained constant at 40°C.

I have set STAN_OPENCL=true when compiling the model in R using cmdstanr, as demonstrated in my initial post:

mod_cl  <- cmdstan_model(
  temp_stan_file,
  cpp_options = list(stan_opencl = TRUE)
)

Should I also use STAN_OPENCL=true during the installation or rebuilding of cmdstan (i.e., cmdstanr::rebuild_cmdstan())? Any additional guidance on this matter would be greatly appreciated.

How high was the GPU utilization (in percent)? My RTX 4070 is hardly ever fully utilized even for complex models because at every iteration, data has to be copied to and from the GPU. As long as the utilization rises noticeably while Stan is running, I’d be quite confident that the interface is working. You can also try to raise the number of chains (and parallel_chains); that usually bumps up the GPU usage a lot.

On my hardware, it is necessary to set the flag both in the makefile and in the cpp_options (and to run rebuild_cmdstan() afterwards). Curiously, I’ve found that setting the cpp_options flag to FALSE behaves the same way as setting it to TRUE, and the only way to turn off GPU support is to remove the cpp_options flag altogether. Also note that the mechanism for automatically recompiling the model after a change does not kick in when the cpp_options are changed, i.e., you need to recompile the model manually after changing the cpp_options.


@mippeqf When running with 4 chains, the overall GPU memory usage, including both shared and dedicated memory, is around 4-7%. Increasing the number of chains to 20 did not change this. During the execution, indicated by messages such as Running MCMC with 4 chains, at most 4 in parallel... or Running MCMC with 20 chains, at most 20 in parallel..., the CPU usage reaches 100% rather than the GPU.

I am happy to hear that setting both STAN_OPENCL=true in the makefile and in cmdstanr’s cpp_options is effective. I would like to try this immediately, but how can I set STAN_OPENCL=true in the makefile? Could you provide guidance or point me to the relevant documentation?

Can you open the CPU tab in your task manager, right click the chart area, and select “Show kernel times”? If most of the chart area is a deeper blue while running the model, then I’d be rather confident in saying that the copying back and forth is the bottleneck. Light blue is work done on the CPU itself.

The make file for your cmdstan installation is usually in ~/.cmdstan/cmdstan-2.35.0/make/local, you can also find this with the cmdstan_path() command provided by the cmdstanr package. The local.example in the same folder gives examples for other optional flags. Do not touch the other files in the directory.
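To confirm that the flag is really active in make/local, a quick check along these lines might help. This is a sketch only: opencl_enabled is a hypothetical helper, and the path is the one mentioned in this thread, so adjust it to whatever cmdstan_path() returns.

```shell
# Hypothetical helper: report whether a CmdStan make/local file enables OpenCL.
# Matches an uncommented line that starts with the assignment, for example
# "STAN_OPENCL=true" or "STAN_OPENCL = true".
opencl_enabled() {
  if grep -Eq '^[[:space:]]*STAN_OPENCL[[:space:]]*[:+?]?=[[:space:]]*true' "$1" 2>/dev/null; then
    echo yes
  else
    echo no
  fi
}

# Path from this thread; adjust to the value returned by cmdstan_path().
opencl_enabled "$HOME/.cmdstan/cmdstan-2.35.0/make/local"
```

The check deliberately requires the assignment to start its own line. In a makefile, everything after the first assignment operator on a line becomes the value of that variable, so a STAN_OPENCL=true appended to the end of a CXXFLAGS or LDFLAGS line would not set the flag.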

@mippeqf

I confirmed that the chart area which is present when I select “Show kernel times” on the CPU tab in the Task Manager was a deeper blue while running the model.

Moreover, when I attempted GPU model fitting after setting STAN_OPENCL=true in the make/local file for cmdstan and rebuilding cmdstan with rebuild_cmdstan(), there was no evidence of GPU usage, and the CPU utilisation reached 100%. What I did is listed below:

(for 1. – 11., see my previous post)
12. Added STAN_OPENCL=true to root/.cmdstan/cmdstan-2.35.0/make/local. The content of the local file is as follows:
    CXXFLAGS += -Wno-deprecated-declarations
    LDFLAGS += -L"/usr/local/cuda-12.1/targets/x86_64-linux/lib" -lOpenCL
    STAN_OPENCL=true
13. Executed rebuild_cmdstan(cores = 34).
14. Attempted GPU model fitting in R (fit_cl). There was no evidence of GPU usage, and CPU utilisation reached 100% when the chains were running.
15. Installed nvidia-opencl-dev using sudo apt install nvidia-opencl-dev.
16. Executed rebuild_cmdstan(cores = 34).
17. Attempted GPU model fitting in R (fit_cl) again. There was still no evidence of GPU usage, and CPU utilisation reached 100% when the chains were running.
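A further check that could narrow this down is to look at which OpenCL library the compiled model executable is linked against. The sketch below assumes the executable links libOpenCL dynamically (which the -lOpenCL flag above would normally produce); opencl_links is a hypothetical helper that just filters ldd output.

```shell
# Hypothetical helper: read `ldd` output on stdin and print the OpenCL
# libraries it mentions, one "name => path" entry per line, with the
# load address stripped.
opencl_links() {
  grep -i 'opencl' | sed 's/^[[:space:]]*//; s/ (0x[0-9a-f]*)$//'
}
```

For example, `ldd /path/to/model_executable | opencl_links`, where the path is whatever mod_cl$exe_file() prints in R. If nothing is printed, the model was probably compiled without OpenCL. If a library is printed, its path shows whether the CUDA toolkit's libOpenCL.so or another ICD loader is being picked up at run time.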


I have two questions related to path settings that I would like to ask both you and @WardBrian. If you have any insights, I would appreciate it if you could let me know.

  1. In the official article of CmdStanR, the path to OpenCL.lib within the Windows CUDA directory was specified before rebuilding cmdstan. However, on Ubuntu, particularly on WSL2, there is no OpenCL.lib. On my Ubuntu 24.04 on WSL2, there is a libOpenCL.so in /usr/local/cuda-12.1/targets/x86_64-linux/lib. Should I set the path to this libOpenCL.so and then rebuild cmdstan? I have already tried this approach, but I am still wondering whether it is the right track.

    path_to_opencl_lib <- "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.5/lib/x64"
    cpp_options = list(
      paste0("LDFLAGS+= -L\"",path_to_opencl_lib,"\" -lOpenCL")
    )
    
    cmdstanr::cmdstan_make_local(cpp_options = cpp_options)
    cmdstanr::rebuild_cmdstan()
    
  2. Could you please share your Ubuntu path settings, if you have any PATH settings related to NVIDIA/CUDA that you are able to disclose? I would just like to know whether the CUDA bin directory (e.g., /usr/local/cuda-12.1/bin) is included in your PATH variable and whether the CUDA library directory (e.g., /usr/local/cuda-12.1/lib64) is included in your LD_LIBRARY_PATH variable. I want to see whether my PATH settings are right.

The libOpenCL.so is exactly what you need. I don’t think it’s necessary to have the cuda bin folder in your path, but it definitely doesn’t hurt. I have also set LD_LIBRARY_PATH to the lib64 folder, but to be sure, you can also add the one that is named just lib.

The only other thing I can think of is to run clinfo and check whether your RTX is actually the device with ID 0 on platform 0, and else adjust the opencl_ids argument. If that doesn’t work, I’m out of ideas, and the only thing I can suggest is to have someone take a look at your setup.
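To read the platform and device indices off clinfo without counting by eye, something like the following could be used. It is a sketch that assumes the `Platform #N:` / `Device #N:` layout printed by `clinfo --list`; find_cl_device is a hypothetical helper name.

```shell
# Hypothetical helper: read `clinfo --list` output on stdin and print
# "<platform> <device>" for every device whose name contains the given
# text (case-insensitive).
find_cl_device() {
  awk -v pat="$1" '
    /Platform #[0-9]+:/ {
      p = $0; sub(/.*Platform #/, "", p); sub(/:.*/, "", p)
    }
    /Device #[0-9]+:/ {
      d = $0; sub(/.*Device #/, "", d)
      name = d; sub(/^[0-9]+: */, "", name)   # text after "N: "
      sub(/:.*/, "", d)                       # keep only the index
      if (index(tolower(name), tolower(pat)) > 0) print p, d
    }
  '
}
```

With `clinfo --list | find_cl_device rtx`, the two numbers printed are the values to pass, in that order, as opencl_ids = c(platform, device) in $sample().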

@mippeqf

Thank you for your reply. I’m relieved to hear that my path settings seem to be correct. Just to be thorough, here are my PATH and LD_LIBRARY_PATH settings:

$ echo $PATH

/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/usr/local/cuda-12.1/bin::/opt/quarto/bin:/home/<USER_NAME>/.vscode-server/bin/<SOME_HASH>/bin/remote-cli:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/usr/local/cuda-12.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/snap/bin:/usr/bin/R:/usr/bin/R
$ echo $LD_LIBRARY_PATH

/usr/local/cuda-12.1/lib64::/usr/lib/wsl/lib:/usr/local/cuda-12.1/lib64::/usr/lib/wsl/lib:
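Both variables above contain duplicate entries and empty entries (the `::` runs and the trailing `:`). Duplicates are harmless, but the dynamic linker treats an empty LD_LIBRARY_PATH entry as the current working directory, which is usually unintended. This is unlikely to be the cause of the GPU problem, but the lists can be tidied with a small helper. clean_path_list is a hypothetical name used only for this sketch.

```shell
# Hypothetical helper: remove empty and duplicate entries from a
# colon-separated path list, keeping the first occurrence of each directory.
clean_path_list() {
  printf '%s' "$1" | awk -v RS=':' '
    { sub(/\n$/, "") }                       # drop a stray trailing newline
    $0 != "" && !seen[$0]++ { out = (out == "" ? $0 : out ":" $0) }
    END { print out }
  '
}
```

For example, `export LD_LIBRARY_PATH=$(clean_path_list "$LD_LIBRARY_PATH")` would reduce the value shown above to `/usr/local/cuda-12.1/lib64:/usr/lib/wsl/lib`. The duplicates suggest the export lines in ~/.bashrc are being evaluated more than once, so removing the repeated lines there is the longer-term fix.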

Previously, clinfo did not recognise the GPU, but only the CPU. I followed the steps to install PoCL as outlined below. As a result, the GPU was recognised by clinfo, but when I tried model fitting with cmdstanr (i.e., running mod_cl$sample(...)), I encountered an error for all chains: Chain <CHAIN_NUMBER> OpenCL Initialization: [Device] CL_INVALID_DEVICE: Unknown error -1.

Here is an update on the progress:

  1. Since clinfo did not recognise the GPU, only the CPU, I followed the steps from this GitHub issue comment to install PoCL.
    1. Executed the following commands to install PoCL according to the official PoCL installation guide:
      export LLVM_VERSION=18
      apt install -y python3-dev libpython3-dev build-essential ocl-icd-libopencl1 \
          cmake git pkg-config libclang-${LLVM_VERSION}-dev clang-${LLVM_VERSION} \
          llvm-${LLVM_VERSION} make ninja-build ocl-icd-libopencl1 ocl-icd-dev \
          ocl-icd-opencl-dev libhwloc-dev zlib1g zlib1g-dev clinfo dialog apt-utils \
          libxml2-dev libclang-cpp${LLVM_VERSION}-dev libclang-cpp${LLVM_VERSION} \
          llvm-${LLVM_VERSION}-dev
      
    2. Downloaded PoCL:
      wget https://github.com/pocl/pocl/archive/refs/tags/v6.0.tar.gz
      
    3. Extracted the tarball:
      tar -xzvf v6.0.tar.gz
      
    4. Changed to the PoCL directory:
      cd pocl-6.0
      
    5. Created a build directory:
      mkdir build
      
    6. Built PoCL following the instructions from the GitHub issue comment:
      cmake -B build \
          -DCMAKE_C_FLAGS=-L/usr/lib/wsl/lib \
          -DCMAKE_CXX_FLAGS=-L/usr/lib/wsl/lib \
          -DENABLE_HOST_CPU_DEVICES=ON \
          -DENABLE_CUDA=ON
      
    7. Compiled PoCL:
      cmake --build build -j34
      
    8. Added environment variables to .bashrc:
      echo 'export POCL_BUILDING=1' >> ~/.bashrc
      echo 'export OCL_ICD_VENDORS=<FULL_PATH_OF_MY_HOME_DIR>/pocl-6.0/build/ocl-vendors/' >> ~/.bashrc
      
      source ~/.bashrc
      
    9. Installed PoCL:
      cmake --install build
      
    10. Verified GPU recognition with clinfo --list:
      $ clinfo --list
      
      Platform #0: Portable Computing Language
          +-- Device #0: cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
          `-- Device #1: NVIDIA GeForce RTX 3060
      
  2. Knowing that the GPU is Platform #0, Device #1, I rebuilt cmdstan and modified fit_cl as follows, but encountered the error Chain <CHAIN_NUMBER> OpenCL Initialization: [Device] CL_INVALID_DEVICE: Unknown error -1 for all chains:
    fit_cl <- mod_cl$sample(
      data = mdata,
      chains = 4,
      parallel_chains = 4,
      opencl_ids = c(0, 1), # Indicating my GPU is Platform #0, Device #1
      refresh = 0
    )
    

Thank you again for all your ideas and assistance.

Does clinfo also list the OpenCL version if you drop the --list? I’m not sure how this works in detail, but I suppose OpenCL has to recognize the hardware in order for Stan to be able to use the GPU via OpenCL. Did you also recompile the model?

As shown below, for the GPU, Device Version is OpenCL 3.0 PoCL HSTR: CUDA-sm_75, and Device OpenCL C Version is OpenCL C 1.2 PoCL. OpenCL seems to recognise my GPU, doesn’t it?

Moreover, I recompiled the model by setting force_recompile = TRUE in cmdstan_model(), but fitting the model with opencl_ids = c(0, 1) still fails.

$ clinfo

Number of platforms                               1
  Platform Name                                   Portable Computing Language
  Platform Vendor                                 The pocl project
  Platform Version                                OpenCL 3.0 PoCL 6.0  Linux, RelWithDebInfo, RELOC, LLVM 18.1.3, SLEEF, CUDA, POCL_DEBUG
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_khr_priority_hints cl_khr_throttle_hints cl_pocl_content_size cl_ext_buffer_device_address
  Platform Extensions with Version                cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_khr_priority_hints                                            0x400000 (1.0.0)
                                                  cl_khr_throttle_hints                                            0x400000 (1.0.0)
                                                  cl_pocl_content_size                                             0x400000 (1.0.0)
                                                  cl_ext_buffer_device_address                                       0x1000 (0.1.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             POCL
  Platform Host timer resolution                  1ns

  Platform Name                                   Portable Computing Language
Number of devices                                 2
  Device Name                                     cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  Device Vendor                                   GenuineIntel
  Device Vendor ID                                0x10006
  Device Version                                  OpenCL 3.0 PoCL HSTR: cpu-x86_64-pc-linux-gnu-cascadelake
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  6.0
  Device OpenCL C Version                         OpenCL C 1.2 PoCL
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                                  OpenCL C                                                         0x401000 (1.1.0)
                                                  OpenCL C                                                         0x402000 (1.2.0)
                                                  OpenCL C                                                         0xc00000 (3.0.0)
  Device OpenCL C features                        __opencl_c_3d_image_writes                                       0xc00000 (3.0.0)
                                                  __opencl_c_images                                                0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_acq_rel                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_seq_cst                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_scope_device                                   0xc00000 (3.0.0)
                                                  __opencl_c_program_scope_global_variables                        0xc00000 (3.0.0)
                                                  __opencl_c_atomic_scope_all_devices                              0xc00000 (3.0.0)
                                                  __opencl_c_generic_address_space                                 0xc00000 (3.0.0)
                                                  __opencl_c_work_group_collective_functions                       0xc00000 (3.0.0)
                                                  __opencl_c_read_write_images                                     0xc00000 (3.0.0)
                                                  __opencl_c_subgroups                                             0xc00000 (3.0.0)
                                                  __opencl_c_fp16                                                  0xc00000 (3.0.0)
                                                  __opencl_c_fp64                                                  0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_global_atomic_add                            0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_local_atomic_add                             0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_global_atomic_min_max                        0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_local_atomic_min_max                         0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_global_atomic_add                            0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_local_atomic_add                             0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_global_atomic_min_max                        0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_local_atomic_min_max                         0xc00000 (3.0.0)
                                                  __opencl_c_int64                                                 0xc00000 (3.0.0)
  Latest conformance test passed                  v2022-04-19-01
  Device Type                                     CPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               36
  Max clock frequency                             2999MHz
  Device Partition                                (core)
    Max number of sub-devices                     36
    Supported partition types                     equally, by counts
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             4096x4096x4096
  Max work group size                             4096
  Preferred work group size multiple (device)     8
  Preferred work group size multiple (kernel)     8
  Max sub-groups per work group                   128
  Sub-group sizes (Intel)                         1, 2, 4, 8, 16, 32, 64, 128, 256, 512
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                               16 / 16      
    int                                                 16 / 16      
    long                                                 8 / 8       
    half                                                16 / 16       (cl_khr_fp16)
    float                                               16 / 16      
    double                                               8 / 8        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              65114288128 (60.64GiB)
  Error Correction support                        No
  Max memory allocation                           17179869184 (16GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   Yes
    Atomics                                       Yes
  Unified Shared Memory (USM)                     (cl_intel_unified_shared_memory)
  Host USM capabilities (Intel)                   USM access, USM atomic access
  Device USM capabilities (Intel)                 USM access, USM atomic access
  Single-Device USM caps (Intel)                  USM access, USM atomic access
  Cross-Device USM caps (Intel)                   (n/a)
  Shared System USM caps (Intel)                  (n/a)
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics                 
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         64 bytes
  Atomic memory capabilities                      relaxed, acquire/release, sequentially-consistent, work-group scope, device scope, all-devices scope
  Atomic fence capabilities                       relaxed, acquire/release, sequentially-consistent, work-item scope, work-group scope, device scope
  Max size for global variable                    64000 (62.5KiB)
  Preferred total size of global vars             1048576 (1024KiB)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        25952256 (24.75MiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            1073741824 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   0 bytes
    Pitch alignment for 2D image buffers          0 pixels
    Max 2D image size                             32768x32768 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                128
    Max number of read/write image args           128
  Pipe support                                    No
  Max number of pipe args                         0
  Max active pipe reservations                    0
  Max pipe packet size                            0
  Local memory type                               Global
  Local memory size                               1048576 (1024KiB)
  Max number of constant args                     8
  Max constant buffer size                        1048576 (1024KiB)
  Generic address space support                   Yes
  Max size of kernel argument                     1024
  Queue properties (on host)                      
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Device enqueue capabilities                     (n/a)
  Queue properties (on device)                    
    Out-of-order execution                        No
    Profiling                                     No
    Preferred size                                0
    Max size                                      0
  Max queues on device                            0
  Max events on device                            0
  Command buffer capabilities                     kernel printf, simultaneous use, out of order, 0x10
    Required queue properties for command buffer  
    Out-of-order execution                        No
    Profiling                                     No
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    Non-uniform work-groups                       No
    Work-group collective functions               Yes
    Sub-group independent forward progress        Yes
    IL version                                    (n/a)
    ILs with version                              (n/a)
  printf() buffer size                            16777216 (16MiB)
  Built-in kernels                                pocl.add.i8;org.khronos.openvx.scale_image.nn.u8;org.khronos.openvx.scale_image.bl.u8;org.khronos.openvx.tensor_convert_depth.wrap.u8.f32
  Built-in kernels with version                   pocl.add.i8                                                      0x402000 (1.2.0)
                                                  org.khronos.openvx.scale_image.nn.u8                             0x402000 (1.2.0)
                                                  org.khronos.openvx.scale_image.bl.u8                             0x402000 (1.2.0)
                                                  org.khronos.openvx.tensor_convert_depth.wrap.u8.f32              0x402000 (1.2.0)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_command_buffer cl_khr_command_buffer_multi_device cl_khr_subgroups cl_intel_unified_shared_memory cl_ext_buffer_device_address       cl_pocl_svm_rect cl_pocl_command_buffer_svm       cl_pocl_command_buffer_host_buffer cl_khr_subgroup_ballot cl_khr_subgroup_shuffle cl_intel_subgroups cl_intel_subgroups_short cl_ext_float_atomics cl_intel_required_subgroup_size cl_khr_fp16 cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
  Device Extensions with Version                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_3d_image_writes                                           0x400000 (1.0.0)
                                                  cl_khr_command_buffer                                              0x9004 (0.9.4)
                                                  cl_khr_command_buffer_multi_device                                 0x9001 (0.9.1)
                                                  cl_khr_subgroups                                                 0x400000 (1.0.0)
                                                  cl_intel_unified_shared_memory                                   0x400000 (1.0.0)
                                                  cl_ext_buffer_device_address                                       0x1000 (0.1.0)
                                                  cl_pocl_svm_rect                                                   0x9000 (0.9.0)
                                                  cl_pocl_command_buffer_svm                                         0x9000 (0.9.0)
                                                  cl_pocl_command_buffer_host_buffer                                 0x9000 (0.9.0)
                                                  cl_khr_subgroup_ballot                                           0x400000 (1.0.0)
                                                  cl_khr_subgroup_shuffle                                          0x400000 (1.0.0)
                                                  cl_intel_subgroups                                               0x400000 (1.0.0)
                                                  cl_intel_subgroups_short                                         0x400000 (1.0.0)
                                                  cl_ext_float_atomics                                             0x400000 (1.0.0)
                                                  cl_intel_required_subgroup_size                                  0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_fp64                                                      0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)

...<CONTINUED>

And this is the GPU part of clinfo:

  Device Name                                     NVIDIA GeForce RTX 3060
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 3.0 PoCL HSTR: CUDA-sm_75
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  6.0
  Device OpenCL C Version                         OpenCL C 1.2 PoCL
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                                  OpenCL C                                                         0x401000 (1.1.0)
                                                  OpenCL C                                                         0x402000 (1.2.0)
                                                  OpenCL C                                                         0xc00000 (3.0.0)
  Device OpenCL C features                        __opencl_c_images                                                0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_acq_rel                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_seq_cst                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_scope_device                                   0xc00000 (3.0.0)
                                                  __opencl_c_program_scope_global_variables                        0xc00000 (3.0.0)
                                                  __opencl_c_generic_address_space                                 0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_global_atomic_add                            0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_local_atomic_add                             0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_global_atomic_min_max                        0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp32_local_atomic_min_max                         0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_global_atomic_add                            0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_local_atomic_add                             0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_global_atomic_min_max                        0xc00000 (3.0.0)
                                                  __opencl_c_ext_fp64_local_atomic_min_max                         0xc00000 (3.0.0)
                                                  __opencl_c_fp16                                                  0xc00000 (3.0.0)
                                                  __opencl_c_fp64                                                  0xc00000 (3.0.0)
  Latest conformance test passed                  (n/a)
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 0000:65:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               28
  Max clock frequency                             1777MHz
  Compute Capability (NV)                         8.6
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple (device)     32
  Preferred work group size multiple (kernel)     32
  Warp size (NV)                                  32
  Max sub-groups per work group                   32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (cl_khr_fp16)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              12884377600 (12GiB)
  Error Correction support                        No
  Max memory allocation                           11793334272 (10.98GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Preferred alignment for atomics                 
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         64 bytes
  Atomic memory capabilities                      relaxed, work-group scope
  Atomic fence capabilities                       relaxed, acquire/release, work-group scope
  Max size for global variable                    0
  Preferred total size of global vars             0
  Global Memory cache type                        None
  Image support                                   No
  Pipe support                                    No
  Max number of pipe args                         0
  Max active pipe reservations                    0
  Max pipe packet size                            0
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Generic address space support                   Yes
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties (on host)                      
    Out-of-order execution                        No
    Profiling                                     Yes
  Device enqueue capabilities                     (n/a)
  Queue properties (on device)                    
    Out-of-order execution                        No
    Profiling                                     No
    Preferred size                                0
    Max size                                      0
  Max queues on device                            0
  Max events on device                            0
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Non-uniform work-groups                       No
    Work-group collective functions               No
    Sub-group independent forward progress        Yes
    Kernel execution timeout (NV)                 Yes
    Concurrent copy and kernel execution (NV)     Yes
      Number of async copy engines                5
    IL version                                    (n/a)
    ILs with version                              (n/a)
  printf() buffer size                            16777216 (16MiB)
  Built-in kernels                                pocl.mul.i32;pocl.add.i32;pocl.dnn.conv2d_int8_relu;pocl.sgemm.local.f32;pocl.sgemm.tensor.f16f16f32;pocl.sgemm_ab.tensor.f16f16f32;pocl.abs.f32;pocl.add.i8;org.khronos.openvx.scale_image.nn.u8;org.khronos.openvx.scale_image.bl.u8;org.khronos.openvx.tensor_convert_depth.wrap.u8.f32
  Built-in kernels with version                   pocl.mul.i32                                                     0x402000 (1.2.0)
                                                  pocl.add.i32                                                     0x402000 (1.2.0)
                                                  pocl.dnn.conv2d_int8_relu                                        0x402000 (1.2.0)
                                                  pocl.sgemm.local.f32                                             0x402000 (1.2.0)
                                                  pocl.sgemm.tensor.f16f16f32                                      0x402000 (1.2.0)
                                                  pocl.sgemm_ab.tensor.f16f16f32                                   0x402000 (1.2.0)
                                                  pocl.abs.f32                                                     0x402000 (1.2.0)
                                                  pocl.add.i8                                                      0x402000 (1.2.0)
                                                  org.khronos.openvx.scale_image.nn.u8                             0x402000 (1.2.0)
                                                  org.khronos.openvx.scale_image.bl.u8                             0x402000 (1.2.0)
                                                  org.khronos.openvx.tensor_convert_depth.wrap.u8.f32              0x402000 (1.2.0)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics     cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics     cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics     cl_khr_int64_extended_atomics cl_nv_device_attribute_query cl_ext_float_atomics cl_khr_fp16 cl_khr_fp64 cl_ext_buffer_device_address cl_khr_subgroup_ballot cl_khr_subgroup_shuffle
  Device Extensions with Version                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                                  cl_nv_device_attribute_query                                     0x400000 (1.0.0)
                                                  cl_ext_float_atomics                                             0x400000 (1.0.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_fp64                                                      0x400000 (1.0.0)
                                                  cl_ext_buffer_device_address                                       0x1000 (0.1.0)
                                                  cl_khr_subgroup_ballot                                           0x400000 (1.0.0)
                                                  cl_khr_subgroup_shuffle                                          0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [POCL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   NVIDIA GeForce RTX 3060
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (2)
    Platform Name                                 Portable Computing Language
    Device Name                                   cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
    Device Name                                   NVIDIA GeForce RTX 3060

I converted all output of clinfo into a table:

| Property | cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz | NVIDIA GeForce RTX 3060 |
|---|---|---|
| Device Name | cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz | NVIDIA GeForce RTX 3060 |
| Device Vendor | GenuineIntel | NVIDIA Corporation |
| Device Vendor ID | 0x10006 | 0x10de |
| Device Version | OpenCL 3.0 PoCL HSTR: cpu-x86_64-pc-linux-gnu-cascadelake | OpenCL 3.0 PoCL HSTR: CUDA-sm_75 |
| Device Numeric Version | 0xc00000 (3.0.0) | 0xc00000 (3.0.0) |
| Driver Version | 6.0 | 6.0 |
| Device OpenCL C Version | OpenCL C 1.2 PoCL | OpenCL C 1.2 PoCL |
| Device OpenCL C all versions | OpenCL C 0x400000 (1.0.0), OpenCL C 0x401000 (1.1.0), OpenCL C 0x402000 (1.2.0), OpenCL C 0xc00000 (3.0.0) | OpenCL C 0x400000 (1.0.0), OpenCL C 0x401000 (1.1.0), OpenCL C 0x402000 (1.2.0), OpenCL C 0xc00000 (3.0.0) |
| Device OpenCL C features | __opencl_c_3d_image_writes 0xc00000 (3.0.0), __opencl_c_images 0xc00000 (3.0.0), __opencl_c_atomic_order_acq_rel 0xc00000 (3.0.0), __opencl_c_atomic_order_seq_cst 0xc00000 (3.0.0), __opencl_c_atomic_scope_device 0xc00000 (3.0.0), __opencl_c_program_scope_global_variables 0xc00000 (3.0.0), __opencl_c_atomic_scope_all_devices 0xc00000 (3.0.0), __opencl_c_generic_address_space 0xc00000 (3.0.0), __opencl_c_work_group_collective_functions 0xc00000 (3.0.0), __opencl_c_read_write_images 0xc00000 (3.0.0), __opencl_c_subgroups 0xc00000 (3.0.0), __opencl_c_fp16 0xc00000 (3.0.0), __opencl_c_fp64 0xc00000 (3.0.0), __opencl_c_ext_fp32_global_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp32_local_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp32_global_atomic_min_max 0xc00000 (3.0.0), __opencl_c_ext_fp32_local_atomic_min_max 0xc00000 (3.0.0), __opencl_c_ext_fp64_global_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp64_local_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp64_global_atomic_min_max 0xc00000 (3.0.0), __opencl_c_ext_fp64_local_atomic_min_max 0xc00000 (3.0.0), __opencl_c_int64 0xc00000 (3.0.0) | __opencl_c_images 0xc00000 (3.0.0), __opencl_c_atomic_order_acq_rel 0xc00000 (3.0.0), __opencl_c_atomic_order_seq_cst 0xc00000 (3.0.0), __opencl_c_atomic_scope_device 0xc00000 (3.0.0), __opencl_c_program_scope_global_variables 0xc00000 (3.0.0), __opencl_c_generic_address_space 0xc00000 (3.0.0), __opencl_c_ext_fp32_global_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp32_local_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp32_global_atomic_min_max 0xc00000 (3.0.0), __opencl_c_ext_fp32_local_atomic_min_max 0xc00000 (3.0.0), __opencl_c_ext_fp64_global_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp64_local_atomic_add 0xc00000 (3.0.0), __opencl_c_ext_fp64_global_atomic_min_max 0xc00000 (3.0.0), __opencl_c_ext_fp64_local_atomic_min_max 0xc00000 (3.0.0), __opencl_c_fp16 0xc00000 (3.0.0), __opencl_c_fp64 0xc00000 (3.0.0) |
| Latest conformance test passed | v2022-04-19-01 | (n/a) |
| Device Type | CPU | GPU |
| Device Profile | FULL_PROFILE | FULL_PROFILE |
| Device Available | Yes | Yes |
| Compiler Available | Yes | Yes |
| Linker Available | Yes | Yes |
| Max compute units | 36 | 28 |
| Max clock frequency | 2999MHz | 1777MHz |
| Device Partition | (core) Max number of sub-devices 36, Supported partition types equally, by counts, Supported affinity domains (n/a) | (core) Max number of sub-devices 1, Supported partition types None, Supported affinity domains (n/a) |
| Max work item dimensions | 3 | 3 |
| Max work item sizes | 4096x4096x4096 | 1024x1024x64 |
| Max work group size | 4096 | 1024 |
| Preferred work group size multiple (device) | 8 | 32 |
| Preferred work group size multiple (kernel) | 8 | 32 |
| Max sub-groups per work group | 128 | 32 |
| Sub-group sizes (Intel) | 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 | (n/a) |
| Preferred / native vector sizes | char 16 / 16, short 16 / 16, int 16 / 16, long 8 / 8, half 16 / 16 (cl_khr_fp16), float 16 / 16, double 8 / 8 (cl_khr_fp64) | char 1 / 1, short 1 / 1, int 1 / 1, long 1 / 1, half 0 / 0 (cl_khr_fp16), float 1 / 1, double 1 / 1 (cl_khr_fp64) |
| Half-precision Floating-point support | (cl_khr_fp16) Denormals No, Infinity and NANs Yes, Round to nearest Yes, Round to zero No, Round to infinity No, IEEE754-2008 fused multiply-add No, Support is emulated in software No | (cl_khr_fp16) Denormals No, Infinity and NANs Yes, Round to nearest Yes, Round to zero No, Round to infinity No, IEEE754-2008 fused multiply-add No, Support is emulated in software No |
| Single-precision Floating-point support | (core) Denormals Yes, Infinity and NANs Yes, Round to nearest Yes, Round to zero Yes, Round to infinity Yes, IEEE754-2008 fused multiply-add Yes, Support is emulated in software No, Correctly-rounded divide and sqrt operations Yes | (core) Denormals Yes, Infinity and NANs Yes, Round to nearest Yes, Round to zero Yes, Round to infinity Yes, IEEE754-2008 fused multiply-add Yes, Support is emulated in software No, Correctly-rounded divide and sqrt operations No |
| Double-precision Floating-point support | (cl_khr_fp64) Denormals Yes, Infinity and NANs Yes, Round to nearest Yes, Round to zero Yes, Round to infinity Yes, IEEE754-2008 fused multiply-add Yes, Support is emulated in software No | (cl_khr_fp64) Denormals Yes, Infinity and NANs Yes, Round to nearest Yes, Round to zero Yes, Round to infinity Yes, IEEE754-2008 fused multiply-add Yes, Support is emulated in software No |
| Address bits | 64, Little-Endian | 64, Little-Endian |
| Global memory size | 65114288128 (60.64GiB) | 12884377600 (12GiB) |
| Error Correction support | No | No |
| Max memory allocation | 17179869184 (16GiB) | 11793334272 (10.98GiB) |
| Unified memory for Host and Device | Yes | No |
| Shared Virtual Memory (SVM) capabilities | (core) Coarse-grained buffer sharing Yes, Fine-grained buffer sharing Yes, Fine-grained system sharing Yes, Atomics Yes | (core) Coarse-grained buffer sharing Yes, Fine-grained buffer sharing Yes, Fine-grained system sharing No, Atomics No |
| Unified Shared Memory (USM) | (cl_intel_unified_shared_memory) | (n/a) |
| Host USM capabilities (Intel) | USM access, USM atomic access | (n/a) |
| Device USM capabilities (Intel) | USM access, USM atomic access | (n/a) |
| Single-Device USM caps (Intel) | USM access, USM atomic access | (n/a) |
| Cross-Device USM caps (Intel) | (n/a) | (n/a) |
| Shared System USM caps (Intel) | (n/a) | (n/a) |
| Minimum alignment for any data type | 128 bytes | 128 bytes |
| Alignment of base address | 1024 bits (128 bytes) | 4096 bits (512 bytes) |
| Preferred alignment for atomics | SVM 64 bytes, Global 64 bytes, Local 64 bytes | SVM 64 bytes, Global 64 bytes, Local 64 bytes |
| Atomic memory capabilities | relaxed, acquire/release, sequentially-consistent, work-group scope, device scope, all-devices scope | relaxed, work-group scope |
| Atomic fence capabilities | relaxed, acquire/release, sequentially-consistent, work-item scope, work-group scope, device scope | relaxed, acquire/release, work-group scope |
| Max size for global variable | 64000 (62.5KiB) | 0 |
| Preferred total size of global vars | 1048576 (1024KiB) | 0 |
| Global Memory cache type | Read/Write | None |
| Global Memory cache size | 25952256 (24.75MiB) | (n/a) |
| Global Memory cache line size | 64 bytes | (n/a) |
| Image support | Yes | No |
| Max number of samplers per kernel | 16 | (n/a) |
| Max size for 1D images from buffer | 1073741824 pixels | (n/a) |
| Max 1D or 2D image array size | 2048 images | (n/a) |
| Base address alignment for 2D image buffers | 0 bytes | (n/a) |
| Pitch alignment for 2D image buffers | 0 pixels | (n/a) |
| Max 2D image size | 32768x32768 pixels | (n/a) |
| Max 3D image size | 2048x2048x2048 pixels | (n/a) |
| Max number of read image args | 128 | (n/a) |
| Max number of write image args | 128 | (n/a) |
| Max number of read/write image args | 128 | (n/a) |
| Pipe support | No | No |
| Max number of pipe args | 0 | 0 |
| Max active pipe reservations | 0 | 0 |
| Max pipe packet size | 0 | 0 |
| Local memory type | Global | Local |
| Local memory size | 1048576 (1024KiB) | 49152 (48KiB) |
| Max number of constant args | 8 | 8 |
| Max constant buffer size | 1048576 (1024KiB) | 65536 (64KiB) |
| Generic address space support | Yes | Yes |
| Max size of kernel argument | 1024 | 4352 (4.25KiB) |
| Queue properties (on host) | Out-of-order execution Yes, Profiling Yes | Out-of-order execution No, Profiling Yes |
| Device enqueue capabilities | (n/a) | (n/a) |
| Queue properties (on device) | Out-of-order execution No, Profiling No, Preferred size 0, Max size 0 | Out-of-order execution No, Profiling No, Preferred size 0, Max size 0 |
| Max queues on device | 0 | 0 |
| Max events on device | 0 | 0 |
| Command buffer capabilities | kernel printf, simultaneous use, out of order, 0x10 | kernel printf, simultaneous use, out of order, 0x10 |
| Required queue properties for command buffer | Out-of-order execution No, Profiling No | Out-of-order execution No, Profiling No |
| Prefer user sync for interop | Yes | Yes |
| Profiling timer resolution | 1ns | 1ns |
| Execution capabilities | Run OpenCL kernels Yes, Run native kernels Yes, Non-uniform work-groups No, Work-group collective functions Yes, Sub-group independent forward progress Yes | Run OpenCL kernels Yes, Run native kernels No, Non-uniform work-groups No, Work-group collective functions No, Sub-group independent forward progress Yes |
| IL version | (n/a) | (n/a) |
| ILs with version | (n/a) | (n/a) |
| printf() buffer size | 16777216 (16MiB) | 16777216 (16MiB) |
| Built-in kernels | pocl.add.i8;org.khronos.openvx.scale_image.nn.u8;org.khronos.openvx.scale_image.bl.u8;org.khronos.openvx.tensor_convert_depth.wrap.u8.f32 | pocl.mul.i32;pocl.add.i32;pocl.dnn.conv2d_int8_relu;pocl.sgemm.local.f32;pocl.sgemm.tensor.f16f16f32;pocl.sgemm_ab.tensor.f16f16f32;pocl.abs.f32;pocl.add.i8;org.khronos |
| Built-in kernels with version | pocl.add.i8 0x402000 (1.2.0), org.khronos.openvx.scale_image.nn.u8 0x402000 (1.2.0), org.khronos.openvx.scale_image.bl.u8 0x402000 (1.2.0), org.khronos.openvx.tensor_convert_depth.wrap.u8.f32 0x402000 (1.2.0) | pocl.mul.i32 0x402000 (1.2.0), pocl.add.i32 0x402000 (1.2.0), pocl.dnn.conv2d_int8_relu 0x402000 (1.2.0), pocl.sgemm.local.f32 0x402000 (1.2.0), pocl.sgemm.tensor.f16f16f32 0x402000 (1.2.0), pocl.sgemm_ab.tensor.f16f16f32 0x402000 (1.2.0), pocl.abs.f32 0x402000 (1.2.0), pocl.add.i8 0x402000 (1.2.0), org.khronos.openvx.scale_image.nn.u8 0x402000 (1.2.0), org.khronos.openvx.scale_image.bl.u8 0x402000 (1.2.0), org.khronos.openvx.tensor_convert_depth.wrap.u8.f32 0x402000 (1.2.0) |
| Device Extensions | cl_khr_byte_addressable_store, cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_khr_local_int32_extended_atomics, cl_khr_3d_image_writes, cl_khr_command_buffer, cl_khr_command_buffer_multi_device, cl_khr_subgroups, cl_intel_unified_shared_memory, cl_ext_buffer_device_address, cl_pocl_svm_rect, cl_pocl_command_buffer_svm, cl_pocl_command_buffer_host_buffer, cl_khr_subgroup_ballot, cl_khr_subgroup_shuffle, cl_intel_subgroups, cl_intel_subgroups_short, cl_ext_float_atomics, cl_intel_required_subgroup_size, cl_khr_fp16, cl_khr_fp64, cl_khr_int64_base_atomics, cl_khr_int64_extended_atomics | cl_khr_byte_addressable_store, cl_khr_global_int32_base_atomics, cl_khr_global_int32_extended_atomics, cl_khr_local_int32_base_atomics, cl_khr_local_int32_extended_atomics, cl_khr_int64_base_atomics, cl_khr_int64_extended_atomics, cl_nv_device_attribute_query, cl_ext_float_atomics, cl_khr_fp16, cl_khr_fp64, cl_ext_buffer_device_address, cl_khr_subgroup_ballot, cl_khr_subgroup_shuffle |
| Device Extensions with Version | cl_khr_byte_addressable_store 0x400000 (1.0.0), cl_khr_global_int32_base_atomics 0x400000 (1.0.0), cl_khr_global_int32_extended_atomics 0x400000 (1.0.0), cl_khr_local_int32_base_atomics 0x400000 (1.0.0), cl_khr_local_int32_extended_atomics 0x400000 (1.0.0), cl_khr_3d_image_writes 0x400000 (1.0.0), cl_khr_command_buffer 0x9004 (0.9.4), cl_khr_command_buffer_multi_device 0x9001 (0.9.1), cl_khr_subgroups 0x400000 (1.0.0), cl_intel_unified_shared_memory 0x400000 (1.0.0), cl_ext_buffer_device_address 0x1000 (0.1.0), cl_pocl_svm_rect 0x9000 (0.9.0), cl_pocl_command_buffer_svm 0x9000 (0.9.0), cl_pocl_command_buffer_host_buffer 0x9000 (0.9.0), cl_khr_subgroup_ballot 0x400000 (1.0.0), cl_khr_subgroup_shuffle 0x400000 (1.0.0), cl_intel_subgroups 0x400000 (1.0.0), cl_intel_subgroups_short 0x400000 (1.0.0), cl_ext_float_atomics 0x400000 (1.0.0), cl_intel_required_subgroup_size 0x400000 (1.0.0), cl_khr_fp16 0x400000 (1.0.0), cl_khr_fp64 0x400000 (1.0.0), cl_khr_int64_base_atomics 0x400000 (1.0.0), cl_khr_int64_extended_atomics 0x400000 (1.0.0) | cl_khr_byte_addressable_store 0x400000 (1.0.0), cl_khr_global_int32_base_atomics 0x400000 (1.0.0), cl_khr_global_int32_extended_atomics 0x400000 (1.0.0), cl_khr_local_int32_base_atomics 0x400000 (1.0.0), cl_khr_local_int32_extended_atomics 0x400000 (1.0.0), cl_khr_int64_base_atomics 0x400000 (1.0.0), cl_khr_int64_extended_atomics 0x400000 (1.0.0), cl_nv_device_attribute_query 0x400000 (1.0.0), cl_ext_float_atomics 0x400000 (1.0.0), cl_khr_fp16 0x400000 (1.0.0), cl_khr_fp64 0x400000 (1.0.0), cl_ext_buffer_device_address 0x1000 (0.1.0), cl_khr_subgroup_ballot 0x400000 (1.0.0), cl_khr_subgroup_shuffle 0x400000 (1.0.0) |
| Device Topology (NV) | (n/a) | PCI-E, 0000:65:00.0 |
| Compute Capability (NV) | (n/a) | 8.6 |
| Registers per block (NV) | (n/a) | 65536 |
| Warp size (NV) | (n/a) | 32 |
| Integrated memory (NV) | (n/a) | No |
| Kernel execution timeout (NV) | (n/a) | Yes |
| Concurrent copy and kernel execution (NV) | (n/a) | Yes |
| Number of async copy engines (NV) | (n/a) | 5 |
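
For anyone who wants to build a similar side-by-side comparison from their own clinfo output, here is a rough R sketch. It is not how the table above was made. The function name parse_clinfo is hypothetical, and the code assumes that top-level properties are indented by at most two spaces and that the key and value columns are separated by two or more spaces. Nested sub-properties and multi-line continuation values are skipped to keep it short.

```r
# Sketch: turn clinfo-style text into a property-by-device comparison table.
# Assumptions (not from the thread): top-level properties are indented by at
# most two spaces, and key and value are separated by two or more spaces.
parse_clinfo <- function(lines) {
  # Ignore the trailing "NULL platform behavior" diagnostics, if present.
  cut <- grep("^NULL platform behavior", lines)
  if (length(cut) > 0) lines <- lines[seq_len(cut[1] - 1)]

  # Keep top-level "key   value" lines; skip nested and continuation lines.
  keep <- grepl("^\\s{0,2}\\S", lines, perl = TRUE) &
    grepl("\\S\\s{2,}\\S", lines, perl = TRUE)
  lines <- trimws(lines[keep])

  key   <- sub("\\s{2,}.*$", "", lines, perl = TRUE)
  value <- sub("^.*?\\s{2,}", "", lines, perl = TRUE)

  # Each "Device Name" line opens a new device section; anything before the
  # first one (platform information) gets device == 0 and is dropped.
  device <- cumsum(key == "Device Name")
  long <- data.frame(device, key, value, stringsAsFactors = FALSE)
  long <- long[long$device > 0, ]
  if (nrow(long) == 0) return(data.frame(property = character(0)))

  # One column per device; properties a device does not report become NA.
  props <- unique(long$key)
  cols <- lapply(split(long, long$device),
                 function(d) d$value[match(props, d$key)])
  wide <- do.call(cbind, cols)
  colnames(wide) <- wide[match("Device Name", props), ]
  data.frame(property = props, wide, check.names = FALSE,
             stringsAsFactors = FALSE)
}

# Small excerpt in clinfo layout, using values reported in this thread.
sample_lines <- c(
  "  Device Name                                     cpu-cascadelake-Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz",
  "  Device Type                                     CPU",
  "  Max compute units                               36",
  "  Device Name                                     NVIDIA GeForce RTX 3060",
  "  Device Type                                     GPU",
  "  Max compute units                               28",
  "  Compute Capability (NV)                         8.6"
)
tab <- parse_clinfo(sample_lines)
print(tab)
```

On a machine where clinfo is on the PATH, the real output could be passed in with parse_clinfo(system2("clinfo", stdout = TRUE)).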

I’m out of ideas, sorry

No problem at all. Thank you for all your ideas so far. If you think of anything else, I would appreciate your future comments. I apologise for the multiple exchanges.
