Issue using external GPU with cmdstanr and OpenCL

Hi all! I am trying to use OpenCL to run cmdstanr models on an AMD Radeon Pro 580 GPU. I am on macOS 10.15.7 (Catalina).

I installed CmdStan successfully, and check_cmdstan_toolchain() reports that the toolchain is set up correctly.

However, when I try to fit an example model on the GPU, it fails. Here is a simple example (from the cmdstanr vignette "Running Stan on the GPU with OpenCL"):

data {
  int<lower=1> k;
  int<lower=0> n;
  matrix[n, k] X;
  array[n] int y;
}
parameters {
  vector[k] beta;
  real alpha;
}
model {
  target += std_normal_lpdf(beta);
  target += std_normal_lpdf(alpha);
  target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
}

And some generated data:

# Generate some fake data
n <- 250000
k <- 20
X <- matrix(rnorm(n * k), ncol = k)
y <- rbinom(n, size = 1, prob = plogis(3 * X[,1] - 2 * X[,2] + 1))
mdata <- list(k = k, n = n, y = y, X = X)

It runs just fine on the CPU using this code:

# no OpenCL version
library(cmdstanr)
mod <- cmdstan_model("bernoulli_logit_glm.stan")
fit_cpu <- mod$sample(data = mdata, chains = 4, parallel_chains = 4, refresh = 0)

However, when I try to run using OpenCL, it fails:

# OpenCL version: recompile the model with OpenCL support enabled
mod_cl <- cmdstan_model("bernoulli_logit_glm.stan",
                        cpp_options = list(stan_opencl = TRUE), force_recompile = TRUE)
# opencl_ids = c(platform_id, device_id)
fit_cl <- mod_cl$sample(data = mdata, chains = 4, parallel_chains = 4,
                        opencl_ids = c(0, 1), refresh = 0)

This is the output I get:

# Running MCMC with 4 parallel chains...

# Chain 1 cvmsBuildComputeProgram No CVMS service
# Chain 1 Unrecoverable error evaluating the log probability at the initial value.
# Chain 1 Exception: compile_kernel: calculate : Unknown error -11 (in '/var/folders/ch/qb5870nx4r31bjqvyjr656pr0000gp/T/Rtmph5Fw5s/model-73092c28f33c.stan', line 14, column 2 to column 57)
# Chain 2 cvmsBuildComputeProgram No CVMS service
# Chain 2 Unrecoverable error evaluating the log probability at the initial value.
# Chain 2 Exception: compile_kernel: calculate : Unknown error -11 (in '/var/folders/ch/qb5870nx4r31bjqvyjr656pr0000gp/T/Rtmph5Fw5s/model-73092c28f33c.stan', line 14, column 2 to column 57)
# Chain 3 cvmsBuildComputeProgram No CVMS service
# Chain 3 Unrecoverable error evaluating the log probability at the initial value.
# Chain 3 Exception: compile_kernel: calculate : Unknown error -11 (in '/var/folders/ch/qb5870nx4r31bjqvyjr656pr0000gp/T/Rtmph5Fw5s/model-73092c28f33c.stan', line 14, column 2 to column 57)
# Chain 4 cvmsBuildComputeProgram No CVMS service
# Chain 4 Unrecoverable error evaluating the log probability at the initial value.
# Chain 4 Exception: compile_kernel: calculate : Unknown error -11 (in '/var/folders/ch/qb5870nx4r31bjqvyjr656pr0000gp/T/Rtmph5Fw5s/model-73092c28f33c.stan', line 14, column 2 to column 57)
# Warning: Chain 1 finished unexpectedly!

# Warning: Chain 2 finished unexpectedly!

# Warning: Chain 3 finished unexpectedly!

# Warning: Chain 4 finished unexpectedly!

# Warning: Use read_cmdstan_csv() to read the results of the failed chains.
# Warning messages:
# 1: All chains finished unexpectedly! Use the $output(chain_id) method for more information.

# 2: No chains finished successfully. Unable to retrieve the fit.

When I google “cvmsBuildComputeProgram No CVMS service” I get no results, so I am at a loss. I believe I’m specifying the GPU correctly: I checked the platform and device IDs, and if I set other combos of platform & device IDs I get errors saying the devices don’t exist. Has anyone had a similar issue or found a fix? Thanks in advance!!
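For reference, the "Unknown error -11" in the exception is a raw OpenCL status code. In the Khronos OpenCL header (cl.h), -11 is CL_BUILD_PROGRAM_FAILURE, which would suggest the kernel failed to build on the device rather than the device ID being wrong. Below is a minimal sketch of a lookup for a few of these codes. The helper name decode_opencl_status is made up for illustration, and the table is only a small subset of the codes in the header.

```r
# Small subset of status codes from the Khronos OpenCL header (cl.h).
opencl_status <- c(
  "0"   = "CL_SUCCESS",
  "-1"  = "CL_DEVICE_NOT_FOUND",
  "-2"  = "CL_DEVICE_NOT_AVAILABLE",
  "-3"  = "CL_COMPILER_NOT_AVAILABLE",
  "-5"  = "CL_OUT_OF_RESOURCES",
  "-6"  = "CL_OUT_OF_HOST_MEMORY",
  "-11" = "CL_BUILD_PROGRAM_FAILURE"
)

# Hypothetical helper (not part of cmdstanr): map a numeric status code
# to its symbolic name, or report that it is outside this subset.
decode_opencl_status <- function(code) {
  name <- opencl_status[as.character(code)]
  if (is.na(name)) "unknown (not in this subset)" else unname(name)
}

# The code reported by each chain in the output above.
decode_opencl_status(-11)
```

This only translates the number; it does not explain why the build failed on this machine.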

Just checking: you’re using a version of Stan that is 2.18 or newer, right? Before that, bernoulli_logit_glm_lpmf couldn’t take an array.

Yep, I’m using Stan 2.3. The CPU version runs, so there is no issue with the Stan model itself.
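A version check like this can also be scripted. The sketch below uses base R's numeric_version for the comparison; the helper name is_at_least is made up for illustration. In a real session the installed version string could come from cmdstanr::cmdstan_version(), which is left as a comment here so the snippet runs without CmdStan. Note that version components are compared as integers, so "2.3" and "2.30" are different versions.

```r
# Hypothetical helper: TRUE if `version` is at least `minimum`.
# Components are compared as integers, left to right.
is_at_least <- function(version, minimum) {
  numeric_version(version) >= numeric_version(minimum)
}

# In a live session one might use: installed <- cmdstanr::cmdstan_version()
is_at_least("2.30.1", "2.18.0")  # 30 >= 18 in the second component
is_at_least("2.3", "2.18")       # 3 < 18 in the second component
```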

And just checking one more thing:

“As of version 2.26.1, users can expect speedups with OpenCL when using vectorized probability distribution functions (functions with the _lpdf or _lpmf suffix) and when the input variables contain at least 20,000 elements.”

Do your input variables meet the 20,000 elements requirement?

Oh, I did not know that!!! Thanks for pointing that out :). In this particular example there are 250,000 data points.
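To double-check the sizes, here is a rough sketch that counts the elements in each input and compares them to the 20,000 figure from the quote. The helper name input_sizes is made up, and the data are zero-filled placeholders with the same shapes as the fake data in my first post (n = 250000, k = 20), not the actual simulated values.

```r
# Hypothetical helper: count elements per input and flag those at or
# above a threshold (20,000 is the figure quoted from the docs).
input_sizes <- function(data, threshold = 20000) {
  sizes <- vapply(data, length, integer(1))
  data.frame(input = names(sizes), elements = sizes,
             meets_threshold = sizes >= threshold, row.names = NULL)
}

# Placeholder data with the same shapes as the example above.
n <- 250000
k <- 20
mdata <- list(k = k, n = n, y = integer(n),
              X = matrix(0, nrow = n, ncol = k))

input_sizes(mdata)
```

With these shapes, y has 250,000 elements and X has n * k = 5,000,000, so both are well above the threshold.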

@rok_cesnovar