Help setting up for GPU computation (OSX)

I’m trying to set myself up to give Stan a spin on the GPU. I’ve tried to follow instructions, but I think I’m stuck. I’ve checked the output of clinfo -l :

Platform #0: Apple
 +-- Device #0: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
 `-- Device #1: AMD Radeon Pro 580 Compute Engine

From this I understand that I need to put the following in a text file named local inside a directory called make at the top level of the math library.


I’ve installed CmdStan 2.26.0 using cmdstanr::install_cmdstan(). My best guess for what is the “top level of the math library” is /Users/jacobsocolar/.cmdstanr/cmdstan-2.26.0/stan/lib/stan_math. Does this look right? So I now have a text file at /Users/jacobsocolar/.cmdstanr/cmdstan-2.26.0/stan/lib/stan_math/make/local that contains:


Does this look right as well?
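(For context, a make/local matching the clinfo output above would typically look like the following, per the Stan OpenCL instructions — platform 0 for Apple, device 1 for the AMD Radeon Pro 580. This is a sketch of the expected contents, not a copy of the actual file.)

```
STAN_OPENCL=true
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=1
```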
So then I try to figure out whether this GPU thing is working. In terminal I can run, for example,

cd /Users/jacobsocolar/.cmdstanr/cmdstan-2.26.0/stan/lib/stan_math/
python runTests.py test/unit -f opencl

but I’m not too clear on what I should be looking for in the big text dump that comes out.
On the other hand, I’ve taken the logistic regression example from the GPU support for Stan paper:

data {
  int<lower=1> k;
  int<lower=0> n;
  matrix[n, k] X;
  int y[n];
}
parameters {
  vector[k] beta;
  real alpha;
}
model {
  target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
}

And in R I run:

n <- 1e+6
k <- 10

X <- matrix(rnorm(n*k), nrow=n)
mu <- 3*X[,1] - 2*X[,2] + 1
y <- rbinom(n, 1, 1/(1+exp(-mu)))

stan_data <- list(k=k, n=n, X=X, y=y)

gpu_test_mod <- cmdstan_model("/Users/jacobsocolar/Desktop/gpu_logistic_test.stan", force_recompile = T)
test_sampling <- gpu_test_mod$sample(data=stan_data, chains=3, parallel_chains = 3)

And I get execution times for the $sample line on the order of 1300 seconds, which is a lot longer than I expected if this were actually running on the GPU.

Am I missing something?

I think @rok_cesnovar is the one who can answer this :)


Thanks for tagging @torkar

The instructions you linked are intended for working with OpenCL stuff in C++ in Stan Math (there is much more OpenCL-related functionality implemented in Stan Math, and most of it has not made it to Stan yet).
I think that is probably the first thing linked if you search for Stan GPU on Google.

Instructions for CmdStan and OpenCL are now available here: 14 Parallelization | CmdStan User’s Guide

They are very new and thus probably have not been indexed by search engines yet.

Here is a cmdstanr example:


library(cmdstanr)

generator <- function(seed = 0, n = 1000, k = 10) {
  set.seed(seed)
  X <- matrix(rnorm(n * k), ncol = k)
  y <- 3 * X[,1] - 2 * X[,2] + 1
  y <- ifelse(runif(n) < 1 / (1 + exp(-y)), 1, 0)
  list(k = ncol(X), n = nrow(X), y = y, X = X)
}

data <- generator(1, 100000, 20)

# we write the data to a file ourselves
# so we don't do it twice for the GPU and CPU versions
data_file <- paste0(tempfile(), ".json")
write_stan_json(data, data_file)

opencl_options <- list(
  stan_opencl = TRUE,
  opencl_platform_id = 0,
  opencl_device_id = 0 # in your case this is 1
)

model_code <- "
data {
  int<lower=1> k;
  int<lower=0> n;
  matrix[n, k] X;
  int y[n];
}
parameters {
  vector[k] beta;
  real alpha;
}
model {
  target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
}
"
stan_file <- write_stan_file(model_code)

mod <- cmdstan_model(stan_file)
mod_cl <- cmdstan_model(stan_file, cpp_options = opencl_options)

fit <- mod$sample(data = data_file, iter_sampling = 500, iter_warmup = 500, chains = 4, parallel_chains = 4, refresh = 0)
fit_cl <- mod_cl$sample(data = data_file, iter_sampling=500, iter_warmup = 500, chains = 4, parallel_chains = 4, refresh = 0)

We get the following:


Running MCMC with 4 parallel chains...

Chain 1 finished in 104.1 seconds.
Chain 3 finished in 104.4 seconds.
Chain 2 finished in 104.9 seconds.
Chain 4 finished in 104.7 seconds.

All 4 chains finished successfully.
Mean chain execution time: 104.5 seconds.
Total execution time: 105.7 seconds.

Running MCMC with 4 parallel chains...


Chain 3 finished in 15.6 seconds.
Chain 1 finished in 15.7 seconds.
Chain 2 finished in 15.7 seconds.
Chain 4 finished in 16.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 15.7 seconds.
Total execution time: 17.0 seconds.
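To put a number on the speedup, you can also query the timings directly (assuming cmdstanr's $time() method; fit and fit_cl as in the code above):

```r
# ratio of total CPU wall time to total OpenCL wall time
fit$time()$total / fit_cl$time()$total
```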

In this example devices are selected at compile time.

CmdStan 2.26 also supports runtime selection of devices, but that has not made it to cmdstanr yet (hopefully it will this week). We will also add a vignette with exactly this example.
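In the meantime, runtime selection can be tried straight from CmdStan; the invocation looks roughly like this (the model and data file names are placeholders for your own):

```shell
# model compiled with STAN_OPENCL=true; pick platform/device at run time
./gpu_logistic_test sample data file=data.json opencl platform=0 device=1
```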


Also, I have to note that the instructions lack details for macOS.

That is primarily because, AFAIK, Apple dropped OpenCL/CUDA support a while back in favor of their own compute language, Metal (2). My impression was that OpenCL thus does not work on Macs at all anymore. Are you using an older macOS version?

If this works for you we may need to add those instructions as well.


Thanks so much @rok_cesnovar!
On macOS 10.14.6 (Mojave), I have been able to get up and running on the GPU. I was able to follow the instructions more or less exactly. The speedups on my system are less impressive than on yours, but still substantial (c. 90 seconds on the CPU vs. c. 48 seconds on the GPU).

In case it helps anybody else, there were just two minor extra bits that I needed to do. First, to run clinfo -l, I had to brew install clinfo. Second, when I first attempted to compile a model with cpp_options = opencl_options, I got the error:
error: definition of macro 'OPENCL_DEVICE_ID' differs between the precompiled header ('0') and the command line ('1')

I guessed that this might have resulted from my previous don't-know-what-I'm-doing tinkering with the various make/local files in (subdirectories of) the .cmdstanr directory, so I manually deleted /Users/jacobsocolar/.cmdstanr/cmdstan-2.26.0 and reinstalled CmdStan from R with cmdstanr::install_cmdstan().

At that point, it all worked like a charm.

Editing to also add:
For anyone trying this: on my system, when the model is running on the GPU, there is a huge and very obvious spike in GPU usage. This feels like a no-brainer, but as somebody who isn't really familiar with the capabilities of my GPU, or with how much of the GPU Stan can use, I had previously been squinting at the GPU usage during model runs to try to discern whether anything was happening. You won't need to squint; you'll know it when you see it (with this example, anyway).


Oh yes, on Mojave it could work. Thanks!