Thanks for tagging @torkar
The instructions you linked are intended for working with OpenCL in C++ in Stan Math (there is much more OpenCL-related functionality implemented in Stan Math, and most of it has not made it to Stan yet).
I think that is probably the first thing that comes up if you search for "Stan GPU" on Google.
Instructions for CmdStan and OpenCL are now available here: 14 Parallelization | CmdStan User’s Guide
They are very new and thus probably have not been indexed by searches yet.
Here is a cmdstanr example:
library(cmdstanr)
generator = function(seed = 0, n = 1000, k = 10) {
  set.seed(seed)
  X <- matrix(rnorm(n * k), ncol = k)
  y <- 3 * X[, 1] - 2 * X[, 2] + 1
  y <- ifelse(runif(n) < 1 / (1 + exp(-y)), 1, 0)
  list(k = ncol(X), n = nrow(X), y = y, X = X)
}
data <- generator(1, 100000, 20)
# we will write the data to a file ourselves
# so we don't do it twice for the GPU and CPU versions
data_file <- paste0(tempfile(), ".json")
write_stan_json(data, data_file)
opencl_options = list(
  stan_opencl = TRUE,
  opencl_platform_id = 0,
  opencl_device_id = 0 # in your case it's 1 here
)
model_code <- "
data {
  int<lower=1> k;
  int<lower=0> n;
  matrix[n, k] X;
  int y[n];
}
parameters {
  vector[k] beta;
  real alpha;
}
model {
  target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
}
"
stan_file <- write_stan_file(model_code)
mod <- cmdstan_model(stan_file)
mod_cl <- cmdstan_model(stan_file, cpp_options = opencl_options)
fit <- mod$sample(
  data = data_file, iter_sampling = 500, iter_warmup = 500,
  chains = 4, parallel_chains = 4, refresh = 0
)
fit_cl <- mod_cl$sample(
  data = data_file, iter_sampling = 500, iter_warmup = 500,
  chains = 4, parallel_chains = 4, refresh = 0
)
We get the following:
CPU
Running MCMC with 4 parallel chains...
Chain 1 finished in 104.1 seconds.
Chain 3 finished in 104.4 seconds.
Chain 2 finished in 104.9 seconds.
Chain 4 finished in 104.7 seconds.
All 4 chains finished successfully.
Mean chain execution time: 104.5 seconds.
Total execution time: 105.7 seconds.
GPU
Running MCMC with 4 parallel chains...
Chain 3 finished in 15.6 seconds.
Chain 1 finished in 15.7 seconds.
Chain 2 finished in 15.7 seconds.
Chain 4 finished in 16.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 15.7 seconds.
Total execution time: 17.0 seconds.
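To put the two runs side by side, here is a small sketch that only does arithmetic on the timings printed above (the variable names are mine, nothing here calls Stan). On this machine it works out to roughly a 6 to 7 times speedup; your numbers will depend on your CPU, GPU and data size.

```r
# Timings copied from the output above (seconds).
cpu <- c(mean_chain = 104.5, total = 105.7)
gpu <- c(mean_chain = 15.7, total = 17.0)

# Ratio of CPU time to GPU time; values above 1 mean the GPU run was faster.
speedup <- cpu / gpu
print(round(speedup, 1))
```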
In this example devices are selected at compile time.
CmdStan 2.26 also supports runtime selection of devices, but that has not made it to cmdstanr yet (hopefully it will this week). I will also add a vignette with exactly this example.
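As a side note on the simulated data: the generator uses an intercept of 1 and coefficients 3 and -2 on the first two columns of X, with all other coefficients equal to zero, so that is roughly what alpha and beta should come out as. The sketch below is a quick sanity check with base R's glm (maximum likelihood, not Stan, and not part of the benchmark), on a smaller data set than above so it runs quickly. The names small, check and est are just for illustration.

```r
# Same data generator as in the example above.
generator <- function(seed = 0, n = 1000, k = 10) {
  set.seed(seed)
  X <- matrix(rnorm(n * k), ncol = k)
  y <- 3 * X[, 1] - 2 * X[, 2] + 1
  y <- ifelse(runif(n) < 1 / (1 + exp(-y)), 1, 0)
  list(k = ncol(X), n = nrow(X), y = y, X = X)
}

# Smaller data set than in the benchmark (assumption: 5000 rows, 5 columns).
small <- generator(seed = 1, n = 5000, k = 5)

# Logistic regression by maximum likelihood; the first coefficient is the
# intercept (alpha in the Stan model), the rest correspond to beta.
check <- glm(small$y ~ small$X, family = binomial())
est <- unname(coef(check))
print(round(est, 2))
```

If the estimates land near 1, 3, -2 and then zeros, the data are doing what the model expects, and the Stan posterior means for alpha and beta should sit in the same place for both the CPU and the GPU fit.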