I want to fit a latent-variable Gaussian process (GP) model in cmdstanr
with GPU acceleration.
However, the execution time does not change whether or not I enable OpenCL
with cpp_options = list(stan_opencl = TRUE).
My Stan program:
data {
  int<lower=1> N;
  array[N] real X;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
  real b0;
}
transformed parameters {
  vector[N] f;
  vector[N] fs;
  matrix[N, N] L_K;
  matrix[N, N] K;
  profile("gp_exp_quad_cov") {
    K = gp_exp_quad_cov(X, alpha, rho);
  }
  profile("add diagonal elements") {
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }
  }
  profile("cholesky_decompose") {
    L_K = cholesky_decompose(K);
  }
  f = L_K * eta;
  fs = f + b0;
}
model {
  profile("priors") {
    target += inv_gamma_lupdf(rho | 1.0, 1.0);
    target += inv_gamma_lupdf(alpha | 1.0, 1.0);
    target += inv_gamma_lupdf(sigma | 1.0, 1.0);
  }
  profile("likelihood_eta") {
    target += std_normal_lupdf(eta);
  }
  profile("likelihood_y") {
    target += std_normal_lupdf((y - fs) / sigma) - N * log(sigma);
  }
}
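Written out as math, the program above corresponds roughly to the following model. This is only a sketch: the kernel is the standard squared-exponential form that gp_exp_quad_cov computes, sigma is a standard deviation, and the index notation and the identity matrix I_N are notation introduced here for illustration. No explicit prior is placed on b0.

```latex
\begin{aligned}
K_{ij} &= \alpha^2 \exp\!\left(-\frac{(x_i - x_j)^2}{2\rho^2}\right)
          + \delta\,\mathbb{1}[i = j], \qquad \delta = 10^{-9},\\
K &= L_K L_K^{\top} \quad \text{(Cholesky factorization)},\\
\eta &\sim \mathcal{N}(0, I_N),\\
f &= L_K\,\eta, \qquad f^{s} = f + b_0,\\
y_i &\sim \mathcal{N}\!\left(f^{s}_i,\ \sigma\right), \qquad i = 1, \dots, N,\\
\rho,\ \alpha,\ \sigma &\sim \text{InvGamma}(1, 1).
\end{aligned}
```

The Cholesky factorization of the N x N matrix K is the step I expected OpenCL to accelerate.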
My R program (the OpenCL variant is shown commented out):
rm(list = ls())
library(cmdstanr)
library(posterior)
library(bayesplot)
library(MASS)
library(tidyverse)
library(gridExtra)
library(latex2exp)
color_scheme_set("brightblue")
set.seed(100)

## ground truth function
gt <- function(x) {
  return(0.2 * x * sin(x))
}

## data generation
N <- 100
X <- runif(N, -10.0, 10.0)
X <- sort(X)
epsilon <- rnorm(N)
y <- gt(X) + epsilon

## data list for stan (zeros and ones are not declared in the Stan data block)
dat <- list(N = N,
            X = X,
            y = y,
            zeros = rep(as.integer(0), N),
            ones = rep(as.integer(1), N))

FG_file <- "./normal_GP_v1.1.stan"
# OpenCL build:
# FG_mod <- cmdstan_model(FG_file,
#                         cpp_options = list(stan_opencl = TRUE))
# CPU build:
FG_mod <- cmdstan_model(FG_file)

FG_fit <- FG_mod$sample(
  data = dat,
  seed = 100,
  chains = 1,
  parallel_chains = 1,
  refresh = 10,
  iter_warmup = 1000,
  save_warmup = TRUE,
  iter_sampling = 1000,
  init = function(chain_id) {
    list(rho = 1.3, alpha = 1.4, sigma = 1.0)
  }
  # For the OpenCL build, also pass: opencl_ids = c(0, 0)
)
FG_fit$profiles()
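For reference, the comparison between the two builds can be sketched in a single script as below. It assumes the same Stan file and the data list dat defined above, and an OpenCL device at platform 0, device 0. The force_recompile = TRUE argument is not in my original script; it is added here only as a precaution so that each build is compiled fresh instead of reusing an existing executable.

```r
library(cmdstanr)

# Assumption: `dat` and the Stan file are the ones defined above.
FG_file <- "./normal_GP_v1.1.stan"

# CPU build: recompile explicitly so no earlier executable is reused.
mod_cpu <- cmdstan_model(FG_file, force_recompile = TRUE)
fit_cpu <- mod_cpu$sample(data = dat, seed = 100, chains = 1, refresh = 0)
prof_cpu <- fit_cpu$profiles()[[1]]

# OpenCL build: same file, compiled with STAN_OPENCL enabled.
mod_cl <- cmdstan_model(FG_file,
                        cpp_options = list(stan_opencl = TRUE),
                        force_recompile = TRUE)
fit_cl <- mod_cl$sample(data = dat, seed = 100, chains = 1, refresh = 0,
                        opencl_ids = c(0, 0))  # platform id, device id
prof_cl <- fit_cl$profiles()[[1]]

# Side-by-side total time per profiled block.
merge(prof_cpu[, c("name", "total_time")],
      prof_cl[, c("name", "total_time")],
      by = "name", suffixes = c("_cpu", "_opencl"))
```

The CPU fit is run before the OpenCL model is compiled because both builds write their executable to the same path.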
Stan profile with CPU:
> FG_fit$profiles()
[[1]]
                   name       thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 add diagonal elements 140463246817088  0.0726325   0.05429360   0.01833880     7356900              0          73569              2001
2    cholesky_decompose 140463246817088 13.2683000   6.36483000   6.90352000       73569      371597019          73569              2001
3       gp_exp_quad_cov 140463246817088  3.3639100   3.03061000   0.33329900       73569      371523450          73569              2001
4        likelihood_eta 140463246817088  0.0262805   0.02086770   0.00541279       73569              0          73569                 1
5          likelihood_y 140463246817088  0.1084100   0.07934640   0.02906400      441414       14713800          73569                 1
6                priors 140463246817088  0.0110996   0.00847273   0.00262684      220707              0          73569                 1
Stan profile with the OpenCL option enabled:
> FG_fit$profiles()
[[1]]
                   name       thread_id  total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 add diagonal elements 140091827372096  0.07028730   0.05407860   0.01620870     7356900              0          73569              2001
2    cholesky_decompose 140091827372096 13.23340000   6.34010000   6.89329000       73569      371597019          73569              2001
3       gp_exp_quad_cov 140091827372096  3.51115000   3.17693000   0.33421600       73569      371523450          73569              2001
4        likelihood_eta 140091827372096  0.02631320   0.02062590   0.00568732       73569              0          73569                 1
5          likelihood_y 140091827372096  0.10676300   0.07918980   0.02757290      441414       14713800          73569                 1
6                priors 140091827372096  0.00945455   0.00786438   0.00159018      220707              0          73569                 1
I don't see any acceleration from OpenCL in the Cholesky decomposition step: cholesky_decompose takes about 13.2 s in both runs. I have tested the "opencl-files/bernoulli_logit_glm.stan" file from the cmdstanr vignette "Running Stan on the GPU with OpenCL", and that model does speed up when I enable the OpenCL option.
By checking nvidia-smi, I confirmed that the Stan OpenCL program has been loaded onto the GPU (see the ./normal_GP_v1.1 process in the last row).
(base) user@PC:~$ nvidia-smi
Thu Jun 9 11:45:16 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   33C    P2   105W / 350W |   1681MiB / 24265MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1230      G   /usr/lib/xorg/Xorg                 84MiB |
|    0   N/A  N/A     27336      G   /usr/lib/xorg/Xorg                462MiB |
|    0   N/A  N/A     27455      G   /usr/bin/gnome-shell              138MiB |
|    0   N/A  N/A     27671      G   ...6_64.v03.00.0074.AppImage       13MiB |
|    0   N/A  N/A     29345      G   ...328516174520015287,131072      417MiB |
|    0   N/A  N/A     31423      G   /usr/lib/rstudio/bin/rstudio      239MiB |
|    0   N/A  N/A    113308      C   ./normal_GP_v1.1                  251MiB |
+-----------------------------------------------------------------------------+
CmdStan version, as reported by cmdstanr:
> cmdstan_version()
[1] "2.29.2"
Any suggestions? Thank you!