How to enable GPU acceleration for Cholesky decomposition in the latent variable GP model?

I want to fit a latent variable Gaussian process (GP) model in cmdstanr with GPU acceleration.

However, the execution time does not change whether or not I enable the OpenCL option with cpp_options = list(stan_opencl = TRUE).


My Stan program:

data {
  int<lower=1> N;
  array[N] real X;
  vector[N] y;
}
transformed data {
  real delta = 1e-9;
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  vector[N] eta;
  real b0;
}
transformed parameters {
  vector[N] f;
  vector[N] fs;
  matrix[N, N] L_K;
  matrix[N, N] K;
  profile("gp_exp_quad_cov") {
    K = gp_exp_quad_cov(X, alpha, rho);
  }
  profile("add diagonal elements") {
    for (n in 1:N) {
      K[n, n] = K[n, n] + delta;
    }
  }
  profile("cholesky_decompose") {
    L_K = cholesky_decompose(K);
  }
  f = L_K * eta;
  fs = f + b0;
}
model {
  profile("priors") {
    target += inv_gamma_lupdf(rho | 1.0, 1.0);
    target += inv_gamma_lupdf(alpha | 1.0, 1.0);
    target += inv_gamma_lupdf(sigma | 1.0, 1.0);
  }
  profile("likelihood_eta") {
    target += std_normal_lupdf(eta);
  }
  profile("likelihood_y") {
    target += std_normal_lupdf((y - fs)/sigma) - N*log(sigma);
  }
}

My R program:

rm(list=ls())
library(cmdstanr)
library(posterior)
library(bayesplot)
library(MASS)
library(tidyverse)
library(gridExtra)
library(latex2exp)
color_scheme_set("brightblue")

set.seed(100)

## ground truth function
gt <- function(x) {
  return(0.2*x*sin(x))
}

## data generation
N <- 100
X <- runif(N,-10.0,10.0)
X <- sort(X)
epsilon <- rnorm(N)
y <- gt(X) + epsilon

## data list for stan
dat <- list(N = N,
            X = X,
            y = y,
            zeros = rep(as.integer(0),N),
            ones = rep(as.integer(1),N))

FG_file <- "./normal_GP_v1.1.stan"
# FG_mod <- cmdstan_model(FG_file,
#                         cpp_options = list(stan_opencl = TRUE))
FG_mod <- cmdstan_model(FG_file)

FG_fit <- FG_mod$sample(
  data = dat,
  seed = 100,
  chains = 1,
  parallel_chains = 1,
  refresh = 10,
  iter_warmup = 1000,
  save_warmup = TRUE,
  iter_sampling = 1000,
  init = function(chain_id) {
    list(rho = 1.3,
         alpha = 1.4,
         sigma = 1.0)
  }
  # , opencl_ids = c(0, 0)
)

FG_fit$profiles()

Stan profile with CPU:

> FG_fit$profiles()
[[1]]
                   name       thread_id total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 add diagonal elements 140463246817088  0.0726325   0.05429360   0.01833880     7356900              0          73569              2001
2    cholesky_decompose 140463246817088 13.2683000   6.36483000   6.90352000       73569      371597019          73569              2001
3       gp_exp_quad_cov 140463246817088  3.3639100   3.03061000   0.33329900       73569      371523450          73569              2001
4        likelihood_eta 140463246817088  0.0262805   0.02086770   0.00541279       73569              0          73569                 1
5          likelihood_y 140463246817088  0.1084100   0.07934640   0.02906400      441414       14713800          73569                 1
6                priors 140463246817088  0.0110996   0.00847273   0.00262684      220707              0          73569                 1


Stan profile with the OpenCL option enabled:

> FG_fit$profiles()
[[1]]
                   name       thread_id  total_time forward_time reverse_time chain_stack no_chain_stack autodiff_calls no_autodiff_calls
1 add diagonal elements 140091827372096  0.07028730   0.05407860   0.01620870     7356900              0          73569              2001
2    cholesky_decompose 140091827372096 13.23340000   6.34010000   6.89329000       73569      371597019          73569              2001
3       gp_exp_quad_cov 140091827372096  3.51115000   3.17693000   0.33421600       73569      371523450          73569              2001
4        likelihood_eta 140091827372096  0.02631320   0.02062590   0.00568732       73569              0          73569                 1
5          likelihood_y 140091827372096  0.10676300   0.07918980   0.02757290      441414       14713800          73569                 1
6                priors 140091827372096  0.00945455   0.00786438   0.00159018      220707              0          73569                 1

I don't see any acceleration from OpenCL in the Cholesky decomposition step. I have tested the "opencl-files/bernoulli_logit_glm.stan" file from the cmdstanr vignette "Running Stan on the GPU with OpenCL", and that model does speed up when I enable the OpenCL option.
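To put a number on this, the total_time columns of the two profiles can be compared row by row. The following is only a minimal sketch in base R: the data frames `cpu` and `opencl` and the `speedup` column are names made up for this illustration, and the values are copied by hand from the two tables above rather than read from a fit object.

```r
# total_time (seconds) copied by hand from the two profile tables above
cpu <- data.frame(
  name = c("add diagonal elements", "cholesky_decompose", "gp_exp_quad_cov",
           "likelihood_eta", "likelihood_y", "priors"),
  total_time = c(0.0726325, 13.2683, 3.36391, 0.0262805, 0.10841, 0.0110996)
)
opencl <- data.frame(
  name = cpu$name,
  total_time = c(0.0702873, 13.2334, 3.51115, 0.0263132, 0.106763, 0.00945455)
)

# Join the two runs on the profile name and compute CPU time / OpenCL time;
# a value near 1 means the OpenCL build made no difference for that block
cmp <- merge(cpu, opencl, by = "name", suffixes = c("_cpu", "_opencl"))
cmp$speedup <- cmp$total_time_cpu / cmp$total_time_opencl

# Show the most expensive blocks first
print(cmp[order(-cmp$total_time_cpu), ], digits = 4)
```

With these numbers the ratio is about 1.00 for cholesky_decompose and about 0.96 for gp_exp_quad_cov, so neither block got faster. With live fits, the same comparison should work on the first list element returned by $profiles() for each run, since that is the data frame printed above.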


The nvidia-smi output confirms that the OpenCL-enabled Stan program has been loaded onto the GPU (see the ./normal_GP_v1.1 process in the last row).

(base) user@PC:~$ nvidia-smi
Thu Jun  9 11:45:16 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   33C    P2   105W / 350W |   1681MiB / 24265MiB |      5%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1230      G   /usr/lib/xorg/Xorg                 84MiB |
|    0   N/A  N/A     27336      G   /usr/lib/xorg/Xorg                462MiB |
|    0   N/A  N/A     27455      G   /usr/bin/gnome-shell              138MiB |
|    0   N/A  N/A     27671      G   ...6_64.v03.00.0074.AppImage       13MiB |
|    0   N/A  N/A     29345      G   ...328516174520015287,131072      417MiB |
|    0   N/A  N/A     31423      G   /usr/lib/rstudio/bin/rstudio      239MiB |
|    0   N/A  N/A    113308      C   ./normal_GP_v1.1                  251MiB |
+-----------------------------------------------------------------------------+


cmdstanr information:

> cmdstan_version()
[1] "2.29.2"

Any suggestions? Thank you!


Did you find a solution to your issue?
We currently have a very similar problem: our model spends a very long time in the cholesky_decompose and gp_exp_quad_cov functions, so we wanted to move it to a GPU, but the code still seems to run entirely on the CPU.

Hi, the OpenCL backend was revamped a bit, and because of that the GPU implementation of cholesky_decompose is currently not exposed at the language level. We plan to re-enable it in the near future, hopefully in this release cycle.
