Can normal_id_glm use a GPU when the design matrix is not fixed?

I’m trying to make use of a GPU, working with the cmdstanr interface. The likelihood could be written with normal_id_glm_lpdf, though the design matrix is partly controlled by other parameters in the model. But it seems like the GPU is not being utilized whenever the design matrix is altered. For example, the GPU appears to be utilized (as reported by both nvidia-smi and nvtop) with normal_id_glm_lpdf(Y | X, 0, beta, sigma), though it stays at 0% with normal_id_glm_lpdf(Y | X * 2, 0, beta, sigma). Is that expected?

Thanks for the help,

SEED <- 04072020
set.seed(SEED)
N <- 1e6
K <- 2
X <- matrix(rnorm(N*K), ncol=K)
a <- 0
b <- rnorm(K)
sigma <- 1
Y <- rnorm(N, mean = a + (X*2) %*% b, sd = sigma)
data <- list(N = N, Y = Y, X = X, K = K, change = 1)

library(cmdstanr)
Sys.setenv(STANCFLAGS = "--use-opencl")
model <- cmdstan_model(
  "normal_glm.stan", 
  quiet = FALSE, 
  opencl = TRUE,
  opencl_platform_id = 0, 
  opencl_device_id = 0)

fit <- model$sample(data = data, num_chains = 1, num_cores = 1)
data {
  int<lower=1> N;  // number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int change;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  sigma ~ gamma(2,1);
  beta ~ std_normal();
  // not sent to GPU when design matrix changed?
  if (change > 0) target += normal_id_glm_lpdf(Y | X * 2, 0, beta, sigma);
  else target += normal_id_glm_lpdf(Y | X, 0, beta, sigma);
}
nvidia-smi output
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.130      Driver Version: 418.130      CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P100-8Q        On   | 00000000:02:04.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    795MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     21912      C   ./normal_glm                                 267MiB |
+-----------------------------------------------------------------------------+
Session Info
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       Ubuntu 18.04.4 LTS          
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/New_York            
 date     2020-04-07                  

─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version    date       lib source                            
 abind         1.4-5      2016-07-21 [1] CRAN (R 3.6.3)                    
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 3.6.1)                    
 backports     1.1.5      2019-10-02 [1] CRAN (R 3.6.1)                    
 callr         3.4.2      2020-02-12 [1] CRAN (R 3.6.1)                    
 checkmate     2.0.0      2020-02-06 [1] CRAN (R 3.6.1)                    
 cli           2.0.2      2020-02-28 [1] CRAN (R 3.6.3)                    
 cmdstanr    * 0.0.0.9000 2020-03-28 [1] Github (stan-dev/cmdstanr@ad9c58d)
 colorspace    1.4-1      2019-03-18 [1] standard (@1.4-1)                 
 crayon        1.3.4      2017-09-16 [1] CRAN (R 3.6.1)                    
 dplyr         0.8.4      2020-01-31 [1] CRAN (R 3.6.1)                    
 fansi         0.4.1      2020-01-08 [1] CRAN (R 3.6.1)                    
 ggplot2       3.2.1      2019-08-10 [1] CRAN (R 3.6.1)                    
 glue          1.3.2      2020-03-12 [1] CRAN (R 3.6.3)                    
 gridExtra     2.3        2017-09-09 [1] CRAN (R 3.6.1)                    
 gtable        0.3.0      2019-03-25 [1] standard (@0.3.0)                 
 inline        0.3.15     2018-05-18 [1] CRAN (R 3.6.1)                    
 jsonlite      1.6.1      2020-02-02 [1] CRAN (R 3.6.1)                    
 lazyeval      0.2.2      2019-03-15 [1] standard (@0.2.2)                 
 lifecycle     0.1.0      2019-08-01 [1] CRAN (R 3.6.1)                    
 loo           2.2.0      2019-12-19 [1] CRAN (R 3.6.1)                    
 magrittr      1.5        2014-11-22 [1] CRAN (R 3.6.1)                    
 matrixStats   0.55.0     2019-09-07 [1] CRAN (R 3.6.1)                    
 munsell       0.5.0      2018-06-12 [1] standard (@0.5.0)                 
 pillar        1.4.3      2019-12-20 [1] CRAN (R 3.6.1)                    
 pkgbuild      1.0.6      2019-10-09 [1] CRAN (R 3.6.1)                    
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 3.6.1)                    
 posterior     0.0.2      2020-03-28 [1] Github (jgabry/posterior@90d75b6) 
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 3.6.1)                    
 processx      3.4.2      2020-02-09 [1] CRAN (R 3.6.1)                    
 ps            1.3.2      2020-02-13 [1] CRAN (R 3.6.1)                    
 purrr         0.3.3      2019-10-18 [1] CRAN (R 3.6.1)                    
 R6            2.4.1      2019-11-12 [1] CRAN (R 3.6.1)                    
 Rcpp          1.0.3      2019-11-08 [1] CRAN (R 3.6.1)                    
 rlang         0.4.5      2020-03-01 [1] CRAN (R 3.6.3)                    
 rstan         2.19.3     2020-02-11 [1] CRAN (R 3.6.1)                    
 rstudioapi    0.11       2020-02-07 [1] CRAN (R 3.6.1)                    
 scales        1.1.0      2019-11-18 [1] CRAN (R 3.6.1)                    
 sessioninfo   1.1.1      2018-11-05 [1] standard (@1.1.1)                 
 StanHeaders   2.19.2     2020-02-11 [1] CRAN (R 3.6.1)                    
 tibble        2.1.3      2019-06-06 [1] CRAN (R 3.6.1)                    
 tidyselect    1.0.0      2020-01-27 [1] CRAN (R 3.6.1)                    
 withr         2.1.2      2018-03-15 [1] CRAN (R 3.6.1)                    

[1] /home/vm01/R/x86_64-pc-linux-gnu-library/3.6
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

I have to take a deeper look at this example but have you tried with transformed data

transformed data {
   matrix[N, K] X2 = X * 2;
}

and then using X2 in the call?

I hadn’t tried that. When X2 is built in the transformed data block, the utilization picks up (e.g., below). Though, at least in my application this won’t work since X needs to be built from model parameters.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.130      Driver Version: 418.130      CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID P100-8Q        On   | 00000000:02:04.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    819MiB /  8192MiB |     28%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     24541      C   ./normal_glm                                 291MiB |
+-----------------------------------------------------------------------------+

In order for normal_id_glm_lpdf to be faster on a GPU the second argument must be data (or made in transformed data). For now.

A bit of context if you are interested. Ignore if not:

In your case X * parameter is treated the same as a parameter as in “this would need to be copied to the GPU every time as it changes for each iteration” because the sampler runs on the CPU.

But if we were to also run multiply(X, parameter) on the GPU we would be able to support your case also (only the parameter would need to be transfered, which is cheap). And this is what we are aiming for, but not there yet, unfortunately for you.

2 Likes