CmdStan samples extremely slowly with GPU

Hi all! I’m working to enable cmdstanr/CmdStan with OpenCL on an Ubuntu Bionic system (and in a Docker container). I’ve followed the GPU install docs, and everything compiles OK: both CmdStan and the example GLM at the link.

When run against the GPU, though, the model samples extremely slowly. Trying again from the CLI shows:

$ ./lr_glm_opencl sample num_samples=100 num_warmup=100 data file=lr_glm.data.json output refresh=1 opencl device=1 platform=0

Gradient evaluation took 0.006266 seconds
1000 transitions using 10 leapfrog steps per transition would take 62.66 seconds.
Adjust your expectations accordingly!

The GPU is also barely used (~4% utilization in nvidia-smi).

Compiled without OpenCL, the model samples very quickly:

$ ./lr_glm_cpu sample num_samples=100 num_warmup=100 data file=lr_glm.data.json output refresh=1
Gradient evaluation took 1.8e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
Adjust your expectations accordingly!

I’m at a loss for how to troubleshoot this, given everything compiles OK. I’d appreciate any help!

  • CmdStan Version: 2.29.1
  • Compiler/Toolkit: gcc-7.5.0, CUDA compilation tools 11.4.152, CUDA 11.4

Can you share the Stan code and some sense of the input data size? It’s possible that very little of your code is amenable to GPU acceleration, and the overhead of passing information back and forth to the GPU is slowing everything down. That is, there might not be anything to troubleshoot.

Sure, the code and data I’m using are from this example of how to set up and test a Stan OpenCL setup: stan_gpu_install_docs/lr_glm_cmdstanr.R at master · bstatcomp/stan_gpu_install_docs · GitHub


@rok_cesnovar

Just to double-check: the GPU acceleration in Stan relies on OpenCL rather than CUDA. Do you have OpenCL set up?

Can you post the output from running clinfo? Could you also upload the generated .hpp file from the model? Then I can verify whether I get the same output locally. You can get the location of the file using the hpp_file() method:

> mod$hpp_file()
[1] "C:\\Users\\Andrew Johnson\\AppData\\Local\\Temp\\Rtmp0WNnrW\\model-5d18474d4d08.hpp"

Sure thing, thanks for the help. I’ve attached both files.

I have the same problem. It took 2 days to sample 60k variables.

Here are my results via brms:
OpenCL GPU support

m1 <- brm(Pattern_ID ~ Modal, data = icle_modals[1:100,], family = categorical(link = "logit"),  backend = "cmdstanr")
All 4 chains finished successfully.
Mean chain execution time: 67.9 seconds.
Total execution time: 273.1 seconds.

No OpenCL

m1 <- brm(Pattern_ID ~ Modal, data = icle_modals[1:100,], family = categorical(link = "logit"),  backend = "cmdstanr")
Running MCMC with 4 sequential chains...
All 4 chains finished successfully.
Mean chain execution time: 67.2 seconds.
Total execution time: 269.9 seconds.

Without GPU support it is slightly faster.

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] brms_2.16.3         Rcpp_1.0.8.2        cmdstanr_0.4.0.9001 remotes_2.4.2

@tomshafer The config and generated model all appear OK to me.

A couple of next steps for testing:

  1. Is the slowdown still present if you run the model directly on the host system rather than in a Docker container?
  2. Would you be able to test against an older CmdStan version?

You can install an older version in cmdstanr using:

url <- "https://github.com/stan-dev/cmdstan/releases/download/v2.28.2/cmdstan-2.28.2.tar.gz"
cmdstanr::install_cmdstan(release_url = url, cores = parallel::detectCores())
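Once that finishes, point cmdstanr at the older install before re-running. A sketch, assuming the default install directory (adjust to wherever install_cmdstan() reports unpacking it):

# Assumed default cmdstanr install location; check the install output
cmdstanr::set_cmdstan_path("~/.cmdstanr/cmdstan-2.28.2")
cmdstanr::cmdstan_version()  # should now report "2.28.2"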

I don’t have access to an Nvidia GPU to test against, so I can’t test that side of things locally. @stevebronder I vaguely remember you having an Nvidia GPU, is that right? If so, would you mind checking the bstatcomp example code with the latest CmdStan when you have a minute?

@Fatih_Bozdag

From your example code you’re only working with 100 rows of data, which will be too small to see any benefit from GPU acceleration. Additionally, it looks like you’re manually enabling opencl using the cmdstan make/local file, is that right? If your aim is performance comparison, I’d recommend using the brms opencl argument instead, so that it’s always clear whether opencl is being used.
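For example, something like this (a sketch reusing your model; the (0, 0) ids assume your GPU is device 0 on platform 0, which clinfo can confirm):

library(brms)

# opencl requested per-model, so it’s unambiguous whether it’s in use
m1 <- brm(Pattern_ID ~ Modal,
          data = icle_modals,
          family = categorical(link = "logit"),
          backend = "cmdstanr",
          opencl = opencl(c(0, 0)))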

The lack of CUDA support is mostly an issue of balancing developer resources against end-user benefit. Stan developers are primarily volunteering their time, with some grants and other minor funding sources mixed in. Implementing CUDA support would benefit only a subset of users for the time spent, compared to OpenCL.

It seems I pasted the wrong code; for the first run the opencl argument was indeed used, as opencl(0,0). I ran the code on the full dataset, but it made no difference. How can I check that everything is set up as it should be?

When using brms with opencl acceleration, you will only see a benefit if brms generates Stan code that can actually use the acceleration. In Stan there is a categorical_logit_glm distribution which can be GPU-accelerated, but brms generates code which uses the categorical_logit distribution (not GPU-accelerated):

library(brms)

tmp_data <- data.frame(outcome = sample(1:4, 10, replace = T),
                      pred = rnorm(10))

make_stancode(outcome ~ pred,
              data = tmp_data,
              family = categorical("logit"),
              backend = "cmdstanr",
              opencl = opencl(c(0,0)))

Produces:

...
    for (n in 1 : N) {
      target += categorical_logit_lpmf(Y[n] | mu[n]);
    } 

This is because the categorical_logit_glm distribution is not available in the current version of rstan, and brms has to generate code that remains compatible with both rstan and cmdstanr. Note that this is also mentioned in the brms::opencl documentation: “Only some Stan functions can be run on a GPU at this point and so a lot of brms models won't benefit from OpenCL for now.”
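For reference, hand-written Stan code that can take the accelerated path calls the GLM form directly. A sketch, where X, alpha, and beta stand in for the design matrix, intercepts, and coefficient matrix that brms would otherwise build:

    // GPU-eligible: one vectorised GLM call replaces the N-loop above
    target += categorical_logit_glm_lpmf(Y | X, alpha, beta);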

If you’re going to be working with very large datasets that require days of computation time, you should most likely look to use Stan code itself (through cmdstanr or similar) and tune/optimise as needed, as brms has to generate code for maximum flexibility and compatibility, rather than speed and efficiency.
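A minimal sketch of that route with cmdstanr, assuming a hand-written model.stan and the same (0, 0) device ids as above (data_list stands in for your data):

library(cmdstanr)

# Compile with OpenCL support enabled
mod <- cmdstan_model("model.stan", cpp_options = list(stan_opencl = TRUE))

# Supply the platform/device ids at sampling time
fit <- mod$sample(data = data_list, opencl_ids = c(0, 0))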

Note that this discussion has strayed from the original topic, so I’d recommend opening a new topic if you have any more questions.


Sorry for interrupting the original topic, and many thanks for the clarifications, @andrjohns.


Thanks for the thoughts, @andrjohns. I’m not able to test on the host machine easily, but I can confirm it’s still slow through CmdStan 2.26.0.

I also see (I think) that Stan is using OpenCL 3, but Ubuntu 18.04 LTS seems only to support 2.0, so I might need to try to rebuild my container using something newer. If I can find the time to do that, I’ll report back.

Oh, I see that Stan Math supposedly uses OpenCL 1.2, so that’s maybe not the problem.

Sorry to resurrect the thread, but this finally explains why I haven’t seen benefits from GPU with brms either, compared to when I assign every core to a single chain using within-chain parallelization. Hopefully a future update will offer a remedy.
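For anyone comparing the two approaches, the within-chain parallelization I mean is brms’s threads argument; a sketch with placeholder names (y, x, df):

# 4 threads per chain via reduce_sum; requires the cmdstanr backend
m1 <- brm(y ~ x,
          data = df,
          backend = "cmdstanr",
          threads = threading(4))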
