CmdStan samples extremely slowly with GPU

Hi all! I’m working to enable cmdstanr/CmdStan with OpenCL on an Ubuntu Bionic system (and in a Docker container). I’ve followed the GPU install docs, and everything compiles OK: both CmdStan and the example GLM at the link.

When run against the GPU, though, the model samples extremely slowly. Trying again from the CLI shows:

$ ./lr_glm_opencl sample num_samples=100 num_warmup=100 data file=lr_glm.data.json output refresh=1 opencl device=1 platform=0

Gradient evaluation took 0.006266 seconds
1000 transitions using 10 leapfrog steps per transition would take 62.66 seconds.
Adjust your expectations accordingly!

The GPU is also barely used (~4% utilization in nvidia-smi).

Compiled without OpenCL, the model samples very quickly:

$ ./lr_glm_cpu sample num_samples=100 num_warmup=100 data file=lr_glm.data.json output refresh=1
Gradient evaluation took 1.8e-05 seconds
1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
Adjust your expectations accordingly!

I’m at a loss for how to troubleshoot this, given everything compiles OK. I’d appreciate any help!

  • CmdStan Version: 2.29.1
  • Compiler/Toolkit: gcc-7.5.0, CUDA compilation tools 11.4.152, CUDA 11.4

Can you share the Stan code and some sense of the input data size? It’s possible that very little of your code is amenable to GPU acceleration, and the overhead of passing information back and forth to the GPU is slowing everything down. That is, there might not be anything to troubleshoot.

Sure, the code and data I’m using are from this example of how to set up and test a Stan OpenCL setup: stan_gpu_install_docs/lr_glm_cmdstanr.R at master · bstatcomp/stan_gpu_install_docs · GitHub


@rok_cesnovar

Just to double-check: the GPU acceleration in Stan relies on OpenCL rather than CUDA. Do you have OpenCL set up?

Can you post the output from running clinfo? Could you also upload the generated .hpp file from the model? Then I can verify whether I get the same output locally. You can get the location of the file using the hpp_file() method:

> mod$hpp_file()
[1] "C:\\Users\\Andrew Johnson\\AppData\\Local\\Temp\\Rtmp0WNnrW\\model-5d18474d4d08.hpp"

Sure thing, thanks for the help. I’ve attached both files.

I have the same problem. It took 2 days to sample 60k variables.

Here are my results via brms:
OpenCL GPU support

m1 <- brm(Pattern_ID ~ Modal, data = icle_modals[1:100,], family = categorical(link = "logit"),  backend = "cmdstanr")
All 4 chains finished successfully.
Mean chain execution time: 67.9 seconds.
Total execution time: 273.1 seconds.

No OpenCL

m1 <- brm(Pattern_ID ~ Modal, data = icle_modals[1:100,], family = categorical(link = "logit"),  backend = "cmdstanr")
Running MCMC with 4 sequential chains...
All 4 chains finished successfully.
Mean chain execution time: 67.2 seconds.
Total execution time: 269.9 seconds.

Without GPU support it is slightly faster.

R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] brms_2.16.3         Rcpp_1.0.8.2        cmdstanr_0.4.0.9001 remotes_2.4.2

@tomshafer The config and generated model all appear OK to me.

A couple of next steps for testing:

  1. Is the slowdown still present if you run the model directly on the host system rather than in a Docker container?
  2. Would you be able to test against an older CmdStan version?

You can install an older version in cmdstanr using:

url <- "https://github.com/stan-dev/cmdstan/releases/download/v2.28.2/cmdstan-2.28.2.tar.gz"
cmdstanr::install_cmdstan(release_url = url, cores = parallel::detectCores())
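Once that finishes, point cmdstanr at the older install before re-running. A sketch, assuming the default install directory (adjust to wherever install_cmdstan() reports unpacking it):

# Assumed default cmdstanr install location; check the install output
cmdstanr::set_cmdstan_path("~/.cmdstanr/cmdstan-2.28.2")
cmdstanr::cmdstan_version()  # should now report "2.28.2"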

I don’t have access to an Nvidia GPU to test against, so I can’t test that side of things locally. @stevebronder I vaguely remember you having an Nvidia GPU, is that right? If so, would you mind checking the bstatcomp example code with the latest CmdStan when you have a minute?

@Fatih_Bozdag

From your example code you’re only working with 100 rows of data, which will be too small to see any benefit from GPU acceleration. Additionally, it looks like you’re manually enabling opencl using the cmdstan make/local file, is that right? If your aim is performance comparison, I’d recommend using the brms opencl argument instead, so that it’s always clear whether opencl is being used.
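For example, something like this (a sketch reusing your model; the (0, 0) ids assume your GPU is device 0 on platform 0, which clinfo can confirm):

library(brms)

# opencl requested per-model, so it’s unambiguous whether it’s in use
m1 <- brm(Pattern_ID ~ Modal,
          data = icle_modals,
          family = categorical(link = "logit"),
          backend = "cmdstanr",
          opencl = opencl(c(0, 0)))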

The lack of CUDA support is mostly an issue of balancing developer resources against end-user benefit. Stan developers are primarily volunteering their time, with some grants and other minor funding sources mixed in. Implementing CUDA support would benefit only a subset of users for the time spent, compared to OpenCL.

It seems I pasted the wrong code; for the first run the opencl argument was indeed used, as opencl(0,0). I ran the code on the full dataset, but it made no difference. How can I check that everything is set up as it should be?

When using brms with opencl acceleration, you will only see a benefit if brms generates Stan code that can actually use the acceleration. In Stan there is a categorical_logit_glm distribution which can be GPU-accelerated, but brms generates code which uses the categorical_logit distribution (not GPU-accelerated):

library(brms)

tmp_data <- data.frame(outcome = sample(1:4, 10, replace = T),
                      pred = rnorm(10))

make_stancode(outcome ~ pred,
              data = tmp_data,
              family = categorical("logit"),
              backend = "cmdstanr",
              opencl = opencl(c(0,0)))

Produces:

...
    for (n in 1 : N) {
      target += categorical_logit_lpmf(Y[n] | mu[n]);
    } 

This is because the categorical_logit_glm distribution is not available in the current version of rstan, and brms has to generate code that remains compatible with both rstan and cmdstanr. Note that this is also mentioned in the brms::opencl documentation: “Only some Stan functions can be run on a GPU at this point and so a lot of brms models won't benefit from OpenCL for now.”
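For reference, hand-written Stan code that can take the accelerated path calls the GLM form directly. A sketch, where X, alpha, and beta stand in for the design matrix, intercepts, and coefficient matrix that brms would otherwise build:

    // GPU-eligible: one vectorised GLM call replaces the N-loop above
    target += categorical_logit_glm_lpmf(Y | X, alpha, beta);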

If you’re going to be working with very large datasets that require days of computation time, you should most likely look to use Stan code itself (through cmdstanr or similar) and tune/optimise as needed, as brms has to generate code for maximum flexibility and compatibility, rather than speed and efficiency.
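A minimal sketch of that route with cmdstanr, assuming a hand-written model.stan and the same (0, 0) device ids as above (data_list stands in for your data):

library(cmdstanr)

# Compile with OpenCL support enabled
mod <- cmdstan_model("model.stan", cpp_options = list(stan_opencl = TRUE))

# Supply the platform/device ids at sampling time
fit <- mod$sample(data = data_list, opencl_ids = c(0, 0))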

Note that this discussion has strayed from the original topic, so I’d recommend opening a new topic if you have any more questions.


Sorry for interrupting the original topic, and many thanks for the clarifications, @andrjohns.


Thanks for the thoughts, @andrjohns. I’m not able to test on the host machine easily, but I can confirm it’s still slow through CmdStan 2.26.0.

I also see (I think) that Stan is using OpenCL 3, but Ubuntu 18.04 LTS seems only to support 2.0, so I might need to try to rebuild my container using something newer. If I can find the time to do that, I’ll report back.

Oh, I see that Stan Math supposedly uses OpenCL 1.2, so that’s maybe not the problem.

Sorry to resurrect the thread, but this finally explains why I haven’t seen benefits from GPU with brms either, compared to when I assign every core to a single chain using within-chain parallelization. Hopefully a future update will offer a remedy.
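For anyone comparing the two approaches, the within-chain parallelization I mean is brms’s threads argument; a sketch with placeholder names (y, x, df):

# 4 threads per chain via reduce_sum; requires the cmdstanr backend
m1 <- brm(y ~ x,
          data = df,
          backend = "cmdstanr",
          threads = threading(4))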
