CmdStan: CPU faster than GPU?

Hello everyone,

I am running a hidden Markov model on a Linux server. I was hoping to gain speed by running on a GPU instead of a CPU, but the GPU run takes roughly five times as long.

The server reports NVIDIA driver 440.33.01 (via nvidia-smi). One thing I noticed is that the Nvidia driver automatically downgrades the OpenCL version I load.

If I run `module load gnu/7.4.0 opencl/2.2` instead of `module load gnu/7.4.0 cuda/10.2`, the estimation still generates no load on the GPU and runs very slowly.

According to the tutorial on running CmdStan on a GPU, I don’t have to make any adjustments to my model to run it on a GPU. So I am wondering whether my model’s structure prevents it from benefiting from a GPU. Is there anything else I can try?

Looking forward to your ideas.



The list of currently sped-up functions is given here:

For now, the only functions that get a speedup are the lpdf/lpmf distribution functions, Cholesky decomposition, and matrix multiplication. For the rest you will have to wait for Stan 2.27 or tinker with C++ (the backend support is already there, it’s just not used at the moment).


Also, the best guide for installing and using OpenCL with CmdStan is available in the CmdStan guide: 14 Parallelization | CmdStan User’s Guide

It’s fairly fresh, so it doesn’t yet get picked up by Google when you search for OpenCL or GPU together with Stan. So I’m just taking this opportunity to share it again.
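For anyone following that guide, a minimal sketch of the setup it describes: instead of passing the OpenCL flags on every `make` invocation, you can set them once in `make/local` in the CmdStan directory (the platform/device IDs below are examples; adjust them to your machine, and check the guide for the authoritative variable names):

```shell
# In the CmdStan directory: record the OpenCL build flags in make/local
# so every subsequent build picks them up automatically.
cat >> make/local <<'EOF'
STAN_OPENCL=true
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0
EOF

# Rebuild the CmdStan tools and then the model with OpenCL enabled.
make clean-all
make build
make path/to/your_model
```

This is a build-configuration fragment rather than a runnable program, so treat it as a template for your own paths.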


Thanks. My model uses lpdf/lpmf distribution functions and matrix multiplications, so I expected at least a small speedup. What puzzles me is that the model runs much slower on the GPU.

Is there anything I can try to see if my installation is correct?

I noticed that you added a message that provides a link to the “best” guide. I have checked it and this is what I did…

Yes, the instructions you pointed to are good; they are just written for Stan Math, which can confuse some people. Clearly it didn’t confuse you, which is good :) I just linked them again so they get more exposure.

Well, this really depends on the input sizes and which lpdf you are using. For example, the Poisson distribution can be much faster, while the Bernoulli one is only slightly faster. If you can’t share the model, I would advise profiling it to see the bottlenecks and where the slowdown comes from.

Profiling example: GitHub
This is an example with cmdstanr: Profiling Stan programs with CmdStanR • cmdstanr
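To illustrate what the linked profiling workflow produces: a run whose model contains `profile()` blocks writes per-block timings to the file named by `profile_file` (`profile.csv` by default), which you can then sort to find the bottleneck. The file contents below are made up for illustration, and the column layout (with `total_time` in column 3) is assumed from the profiling docs:

```shell
# Hypothetical profile.csv for illustration only -- the real file is
# produced by a CmdStan run whose model uses profile() blocks.
cat > profile.csv <<'EOF'
name,thread_id,total_time,forward_time,reverse_time,chain_stack,no_chain_stack,autodiff_calls,no_autodiff_calls
likelihood,1,12.4,8.1,4.3,1000,0,4000,0
priors,1,0.9,0.6,0.3,200,0,4000,0
EOF

# Sort profile blocks by total_time (column 3), slowest first,
# to see where the sampler spends its time.
tail -n +2 profile.csv | sort -t, -k3 -nr
```

With timings like these, the `likelihood` block would show up first as the dominant cost.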

GPU use isn’t always faster, since it involves transferring data between host and device.


Profiling looks like a helpful tool; I will take a look, thanks a lot.

If you don’t mind, could you please take a look if the calls in my batch script are correct?

#!/bin/bash -l

#SBATCH --partition=projectsexc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mem=8GB
#SBATCH --time=3-0
#SBATCH --output=%x-%j.out
#SBATCH --account=username

module load gnu/7.4.0 cuda/10.2

cd $HOME/cmdstan-2.26.0/

make STAN_OPENCL=true OPENCL_PLATFORM_ID=0 OPENCL_DEVICE_ID=0 $HOME/cmdstan-2.26.0/jobfiles/hurdle_mod

cd $HOME/cmdstan-2.26.0/jobfiles/

./hurdle_mod sample algorithm=hmc engine=nuts max_depth=10 num_samples=2000 random seed=1 data file=$HOME/data/hurdle_data.R opencl platform=0 device=0 output file=output_1.csv

And here is the corresponding output:

method = sample (Default)
    num_samples = 2000
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
        engine = nuts (Default)
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
  file = /home/data/hurdle_data.R
init = 2 (Default)
  seed = 1
  file = output_1.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  device = 0
  platform = 0
opencl_platform_name = NVIDIA CUDA
opencl_device_name = Tesla V100-SXM2-32GB

Thank you.

Setting `OPENCL_PLATFORM_ID` and `OPENCL_DEVICE_ID` in the make call is redundant, as you already supply the IDs at runtime in

 opencl platform=0 device=0 output file=output_1.csv

Otherwise it seems good.
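Concretely, the batch script above could be trimmed along these lines (paths as in your script; a sketch, not a tested replacement):

```shell
# Compile with OpenCL support only; no hard-coded platform/device IDs
# at build time.
make STAN_OPENCL=true $HOME/cmdstan-2.26.0/jobfiles/hurdle_mod

# Select the platform and device at runtime instead, as the sampler
# call already does:
#   ./hurdle_mod sample ... opencl platform=0 device=0 output file=output_1.csv
```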
