CmdStan: CPU faster than GPU?

Hello everyone,

I am running a hidden Markov model on a Linux server. I was hoping to gain speed by running on a GPU instead of a CPU, but the GPU run takes roughly five times as long.

The server reports NVIDIA driver 440.33.01 (via nvidia-smi). One thing I noticed is that the Nvidia driver automatically downgrades the OpenCL version I load.

If I run `module load gnu/7.4.0 opencl/2.2` instead of `module load gnu/7.4.0 cuda/10.2`, the estimation still generates no load on the GPU and runs very slowly.

According to the tutorial on running CmdStan on a GPU, I don’t have to make any adjustments to my model to run it on a GPU. So I am wondering whether my model’s structure prevents it from benefiting from a GPU. Is there anything else I can try?

Looking forward to your ideas.



The list of currently sped-up functions is given here:

For now, the only functions that get a speedup are the lpdf/lpmf distribution functions, Cholesky decomposition, and matrix multiplication. For the rest you will have to wait for Stan 2.27 or tinker with C++ (the backend support is already there, it’s just not used at the moment).


Also, the best guide for installing and using OpenCL with CmdStan is available in the CmdStan guide: 14 Parallelization | CmdStan User’s Guide

It’s fairly fresh, so it doesn’t yet get picked up by Google when you search for OpenCL or GPU together with Stan. So I’m just taking this opportunity to share it again.
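For anyone following that guide, a minimal sketch of the setup it describes: instead of passing the OpenCL flags on every `make` invocation, you can set them once in `make/local` in the CmdStan directory (the platform/device IDs below are examples; adjust them to your machine, and check the guide for the authoritative variable names):

```shell
# In the CmdStan directory: record the OpenCL build flags in make/local
# so every subsequent build picks them up automatically.
cat >> make/local <<'EOF'
STAN_OPENCL=true
OPENCL_PLATFORM_ID=0
OPENCL_DEVICE_ID=0
EOF

# Rebuild the CmdStan tools and then the model with OpenCL enabled.
make clean-all
make build
make path/to/your_model
```

This is a build-configuration fragment rather than a runnable program, so treat it as a template for your own paths.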


Thanks. My model uses lpdf/lpmf distribution functions and matrix multiplications, so I expected at least a small speedup. What puzzles me is that the model runs much slower on the GPU.

Is there anything I can try to see if my installation is correct?

I noticed that you added a message that provides a link to the “best” guide. I have checked it and this is what I did…

Yes, the instructions you pointed to are good; they are just written for Stan Math, which can confuse some people. Clearly it didn’t confuse you, which is good :) I just linked them again so they get more exposure.

Well, this really depends on the input sizes and which lpdf you are using. For example, the Poisson distribution can be much faster, while the Bernoulli one is only slightly faster. If you can’t share the model, I would advise profiling it to see the bottlenecks and where the slowdown comes from.

Profiling example: GitHub
This is an example with cmdstanr: Profiling Stan programs with CmdStanR • cmdstanr
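To illustrate what the linked profiling workflow produces: a run whose model contains `profile()` blocks writes per-block timings to the file named by `profile_file` (`profile.csv` by default), which you can then sort to find the bottleneck. The file contents below are made up for illustration, and the column layout (with `total_time` in column 3) is assumed from the profiling docs:

```shell
# Hypothetical profile.csv for illustration only -- the real file is
# produced by a CmdStan run whose model uses profile() blocks.
cat > profile.csv <<'EOF'
name,thread_id,total_time,forward_time,reverse_time,chain_stack,no_chain_stack,autodiff_calls,no_autodiff_calls
likelihood,1,12.4,8.1,4.3,1000,0,4000,0
priors,1,0.9,0.6,0.3,200,0,4000,0
EOF

# Sort profile blocks by total_time (column 3), slowest first,
# to see where the sampler spends its time.
tail -n +2 profile.csv | sort -t, -k3 -nr
```

With timings like these, the `likelihood` block would show up first as the dominant cost.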

GPU use isn’t always faster, since it involves transferring data between host and device.


Profiling looks like a helpful tool; I will take a look, thanks a lot.

If you don’t mind, could you please take a look if the calls in my batch script are correct?

#!/bin/bash -l

#SBATCH --partition=projectsexc
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mem=8GB
#SBATCH --time=3-0
#SBATCH --output=%x-%j.out
#SBATCH --account=username

module load gnu/7.4.0 cuda/10.2

cd $HOME/cmdstan-2.26.0/

make STAN_OPENCL=true OPENCL_PLATFORM_ID=0 OPENCL_DEVICE_ID=0 $HOME/cmdstan-2.26.0/jobfiles/hurdle_mod

cd $HOME/cmdstan-2.26.0/jobfiles/

./hurdle_mod sample algorithm=hmc engine=nuts max_depth=10 num_samples=2000 random seed=1 data file=$HOME/data/hurdle_data.R opencl platform=0 device=0 output file=output_1.csv

And here is the corresponding output:

method = sample (Default)
    num_samples = 2000
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delta = 0.80000000000000004 (Default)
      kappa = 0.75 (Default)
      t0 = 10 (Default)
      init_buffer = 75 (Default)
      term_buffer = 50 (Default)
      window = 25 (Default)
    algorithm = hmc (Default)
        engine = nuts (Default)
            max_depth = 10 (Default)
        metric = diag_e (Default)
        metric_file =  (Default)
        stepsize = 1 (Default)
        stepsize_jitter = 0 (Default)
id = 0 (Default)
  file = /home/data/hurdle_data.R
init = 2 (Default)
  seed = 1
  file = output_1.csv
  diagnostic_file =  (Default)
  refresh = 100 (Default)
  sig_figs = -1 (Default)
  profile_file = profile.csv (Default)
  device = 0
  platform = 0
opencl_platform_name = NVIDIA CUDA
opencl_device_name = Tesla V100-SXM2-32GB

Thank you.

Setting `OPENCL_PLATFORM_ID` and `OPENCL_DEVICE_ID` in the make call is redundant, as you already supply the IDs at runtime in

 opencl platform=0 device=0 output file=output_1.csv

Otherwise it seems good.
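Concretely, the batch script above could be trimmed along these lines (paths as in your script; a sketch, not a tested replacement):

```shell
# Compile with OpenCL support only; no hard-coded platform/device IDs
# at build time.
make STAN_OPENCL=true $HOME/cmdstan-2.26.0/jobfiles/hurdle_mod

# Select the platform and device at runtime instead, as the sampler
# call already does:
#   ./hurdle_mod sample ... opencl platform=0 device=0 output file=output_1.csv
```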
