Stan GPU flags

Hi, I have a couple of things to ask/mention.

  1. GPU flags on Windows (CmdStan)
  2. Needed flags for Stan-math (PyStan)

CmdStan

I was able to get the GPU running on Windows with CmdStan (via the CmdStanPy interface) by following the instructions in https://github.com/stan-dev/math/wiki/OpenCL-GPU-Routines

I only needed to change the LDFLAGS_OPENCL line (generic form: LDFLAGS_OPENCL= -L"$(CUDA_PATH)\lib\x64" -lOpenCL)

from

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" -lOpenCL

to

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" C:/Windows/System32/OpenCL.dll

Could someone explain why this works while -lOpenCL fails?

(I used the conda-installed mingw-w64 (gcc) and mingw32-make: conda install m2w64-toolchain -c msys2)
(The short path comes from cmdstanpy.utils.windows_short_path)
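
For completeness, the relevant part of make/local then looks roughly like this (a sketch; the device/platform IDs are the defaults for a single-GPU machine and may differ elsewhere):

# make/local -- rough sketch of the working configuration
STAN_OPENCL=true
OPENCL_DEVICE_ID=0
OPENCL_PLATFORM_ID=0
LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" C:/Windows/System32/OpenCL.dll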

PyStan

I then tried to do the same with PyStan (added the stan-math OpenCL library to the include path inside pystan/model.py, and added extra_link_args).

My input is

stan_model = pystan.StanModel(
    model_code=stan_code,
    extra_compile_args=[
        "-DSTAN_OPENCL",
        "-DOPENCL_DEVICE_ID=0",
        "-DOPENCL_PLATFORM_ID=0",
    ],
    extra_link_args=[
        '-L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64"',
        "C:/Windows/System32/OpenCL.dll",
    ],
    verbose=True,
)

The output is

Compiling C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.pyx because it changed.
[1/1] Cythonizing C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.pyx
building 'stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004' extension
C:\Users\user\miniconda3\envs\stan\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -IC:\Users\user\AppData\Local\Temp\tmp0rlxxu53 -Ic:\users\user\github\pystan\pystan -Ic:\users\user\github\pystan\pystan\stan\src -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\eigen_3.3.3 -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\boost_1.69.0 -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\sundials_4.1.0\include -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\opencl_1.2.8 -IC:\Users\user\miniconda3\envs\stan\lib\site-packages\numpy\core\include -IC:\Users\user\miniconda3\envs\stan\include -IC:\Users\user\miniconda3\envs\stan\include -c C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cpp -o c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.o -O2 -ftemplate-depth-256 -Wno-unused-function -Wno-uninitialized -std=c++1y -D_hypot=hypot -pthread -fexceptions -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0
writing c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cp37-win_amd64.def
C:\Users\user\miniconda3\envs\stan\Library\mingw-w64\bin\g++.exe -shared -s c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.o c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cp37-win_amd64.def -LC:\Users\user\miniconda3\envs\stan\libs -LC:\Users\user\miniconda3\envs\stan\PCbuild\amd64 -lpython37 -lmsvcr140 -o C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cp37-win_amd64.pyd -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" C:/Windows/System32/OpenCL.dll

This gets stuck at the linking step; I'm not sure what the problem is.

cc @ariddell

What are the needed flags for Stan-math GPU?

Model

The model is a GP with cholesky_decompose:

data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}
transformed data {
  vector[N] mu = rep_vector(0, N);
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N, N] L_K;
  matrix[N, N] K = cov_exp_quad(x, alpha, rho);
  real sq_sigma = square(sigma);

  // diagonal elements
  for (n in 1:N)
    K[n, n] = K[n, n] + sq_sigma;

  L_K = cholesky_decompose(K);

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  y ~ multi_normal_cholesky(mu, L_K);
}
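
For reference, with CmdStan on Windows a model like this is compiled from the CmdStan directory with mingw32-make, which picks up the OpenCL flags from make/local. A sketch, assuming the model is saved as large_gp2.stan under the path used elsewhere in this thread:

# run from the CmdStan directory; the target path mirrors the .stan file location
mingw32-make C:/Users/user/ONEDRI~1/Stan/large_gp2.exe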

You should not have to link to the OpenCL.dll. Just to be sure before I dig deeper: do you have the OpenCL.lib file in the C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64 folder?

Yes

Do you have a 32-bit lib folder with OpenCL.lib (I think it's lib/_x86_x64 or something like that)? Could you try linking with that to see if it changes anything?

We did not do any tests with the conda-installed mingw, just with what RTools installs by default. I think that one is mingw-w32, so I am guessing this is an issue that comes up with the 64-bit compiler that conda installs.

Sorry for asking you to debug this for me; I will try this on my Windows workstation in the afternoon. I will try to reproduce the issue with the conda-installed toolchain and update the instructions accordingly. Hopefully there is a better solution than writing down the path to the .dll, yuck.

Also, thanks for testing and sorry for the trouble.

32-bit folder

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32" -lOpenCL

C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
...
collect2.exe: error: ld returned 1 exit status
Whole output (32-bit):
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32\OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32\OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lOpenCL
collect2.exe: error: ld returned 1 exit status

x64 folder

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" -lOpenCL

C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text+0x54): undefined reference to `clGetPlatformInfo'
C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text+0x82): undefined reference to `clGetPlatformInfo'
C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text+0x126): undefined reference to `clGetDeviceInfo'
C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text$_ZN2cl6detail7WrapperIP13_cl_device_idED2Ev[_ZN2cl6detail7WrapperIP13_cl_device_idED2Ev]+0x14): undefined reference to `clReleaseDevice'
...
collect2.exe: error: ld returned 1 exit status

Whole output (64-bit): github gist

GCC (mingw-w64)

g++ --version

g++ (Rev5, Built by MSYS2 project) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Somewhere on the internet [1, 2] there is some discussion that on Windows OpenCL.lib is built for MSVC and needs to be converted to libOpenCL.a before it works with mingw-w64 (a sketch of the conversion follows the references).

[1] https://stackoverflow.com/questions/15185955/compile-opencl-on-mingw-nvidia-sdk
[2] https://community.amd.com/thread/138890
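
Based on those threads, a possible workaround (an untested sketch; gendef ships with the mingw-w64 tools, dlltool with binutils) is to build a MinGW-compatible import library straight from the system DLL:

# untested sketch: generate a .def from the DLL, then build an import library
gendef C:/Windows/System32/OpenCL.dll        # writes OpenCL.def
dlltool -d OpenCL.def -D OpenCL.dll -l libOpenCL.a
# then point the linker at the directory holding libOpenCL.a:
# LDFLAGS_OPENCL= -L"path/to/that/dir" -lOpenCL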


Thanks! Really appreciate the effort! Will report back.

OK, some small updates:

CmdStan 2.20 + RTools 35 (W10 + AMD GPU)

-lOpenCL --> doesn't work
path_to_opencl.dll --> works

Small question about the code: does it pre-allocate memory for the calculations (input & output) at the start and then re-use the same memory, or does it allocate new memory each iteration?

edit: updated the -lOpenCL option


Nice!

It does creation/destruction at each iteration. I've thought about reusing the memory but was worried about taking up too much memory and running out of space. We could test that though!

In the GPU work I have done before, eliminating the creation/destruction step made my code much faster (in Python with CuPy).

Can we pre-test GPU memory limits?

Thanks for the update! There was a bug in 2.20 that caused linking issues for MPI and OpenCL (https://github.com/stan-dev/cmdstan/issues/718). If possible, please check whether you still have issues on develop. Thanks!

In the case of Gaussian processes the creation/destruction is not noticeable; operations like cholesky_decompose, mdivide_left_tri, etc. are what take the most time. And without keeping every memory buffer on the GPU we can do larger computations, which is probably more of a goal than being a few percent faster at borderline input sizes.

For GLMs on the experimental branch we actually do keep the constant data on the GPU. But the data size grows linearly there, and it's not such a huge problem if we leave a few hundred MB of data on the GPU (at least compared to a 16k x 16k matrix that takes up 2GB). Without leaving data on the GPU there would be less benefit to using it with GLMs, as the iteration time of GLMs is small anyway.

This hasn't been merged yet, mainly because we want to get this “leave stuff on the GPU” part right.

The creation/destruction is more noticeable if we use pinned memory, which we currently don't. I think CUDA recommends pinned memory or uses it by default; not sure what CuPy uses. Pinned memory offers faster transfers, but at the cost of roughly 100x slower allocations.

We can check the memory size in OpenCL and could have some sort of caching mechanism with LRU (a quick sketch of the query is below).
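
In the API that query is clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_SIZE (and CL_DEVICE_MAX_MEM_ALLOC_SIZE for the per-buffer cap). As a quick manual check, assuming the clinfo utility is installed, the same value shows up with:

# quick manual check of the device's total global memory
clinfo | grep -i "global memory size"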


I think CuPy is tightly integrated with CUDA.

The -lOpenCL option gives the following error:

src/cmdstan/main.o:main.cpp:(.text+0x190): undefined reference to `clGetPlatformInfo'
src/cmdstan/main.o:main.cpp:(.text+0x1be): undefined reference to `clGetPlatformInfo'
src/cmdstan/main.o:main.cpp:(.text+0x6e68): undefined reference to `clReleaseDevice'

With the current GitHub CmdStan (develop), the GPU works with RTools35.

make/local

STAN_OPENCL=true
OPENCL_DEVICE_ID=0
OPENCL_PLATFORM_ID=0
CC = g++
LDFLAGS_OPENCL= -L"$(AMDAPPSDKROOT)lib\x86_64" -lOpenCL