Stan GPU flags

Hi, I have a couple of things to ask/mention.

  1. GPU flags on Windows (CmdStan)
  2. Needed flags for Stan-math (PyStan)

CmdStan

I was able to get the GPU running on Windows with CmdStan (via the CmdStanPy interface) by following the instructions in https://github.com/stan-dev/math/wiki/OpenCL-GPU-Routines

I only needed to change the LDFLAGS_OPENCL line (generic form: LDFLAGS_OPENCL= -L"$(CUDA_PATH)\lib\x64" -lOpenCL)

from

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" -lOpenCL

to

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" C:/Windows/System32/OpenCL.dll

Could someone explain why this works while -lOpenCL fails?

(I used the conda-installed mingw-w64 (gcc) and mingw32-make: conda install m2w64-toolchain -c msys2)
(The short path comes from cmdstanpy.utils.windows_short_path)
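
For completeness, the relevant part of make/local then looks roughly like this (a sketch; the device/platform IDs are the defaults for a single-GPU machine and may differ elsewhere):

# make/local -- rough sketch of the working configuration
STAN_OPENCL=true
OPENCL_DEVICE_ID=0
OPENCL_PLATFORM_ID=0
LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" C:/Windows/System32/OpenCL.dll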

PyStan

I then tried to do the same with PyStan (added the stan-math OpenCL library to the include path inside pystan/model.py, and added extra_link_args).

My input is

stan_model = pystan.StanModel(
    model_code=stan_code,
    extra_compile_args=[
        "-DSTAN_OPENCL",
        "-DOPENCL_DEVICE_ID=0",
        "-DOPENCL_PLATFORM_ID=0",
    ],
    extra_link_args=[
        '-L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64"',
        "C:/Windows/System32/OpenCL.dll",
    ],
    verbose=True,
)

The output is

Compiling C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.pyx because it changed.
[1/1] Cythonizing C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.pyx
building 'stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004' extension
C:\Users\user\miniconda3\envs\stan\Library\mingw-w64\bin\gcc.exe -mdll -O -Wall -DMS_WIN64 -DBOOST_RESULT_OF_USE_TR1 -DBOOST_NO_DECLTYPE -DBOOST_DISABLE_ASSERTS -IC:\Users\user\AppData\Local\Temp\tmp0rlxxu53 -Ic:\users\user\github\pystan\pystan -Ic:\users\user\github\pystan\pystan\stan\src -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\eigen_3.3.3 -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\boost_1.69.0 -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\sundials_4.1.0\include -Ic:\users\user\github\pystan\pystan\stan\lib\stan_math\lib\opencl_1.2.8 -IC:\Users\user\miniconda3\envs\stan\lib\site-packages\numpy\core\include -IC:\Users\user\miniconda3\envs\stan\include -IC:\Users\user\miniconda3\envs\stan\include -c C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cpp -o c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.o -O2 -ftemplate-depth-256 -Wno-unused-function -Wno-uninitialized -std=c++1y -D_hypot=hypot -pthread -fexceptions -DSTAN_OPENCL -DOPENCL_DEVICE_ID=0 -DOPENCL_PLATFORM_ID=0
writing c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cp37-win_amd64.def
C:\Users\user\miniconda3\envs\stan\Library\mingw-w64\bin\g++.exe -shared -s c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.o c:\users\user\appdata\local\temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cp37-win_amd64.def -LC:\Users\user\miniconda3\envs\stan\libs -LC:\Users\user\miniconda3\envs\stan\PCbuild\amd64 -lpython37 -lmsvcr140 -o C:\Users\user\AppData\Local\Temp\tmp0rlxxu53\stanfit4anon_model_9d0097b8e1c832bbbae3662f9bcf36e4_8566498776977557004.cp37-win_amd64.pyd -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" C:/Windows/System32/OpenCL.dll

This gets stuck at the linking step; I'm not sure what the problem is.

cc @ariddell

What are the needed flags for Stan-math GPU?

Model

The model is a GP with cholesky_decompose:

data {
  int<lower=1> N;
  real x[N];
  vector[N] y;
}
transformed data {
  vector[N] mu = rep_vector(0, N);
}
parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
}
model {
  matrix[N, N] L_K;
  matrix[N, N] K = cov_exp_quad(x, alpha, rho);
  real sq_sigma = square(sigma);

  // diagonal elements
  for (n in 1:N)
    K[n, n] = K[n, n] + sq_sigma;

  L_K = cholesky_decompose(K);

  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  sigma ~ std_normal();

  y ~ multi_normal_cholesky(mu, L_K);
}
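
For reference, with CmdStan on Windows a model like this is compiled from the CmdStan directory with mingw32-make, which picks up the OpenCL flags from make/local. A sketch, assuming the model is saved as large_gp2.stan under the path used elsewhere in this thread:

# run from the CmdStan directory; the target path mirrors the .stan file location
mingw32-make C:/Users/user/ONEDRI~1/Stan/large_gp2.exe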

You should not have to link to the OpenCL.dll. Just to be sure before I dig deeper: do you have the OpenCL.lib file in the C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64 folder?

Yes

Do you have a 32-bit lib folder with OpenCL.lib (I think it's lib/_x86_x64 or something like that)? Could you try linking with that to see if it changes anything?

We did not do any tests with the conda-installed mingw, just with what RTools installs by default. I think that one is mingw-w32, so I am guessing this is an issue that comes up with the 64-bit compiler that conda installs.

Sorry for asking you to debug this for me; I will try this on my Windows workstation in the afternoon. I will try to reproduce the issue with the conda-installed toolchain and update the instructions accordingly. Hopefully there is a better solution than writing down the path to the .dll, yuck.

Also, thanks for testing and sorry for the trouble.

32-bit folder

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32" -lOpenCL

C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
...
collect2.exe: error: ld returned 1 exit status
Whole output (32-bit):
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32\OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32/OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: skipping incompatible C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/Win32\OpenCL.lib when searching for -lOpenCL
C:/Users/user/miniconda3/envs/stan/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/bin/ld.exe: cannot find -lOpenCL
collect2.exe: error: ld returned 1 exit status

x64 folder

LDFLAGS_OPENCL= -L"C:/PROGRA~1/NVIDIA~2/CUDA/v10.1/lib/x64" -lOpenCL

C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text+0x54): undefined reference to `clGetPlatformInfo'
C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text+0x82): undefined reference to `clGetPlatformInfo'
C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text+0x126): undefined reference to `clGetDeviceInfo'
C:/Users/user/ONEDRI~1/Stan/large_gp2.o:large_gp2.hpp:(.text$_ZN2cl6detail7WrapperIP13_cl_device_idED2Ev[_ZN2cl6detail7WrapperIP13_cl_device_idED2Ev]+0x14): undefined reference to `clReleaseDevice'
...
collect2.exe: error: ld returned 1 exit status

Whole output (64-bit): github gist

GCC (mingw-w64)

g++ --version

g++ (Rev5, Built by MSYS2 project) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Somewhere on the internet [1, 2] there is some discussion that on Windows OpenCL.lib is built for MSVC and needs to be converted to libOpenCL.a before it works with mingw-w64 (a sketch of the conversion follows the references).

[1] https://stackoverflow.com/questions/15185955/compile-opencl-on-mingw-nvidia-sdk
[2] https://community.amd.com/thread/138890
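
Based on those threads, a possible workaround (an untested sketch; gendef ships with the mingw-w64 tools, dlltool with binutils) is to build a MinGW-compatible import library straight from the system DLL:

# untested sketch: generate a .def from the DLL, then build an import library
gendef C:/Windows/System32/OpenCL.dll        # writes OpenCL.def
dlltool -d OpenCL.def -D OpenCL.dll -l libOpenCL.a
# then point the linker at the directory holding libOpenCL.a:
# LDFLAGS_OPENCL= -L"path/to/that/dir" -lOpenCL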


Thanks! Really appreciate the effort! Will report back.

OK, some small updates:

CmdStan 2.20 + RTools 35 (W10 + AMD GPU)

-lOpenCL --> doesn't work
path_to_opencl.dll --> works

Small question about the code: does it pre-allocate memory for the calculations (input & output) at the start and then re-use the same memory, or does it allocate new memory each iteration?

edit: updated the -lOpenCL option


Nice!

It does creation/destruction at each iteration. I've thought about reusing the memory but was worried about taking up too much memory and running out of space. We could test that though!

In the GPU work I have done before, eliminating the creation/destruction step made my code much faster (in Python with CuPy).

Can we pre-test GPU memory limits?

Thanks for the update! There was a bug in 2.20 that caused linking issues for MPI and OpenCL (https://github.com/stan-dev/cmdstan/issues/718). If possible, please check whether you still have issues on develop. Thanks!

In the case of Gaussian processes the creation/destruction is not noticeable; operations like cholesky_decompose, mdivide_left_tri, etc. are what take the most time. And without keeping every memory buffer on the GPU we can do larger computations, which is probably more of a goal than being a few percent faster at borderline input sizes.

For GLMs on the experimental branch we actually do keep the constant data on the GPU. But the data size grows linearly there, and it's not such a huge problem if we leave a few hundred MB of data on the GPU (at least compared to a 16k x 16k matrix that takes up 2GB). Without leaving data on the GPU there would be less benefit to using it with GLMs, as the iteration time of GLMs is small anyway.

This hasn't been merged yet, mainly because we want to get this “leave stuff on the GPU” part right.

The creation/destruction is more noticeable if we use pinned memory, which we currently don't. I think CUDA recommends pinned memory or uses it by default; not sure what CuPy uses. Pinned memory offers faster transfers, but at the cost of roughly 100x slower allocations.

We can check the memory size in OpenCL and could have some sort of caching mechanism with LRU (a quick sketch of the query is below).
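
In the API that query is clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_SIZE (and CL_DEVICE_MAX_MEM_ALLOC_SIZE for the per-buffer cap). As a quick manual check, assuming the clinfo utility is installed, the same value shows up with:

# quick manual check of the device's total global memory
clinfo | grep -i "global memory size"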


I think CuPy is tightly integrated with CUDA.

The -lOpenCL option gives the following error:

src/cmdstan/main.o:main.cpp:(.text+0x190): undefined reference to `clGetPlatformInfo'
src/cmdstan/main.o:main.cpp:(.text+0x1be): undefined reference to `clGetPlatformInfo'
src/cmdstan/main.o:main.cpp:(.text+0x6e68): undefined reference to `clReleaseDevice'

With the current GitHub CmdStan (develop), the GPU works with RTools35.

make/local

STAN_OPENCL=true
OPENCL_DEVICE_ID=0
OPENCL_PLATFORM_ID=0
CC = g++
LDFLAGS_OPENCL= -L"$(AMDAPPSDKROOT)lib\x86_64" -lOpenCL