Stan on the GPU using OpenCL/ViennaCL
We have started discussing this topic in an e-mail thread with Bob and Steve and we decide to move the discussion to the forums.
Quick intro:
Me and Erik Štrumbelj from University of Ljubljana recieved a grant for a 3 year project where we are trying to speedup algorithms for Bayesian inference on the GPU. In the first part of the project we created a R package for the execution of Bayesian lasso and Multinomial Logistic Regression on the GPU. We have gained speedups of 100 to 150-fold compared to the sequential implementations in C/C++. In the next 2 years we plan to focus on using GPUs to speedup the Stan C++ code. Our plan is to use OpenCL in order to be able to run the software on GPUs of all major vendors (NVIDIA, ATI/AMD, Intel).
Final goal: Make Stan run 100 times faster for models/data of mid to high complexity/size on mid to high-range GPUs. And, if possible, to not run slower on smaller problems.
We all seem to agree that the best approach is to extend the Stan compiler so that it compiles GPU code - OpenCL or one of its libraries (ViennaCL,… ). We generally like ViennaCL and in the short term it seems as the easiest way to getting some speedups. We do however have some concerns we should discuss:
- How quickly can new additions be integrated - We will probably have a lot of non-standard matrix/vector operations that need to be parallelized and executed on the GPU(for instance adding to diagonal, etc …)?
- How optimized is ViennaCL for dense matrix operations? We have created our own Cholesky decomposition that still has some room for improvement as it was done quickly. The code is currently almost twice as fast as the ViennaCL for Cholesky. (~80-fold faster compared to LLT and ~40-fold compared to LU with a AMD HD7570 (mid to high range relatively old gpu) ). I have posted our code/results on Dropbox: https://www.dropbox.com/s/foyzn6llhpifwvf/eigen_vs_vienna_dense_lu.tar.gz?dl=0. To compare I used the code Steve posted a week ago. The results are in the HD7570.txt and K20m.txt files (I am a new user so I could not attach files).
The other question we should maybe discuss is how much of the computation to move to GPU?
Parallelize only parts of the code (matrix/vector operations, embarrassingly parallel for loops…)
- least amount of work
- a lot of data transfers from and to the GPU
Move all posterior/gradient/transformation/generated quantities calculation to GPU?
- only parameters need to be transferred, data can be transferred to GPU only once
- we’d need a GPU alternative for everything that Stan does
A hybrid of the two above?
- It would not be costly to detect during compilation what data can be permanently moved to GPU and what partially computed quantities can remain on GPU. Maybe this can be done very effectively.
Move everything to GPU (including NUTS sampler)?
- No data transfers.
- Data transfer benefits probably won’t justify the added complexity of parallelizing the sampler.
- Would have to do it again for each sampler.
The third question that we had is if we are restricted to double precision arithmetic? We could for instance copy a double array to a float array and compute with floats on a GPU and then copy the results back to a double array. What is the largest error that is still acceptable? If this would be possible that would bring another 2-fold speedup on most GPUs (non-computing-only).