Mike Lawrence suggested we might either make use of the gpuR package or reach out to its developers for help with GPU integration for Stan.
That might help with R installs. CRAN probably isn’t going to let us bundle our GPU dependencies unless they’re very small, so we might be able to use it the same way we use BH and RcppEigen.
P.S. I’m not going to post anything to the “developers” tag until it’s open to user comments.
Actually, I am not sure in the case of GPUs. We could certainly go that way if we were relying on ViennaCL. But if @rok_cesnovar is building a non-ViennaCL GPU library, then we will just have to put it into StanHeaders like we do with CVODES (there is an R package that bundles CVODE, without the S). Either way, not a big deal.
The amount of additional code for GPU support is not overwhelming, and I do not think it will be a problem.
For instance:
In terms of files, the core of the GPU lib is in 10 files. Beyond that, taking Cholesky as an example, there are 2 additional files in the Stan Math /rev/mat/fun and /prim/mat/fun directories.
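To make that concrete, here is a rough sketch of what the prim-side entry point could look like. The file path, function name, and body are my assumptions for illustration, not the actual GPU library code:

```cpp
// stan/math/prim/mat/fun/cholesky_decompose_gpu.hpp (hypothetical name)
#include <Eigen/Dense>

namespace stan {
namespace math {

// prim version: plain double-valued Cholesky that would offload to the GPU.
inline Eigen::MatrixXd cholesky_decompose_gpu(const Eigen::MatrixXd& A) {
  // A real implementation would (1) copy A into an OpenCL buffer,
  // (2) enqueue a blocked Cholesky kernel, and (3) read back the
  // lower-triangular factor L. A CPU stand-in keeps the sketch compilable:
  return A.llt().matrixL();
}

}  // namespace math
}  // namespace stan
```

The matching file under /rev/mat/fun would then wrap this in the autodiff machinery to supply the reverse-mode gradient of the Cholesky factor.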
Won’t we also need a dependency on the OpenCL GPU library?
We’ve been talking to some mxnet developers about using their sparse matrix libraries if they wind up implementing them with derivatives in a way that won’t lead to a dependency on all of mxnet. They went with CUDA only, claiming that OpenCL didn’t have the performance to be worth coding for. I have no idea what the reality is here. Mxnet is nice for us in that it supports double-precision arithmetic.
But the same goes if you decide to go with any other library. No matter what they do, the underlying library is either CUDA or OpenCL. To run CUDA you will also need a supported driver (which is not a problem, same as with OpenCL) and the CUDA Toolkit with the nvcc compiler in order to compile the GPU code.
As far as performance goes, CUDA does have some performance advantages on NVIDIA GPUs when fine-tuning for specific GPU architectures, and NVIDIA has put some focus on better support for deep learning. But I would dispute the claim that it is not worth coding in OpenCL, as the performance difference is not that big. Applications like Photoshop, GIMP, Autodesk Maya, LibreOffice, etc. all use OpenCL to speed themselves up.
If the goal is to run Stan on dedicated computers with (multiple) NVIDIA GPUs, then using CUDA (or libraries that use CUDA) would probably be the way to go. If the goal is to run Stan faster on a wider range of desktop computers with AMD/NVIDIA/Intel GPUs, then OpenCL would be the way to go. OpenCL also supports other accelerators like the Xeon Phi and is also targeting FPGAs.
As far as double-precision arithmetic goes, I am a bit confused, as both OpenCL and CUDA support double precision. We were discussing single precision only in the context of even bigger performance gains, as regular GPUs tend to have twice as many single-precision computing units as double-precision ones.
If you decide to go with mxnet, can we still count on your support if we have questions about the base Stan Math code and combining it with GPU code?
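For reference, double precision in OpenCL is an extension that a kernel has to enable explicitly, while single precision works on every device; a minimal illustration (the kernel names are made up):

```cpp
// Illustration only: OpenCL kernel source as it might be embedded in
// C++ host code. The fp64 variant must enable the cl_khr_fp64 device
// extension; the fp32 variant needs no pragma.
static const char* kScaleKernels = R"CLC(
// double-precision variant: requires the cl_khr_fp64 device extension
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void scale_fp64(__global double* x, const double a) {
  x[get_global_id(0)] *= a;
}

// single-precision variant: supported everywhere, and consumer GPUs
// typically have at least twice the fp32 throughput of fp64
__kernel void scale_fp32(__global float* x, const float a) {
  x[get_global_id(0)] *= a;
}
)CLC";
```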
First, I was just verifying that we’d need external GPU libraries, too. Those external libs can be a huge pain in R, which won’t let us bundle them with RStan due to size unless they’re very small. Having multiple such external libs makes installation painful, and it’s already a big enough pain point with Stan.
For 32-bit vs. 64-bit: TensorFlow is concentrating primarily on 32-bit arithmetic, from what I can see of Edward and elsewhere and from asking around. Mxnet is concentrating primarily on 64-bit, which is why it’s relevant: we want to be able to do things like sparse Cholesky factorization efficiently, which is challenging if not impossible in single precision.
We’re only talking to the mxnet folks so far about adding sparse matrix functionality. It sounds like it’ll be orthogonal to whatever we do with you guys. But if they use CUDA and you use OpenCL, it’ll add yet another dependency and probably restrict us to using either sparse or dense operations and not mixing them if that’s even possible.
Everyone keeps telling us all these dependencies are simple, but they’ve proven to be a huge pain for us to manage through R and Python. I don’t know that we’ll even try to get GPUs working through anything other than our CmdStan interface on Linux.
Hopefully everyone like you who knows more about this than I do will be in on any decision to consolidate efforts, but that’s a long way off.
No promises on any long-term support. We just don’t have the staff to make those kinds of commitments. You’re going to be the expert in Stan Math and GPU code, so I don’t know what you’re expecting from the other Stan devs here. We will continue to answer questions about the math lib for everyone.
I bought AMD GPUs out of distaste for proprietary CUDA, which only runs on one manufacturer’s hardware, and more broadly because of how restrictive developing a code base in that language becomes.
Although there are tools that make porting CUDA easier, e.g. HIP.
As for deep learning, at the consumer level AMD Vega GPUs offer excellent Float16 performance. I believe NVIDIA only offers that for their professional GPUs that cost several times more per teraflop. Although I’m more interested in Float32, where both do well.
As for 64-bit: isn’t it only an extremely small subset of GPUs (i.e., NVIDIA’s Tesla series) that performs well at all?
The requirements are:

- a C++11 compiler (supporting at least -std=c++0x)
- the OpenCL shared library (provided by an SDK such as AMD’s or NVIDIA’s)
- the OpenCL headers, including the C++ header file (provided by Khronos if not by the SDK)
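As a concrete check of exactly those three dependencies, a minimal probe like the following compiles with a C++11 compiler against the Khronos C++ header and links against the OpenCL shared library (e.g. `g++ -std=c++11 probe.cpp -lOpenCL`); it just lists the available platforms and devices:

```cpp
#include <CL/cl.hpp>  // Khronos C++ bindings (C++ header file)
#include <iostream>
#include <vector>

int main() {
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);  // enumerate installed OpenCL platforms
  for (const auto& p : platforms) {
    std::vector<cl::Device> devices;
    p.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    for (const auto& d : devices)
      std::cout << p.getInfo<CL_PLATFORM_NAME>() << ": "
                << d.getInfo<CL_DEVICE_NAME>() << '\n';
  }
}
```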
If the user doesn’t have OpenCL installed locally, they get a compiler error.
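One common way to soften that is to compile the OpenCL path only when the user explicitly opts in; a sketch, where the STAN_OPENCL macro name is my assumption rather than anything agreed on:

```cpp
// Sketch: keep the OpenCL dependency optional behind a build flag
// (STAN_OPENCL is a hypothetical name), so machines without OpenCL
// still build the CPU-only path without errors.
#ifdef STAN_OPENCL
#include <CL/cl.hpp>  // only opted-in builds ever see this include
#endif

inline bool gpu_enabled() {
#ifdef STAN_OPENCL
  return true;   // OpenCL path compiled in
#else
  return false;  // plain CPU build; no OpenCL installation required
#endif
}
```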
It may be, but that’s what we need. The kinds of matrix calculations we’re doing are barely stable with 64 bits, and aren’t stable enough with 32 bits.
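To illustrate the precision point, here is a small sketch (assuming Eigen, which we already ship via RcppEigen): Cholesky-factoring an ill-conditioned Hilbert matrix typically fails outright in single precision while double precision still gets through:

```cpp
// The Hilbert matrix H(i,j) = 1/(i+j+1) is symmetric positive-definite
// but ill-conditioned (cond ~ 1.6e13 at n = 10), so its Cholesky
// factorization typically breaks down in float but succeeds in double.
#include <Eigen/Dense>
#include <iostream>

template <typename T>
const char* cholesky_status(int n) {
  Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> H(n, n);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      H(i, j) = T(1) / T(i + j + 1);
  Eigen::LLT<Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic> > llt(H);
  return llt.info() == Eigen::Success ? "ok" : "failed";
}

int main() {
  std::cout << "float:  " << cholesky_status<float>(10) << '\n'   // typically "failed"
            << "double: " << cholesky_status<double>(10) << '\n'; // "ok"
}
```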