We have a PR in review for Cholesky decomposition(Thanks @rok_cesnovar and @Stevo15025!). There’s one thing that probably needs to be broadcast here - we’re thinking of overloading the existing
stan::math::cholesky_decompose function such that it will run on the GPU if you have the CXXFLAG for GPU set. This happens here: https://github.com/stan-dev/math/pull/1059#diff-4a36159f39687b55fb93027358ecd455R32
Here are some things to consider for this API:
- User interface - we think having a single control point for GPU (adding CXXFLAGS+=-DSTAN_CL` to make/local) is probably the best place to start, though eventually we’d like to add an OpenCL annotation or type to the Stan language itself so that the user has fine-grained oversight what exactly is executed on the GPU.
- Performance - Another PR adds a runtime pointer to Eigen matrices to the cl::Buffer on the GPU when the OpenCL compile flag is on. This means we can avoid extra copies for data matrices (e.g. currently once per leapfrog step). The opencl version of
choleksy_decompose has a rough heuristic for matrix size below which it will just execute the CPU version.
The other favorite was to add a
cholesky_decompose_gpu, but we think this has one significant disadvantage. The majority of time in GPU computation is often spent sending data to and from the GPU, so if you only allow explicit GPU functions, we won’t have the opportunity to use GPU versions when available. With the style of overloading above, we can stay in GPU computation land as long as possible before copying back to the CPU. With both versions, the result is copied back to CPU memory, but with this overloaded version we also keep a pointer to the buffer on the GPU.
We’ve discussed this a few times but I’m not sure if we ever explicitly aired it out on the discourse and had a conclusion, so I wanted to bring it up here. Any additional last-minute considerations welcome.