Now that rstan is using Stan version 2.19.1, does that mean some of the new GPU-enabled functions are available in a properly configured rstan? That should make them available to interfaces like brms if rstan is set up correctly, right? The documentation for the Stan GPU routines currently doesn't have any information about how to set things up for rstan specifically. Where is the rstan equivalent of the place where I need to put:
I think that is the correct syntax for your ~/.R/Makevars.win, but the GPU only comes on if you are taking the Cholesky factorization of a matrix that is 1200 x 1200 or bigger. And I don't think (m)any brms models are even doing Cholesky factorizations.
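For reference, here is a sketch of what that Makevars.win might contain, based on the CmdStan OpenCL flags rather than any official rstan documentation (the platform/device IDs of 0 assume a single-GPU machine, and on Windows you may also need an -L path pointing at your OpenCL SDK's library directory):

    # ~/.R/Makevars.win -- unverified sketch of OpenCL flags for rstan 2.19
    CXX14FLAGS += -DSTAN_OPENCL -DOPENCL_PLATFORM_ID=0 -DOPENCL_DEVICE_ID=0
    LDFLAGS += -lOpenCL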
Gaussian process models (with the gp() function) will make use of it. Otherwise, Cholesky factorization within the sampling process is not done explicitly for brms models (but it may happen inside Stan itself behind the scenes).
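For example, a model along these lines would exercise that Cholesky factorization (a sketch; the data frame df and the variables x and y are hypothetical):

    library(brms)
    # gp(x) builds an n-by-n kernel matrix that is Cholesky-factorized,
    # so with roughly 1250+ observations it can cross the GPU size
    # threshold discussed below.
    fit <- brm(y ~ gp(x), data = df, chains = 4)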
Do you mean there are no speed-ups before 1200 x 1200?! Or is it some kind of hard-coded threshold, like: if ncol(X) > 1200 then GPU, else CPU? (Weird pseudocode, haha.)
In cholesky_decompose.hpp there’s a check that happens if GPUs are available:
if (m.rows() >= opencl_context.tuning_opts().cholesky_size_worth_transfer)
in which case the operation is done on the GPU. That cholesky_size_worth_transfer is defined as 1250 (I don't know if it can be controlled from some OpenCL configuration file) as a compromise between the cost of transferring data to/from the GPU and the speed-up gained by using the GPU: for smaller matrices the transfer costs exceed the gains, so the computation is left on the CPU. I couldn't find the PR in which that value was chosen, but I'm pretty sure there was some testing behind it.
The big bummer with GPUs in general is that the cost of transferring data to and from the GPU is very high. For Cholesky, every iteration we need to pass the values and adjoints of the matrix that holds Stan's autodiff class variable var. Though note that for the GLM methods we have some tricks in the compiler so they go fast for much smaller problems. We set that threshold when we first started writing the GPU code, but a lot of performance improvements have happened since then, so we should probably go back and check whether the break-even point is lower now. Though I wouldn't expect it to be more than 1000 or so.
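Back-of-the-envelope, assuming 8-byte doubles, that round trip is sizable at the current threshold:

    n <- 1250            # matrix size at the current cutoff
    2 * n^2 * 8 / 1e6    # values in plus adjoints back: ~25 MB per factorization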