Cmdstan samples extremely slowly with GPU

When using brms with OpenCL acceleration, you will only see a benefit if brms generates Stan code that can actually make use of it. Stan has a categorical_logit_glm distribution, which can be GPU-accelerated. However, brms generates code that uses the categorical_logit distribution, which is not GPU-accelerated:

library(brms)

tmp_data <- data.frame(outcome = sample(1:4, 10, replace = TRUE),
                       pred = rnorm(10))

make_stancode(outcome ~ pred,
              data = tmp_data,
              family = categorical("logit"),
              backend = "cmdstanr",
              opencl = opencl(c(0,0)))

Produces:

...
    for (n in 1 : N) {
      target += categorical_logit_lpmf(Y[n] | mu[n]);
    } 

This is because the categorical_logit_glm distribution is not available in the current version of rstan, and brms has to remain compatible with both rstan and cmdstanr. Note that this limitation is also mentioned in the brms::opencl documentation:
"Only some Stan functions can be run on a GPU at this point and so a lot of brms models won't benefit from OpenCL for now."

If you’re going to be working with very large datasets that require days of computation time, you are probably better off writing the Stan code yourself (through cmdstanr or similar) and tuning/optimising it as needed, because brms has to generate code for maximum flexibility and compatibility rather than for speed and efficiency.
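As a rough illustration of that route, here is a minimal sketch of a hand-written model that uses categorical_logit_glm and is compiled with OpenCL through cmdstanr. It is not what brms generates: the parameter names (alpha_free, beta_free), the normal(0, 5) priors, and fixing the first category to zero as the reference are all my own assumptions for the example. It also assumes a working OpenCL setup with platform 0 and device 0, as in the opencl(c(0,0)) call above.

```r
library(cmdstanr)

# Hand-written Stan program (illustrative sketch, not brms output).
stan_code <- "
data {
  int<lower=1> N;                      // number of observations
  int<lower=2> C;                      // number of outcome categories
  int<lower=1> K;                      // number of predictors
  array[N] int<lower=1, upper=C> Y;    // outcome
  matrix[N, K] X;                      // design matrix, no intercept column
}
parameters {
  vector[C - 1] alpha_free;            // intercepts for categories 2..C
  matrix[K, C - 1] beta_free;          // slopes for categories 2..C
}
transformed parameters {
  // Assumption: category 1 is the reference, fixed to zero.
  vector[C] alpha = append_row(0, alpha_free);
  matrix[K, C] beta = append_col(rep_vector(0, K), beta_free);
}
model {
  // Placeholder priors, chosen only for this example.
  alpha_free ~ normal(0, 5);
  to_vector(beta_free) ~ normal(0, 5);
  // Vectorised GLM likelihood, replacing the loop over categorical_logit.
  target += categorical_logit_glm_lpmf(Y | X, alpha, beta);
}
"

# Compile with OpenCL support enabled.
stan_file <- write_stan_file(stan_code)
mod <- cmdstan_model(stan_file, cpp_options = list(stan_opencl = TRUE))

# Same toy data as above, reshaped into the list the Stan program expects.
tmp_data <- data.frame(outcome = sample(1:4, 10, replace = TRUE),
                       pred = rnorm(10))
stan_data <- list(N = nrow(tmp_data),
                  C = 4,
                  K = 1,
                  Y = tmp_data$outcome,
                  X = matrix(tmp_data$pred, ncol = 1))

# opencl_ids = c(platform id, device id)
fit <- mod$sample(data = stan_data,
                  opencl_ids = c(0, 0),
                  chains = 4,
                  parallel_chains = 4)
```

With only 10 rows this will not be any faster on a GPU. It only shows the shape of the code; any speed-up would be expected to show up with much larger N and K.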

Note that this discussion has strayed from the original topic, so I’d recommend opening a new topic if you have any more questions.
