Brms: How do I confirm that the GPU is working for speedup by opencl？

Jacob_stanlearner · August 16, 2021, 12:42am

I want to speed up my model by using the opencl function in the dev version of brms. Followed the suggestions in the forum, I can now successfully run this new function. Thanks!

However, I found that the effect of GPU acceleration is not very obvious. When I use “within chain parallelization”, I can obviously feel that the CPU is working and the speed of computing is also accelerated.

My questions are：

How do I know the GPU is working for speedup? At present, I mainly detect the GPU through “Task Manager” Windows 10. I found that when I run the model, the GPU only works by 5%~10%. Is this normal?
Is it true that the “negbinomial model” cannot be accelerated through GPU at present? so the effect is not obvious?
My original dataset is in dataframe format (Rstudio). Do I need some conversion of my data format first to get the acceleration effect?

My code is as below:

fit_serial <- brm(value ~ s(hour,by=period)+weekend+offset(n)+(1|id)+(0+hour|id), 
                  family=negbinomial,
                  data = data, 
                  warmup = 1000, iter = 1200, 
                  cores=2,chains = 2, opencl = opencl(c(0, 0)),
                  backend="cmdstanr")

Operating System: Win 10 pro
brms Version: 2.15.9

Hope to get some advice on this issue! Thanks again!

Best,
Jacob

Jacob_stanlearner · August 16, 2021, 12:22pm

Could I use the “within chain parallelization” and the OpenCL (GPU) together？

rok_cesnovar · August 16, 2021, 12:29pm

It can be the case that most of the computation is spent in other parts of the model that are currently not sped up with GPUs.

Please run

stan_code <- make_stancode(value ~ s(hour,by=period)+weekend+offset(n)+(1|id)+(0+hour|id), 
                  family=negbinomial,
                  data = data, 
                  warmup = 1000, iter = 1200, 
                  cores=2,chains = 2, opencl = opencl(c(0, 0)),
                  backend="cmdstanr")

and paste the stan_code generated.

neg_binomial lpdf Stan functions are GPU supported, but how much speedup can currenly be obtained can vary depending on the model.

If brms accepts it, it doesnt matter.

rok_cesnovar · August 16, 2021, 12:29pm

Not at the moment.

Jacob_stanlearner · August 16, 2021, 12:43pm

Thanks for all your replies!
I just run the following code:

stan_code <- make_stancode(value ~ s(hour,by=period)+weekend+offset(n)+(1|id)+(0+hour|id), 
                           family=negbinomial,
                           data = data, 
                           warmup = 1000, iter = 1200, 
                           cores=2,chains = 2, opencl = opencl(c(0, 0)),
                           backend="cmdstanr")

data <- make_standata(value ~ s(hour,by=period)+weekend+offset(n)+(1|id)+(0+hour|id), 
                      family=negbinomial,
                      data = data, 
                      warmup = 1000, iter = 1200, 
                      cores=2,chains = 2, opencl = opencl(c(0, 0)),
                      backend="cmdstanr")
class(data) <- NULL

file_cl <- write_stan_file(stan_code , hash_salt = "opencl")
mod_cl <- cmdstan_model(file_cl, cpp_options = list(stan_opencl=TRUE))

fit_cl <- mod_cl$sample(data = data2, seed = 123, chains = 4, parallel_chains = 4, opencl_ids = c(0,0))

Then the model can run, in this way, I can get a better effect of GPU speedup than just run a brms style code, right?

And, the new model also return some warning messages, like this:

Chain 3 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 3 Exception: neg_binomial_2_log_glm_lpmf(OpenCL): Precision parameter = 0, but it must be positive finite! (in 'C:/Users/ADMINI~1/AppData/Local/Temp/RtmpaqUfqo/model-3f3029814f8.stan', line 87, column 4 to column 64)
Chain 3 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 3 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

Could I ask for more suggestions for this issue? Thanks!

Jacob_stanlearner · August 16, 2021, 2:34pm

I found another issue:
When I run an example from a post in the forum, see this one:

library(brms)
library(cmdstanr)

n <- 300000
k <- 10
X <- matrix(rnorm(n * k), ncol = k)
y <- rbinom(n, size = 1, prob = plogis(10 * X[,1] + 9 * X[,2] + 8 * X[,3] + 7 * X[,4] + 6 * X[,5] + 5 * X[,6] + 4 * X[,7] + 3 * X[,8] + 2 * X[,9] + 2 * X[,10] + 1))

code <- make_stancode(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10 , data = data.frame(y, X), family = bernoulli(), refresh = 0)
data <- make_standata(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10, data = data.frame(y, X), family = bernoulli(), refresh = 0)
class(data) <- NULL

file_cpu <- write_stan_file(code, hash_salt = "cpu")
file_cl <- write_stan_file(code, hash_salt = "opencl")

mod_cpu <- cmdstan_model(file_cpu)
mod_cl <- cmdstan_model(file_cl, cpp_options = list(stan_opencl=TRUE))

fit_cl <- mod_cl$sample(data = data, seed = 123, chains = 4, parallel_chains = 4, opencl_ids = c(0,0))
fit_cpu <- mod_cpu$sample(data = data, seed = 123, chains = 4, parallel_chains = 4)

The speed of “fit_cl” is almost 5 times of “fit_cpu”. More important, I can see that 70% of my GPU is working, which seems to be normal!

However, when I run my real model, the model converted from brms through make_stancode function (see the above reply). I just got the following warnings several times:

Chain 2 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 2 Exception: neg_binomial_2_log_glm_lpmf(OpenCL): Precision parameter = 0, but it must be positive finite! (in 'C:/Users/ADMINI~1/AppData/Local/Temp/RtmpaqUfqo/model-3f3029814f8.stan', line 87, column 4 to column 64)
Chain 2 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 2 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.

And, only 5%-10% GPU is working, which seems to be a little bit strange…Is this because of some issues related to my misuse of brms’ make_stancode function? Maybe I need to find a way to transfer my original dataset to matrix?

Best,
Jacob

rok_cesnovar · August 16, 2021, 7:03pm

No no, this is exactly how brms runs it. I was thinking if you can paste the actual model generated by brms, and would provide the sizes of the vectors. That would give me a better idea of what you are seeing is expected. So just paste the contents of the stan_code variable

Jacob_stanlearner · August 16, 2021, 10:35pm

Copy that！
See the contents of the stan_code variable here:

// generated with brms 2.15.9
functions {
}
data {
  int<lower=1> N;  // total number of observations
  int Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for splines
  int Ks;  // number of linear effects
  matrix[N, Ks] Xs;  // design matrix for the linear effects
  // data for spline s(hour, by = period)0
  int nb_1;  // number of bases
  int knots_1[nb_1];  // number of knots
  // basis function matrices
  matrix[N, knots_1[1]] Zs_1_1;
  // data for spline s(hour, by = period)1
  int nb_2;  // number of bases
  int knots_2[nb_2];  // number of knots
  // basis function matrices
  matrix[N, knots_2[1]] Zs_2_1;
  vector[N] offsets;
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  vector[Ks] bs;  // spline coefficients
  // parameters for spline s(hour, by = period)0
  // standarized spline coefficients
  vector[knots_1[1]] zs_1_1;
  real<lower=0> sds_1_1;  // standard deviations of spline coefficients
  // parameters for spline s(hour, by = period)1
  // standarized spline coefficients
  vector[knots_2[1]] zs_2_1;
  real<lower=0> sds_2_1;  // standard deviations of spline coefficients
  real<lower=0> shape;  // shape parameter
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  vector[N_2] z_2[M_2];  // standardized group-level effects
}
transformed parameters {
  // actual spline coefficients
  vector[knots_1[1]] s_1_1;
  // actual spline coefficients
  vector[knots_2[1]] s_2_1;
  vector[N_1] r_1_1;  // actual group-level effects
  vector[N_2] r_2_1;  // actual group-level effects
  // compute actual spline coefficients
  s_1_1 = sds_1_1 * zs_1_1;
  // compute actual spline coefficients
  s_2_1 = sds_2_1 * zs_2_1;
  r_1_1 = (sd_1[1] * (z_1[1]));
  r_2_1 = (sd_2[1] * (z_2[1]));
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + rep_vector(0.0, N) + Xs * bs + Zs_1_1 * s_1_1 + Zs_2_1 * s_2_1 + offsets;
    for (n in 1:N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n];
    }
    target += neg_binomial_2_log_glm_lpmf(Y | Xc, mu, b, shape);
  }
  // priors including constants
  target += student_t_lpdf(Intercept | 3, -7.53790018974189, 2.5);
  target += student_t_lpdf(sds_1_1 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(zs_1_1);
  target += student_t_lpdf(sds_2_1 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(zs_2_1);
  target += gamma_lpdf(shape | 0.01, 0.01);
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(z_1[1]);
  target += student_t_lpdf(sd_2 | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(z_2[1]);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

Thank you~

Topic		Replies	Views
Using the Apple M1 GPUs: question from a noob brms gpu , splines	46	3883	November 18, 2023
Brms: Error in opencl(c(0, 0)) : could not find function "opencl" brms	4	690	August 15, 2021
OpenCL and brms brms cmdstan , cmdstanr	5	888	May 18, 2021
Brms seems to omit option "cores" while fitting model brms fitting-issues , paralellization	1	1090	April 29, 2023
brms：OpenCL function on MacOS Catalina brms mac , gpu	2	966	November 13, 2023

Brms: How do I confirm that the GPU is working for speedup by opencl？

Related topics