Managing memory with OpenCL

JLC · February 17, 2021, 4:53pm

After some fantastic guidance from @andrjohns, I have been experimenting with re-fitting some previous models using OpenCL and/or reduce_sum.

When I use OpenCL (stan_opencl = TRUE), I have routinely been running out of memory (64gb available) resulting in a system crash.

The models I have been fitting aren’t particularly large (4 chains, 1000 warmup, 2000 sampling). Is this just a consequence of transferring between CPU and GPU?

Is there a way to structure my workflow to permit GPU use or do I simply need more RAM?

rok_cesnovar · February 17, 2021, 4:58pm

Hi,

what is the error you are seing? Is the memory running out on the CPU or is it reported by OpenCL?

JLC · February 17, 2021, 5:01pm

It’s not a reported error, per se.

I’m on a Ubuntu system and have a system monitor running. As the model progresses, I see my memory use gradually increase to 100% then swap space begins to fill and the system freezes due to no available memory.

rok_cesnovar · February 17, 2021, 5:05pm

Hm, are you able to share an example with the model and fake data? The combination of reduce_sum and OpenCL has not really been used much.

JLC · February 17, 2021, 5:15pm

Sure!

This is without reduce_sum. Just using stan_opencl = TRUE, 4 chains and 4 cores.

rok_cesnovar · February 17, 2021, 5:35pm

Does the same thing happen with 1 chain?

JLC · February 18, 2021, 12:09am

Here’s a reprex:
Original data:
Sub_Data.csv (339.3 KB)

Stan data:
Data_dput.txt (2.2 MB)

# Load data
CmdStan_Data <- dget(file = "Data_dput.txt")

# Define model
CmdStan_Model <- 
  "functions {
  /* compute correlated group-level effects
  * Args: 
    *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
    *   matrix of scaled group-level effects
  */ 
    matrix scale_r_cor(matrix z, vector SD, matrix L) {
      // r is stored in another dimension order than z
      return transpose(diag_pre_multiply(SD, L) * z);
    }
}
data {
  int<lower=1> N;  // total number of observations
  real Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  vector[N] Z_1_5;
  vector[N] Z_1_6;
  vector[N] Z_1_7;
  vector[N] Z_1_8;
  vector[N] Z_1_9;
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_1;
  vector[N] Z_2_2;
  vector[N] Z_2_3;
  vector[N] Z_2_4;
  vector[N] Z_2_5;
  vector[N] Z_2_6;
  vector[N] Z_2_7;
  vector[N] Z_2_8;
  vector[N] Z_2_9;
  int<lower=1> NC_2;  // number of group-level correlations
  // data for group-level effects of ID 3
  int<lower=1> N_3;  // number of grouping levels
  int<lower=1> M_3;  // number of coefficients per level
  int<lower=1> J_3[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_3_1;
  vector[N] Z_3_2;
  vector[N] Z_3_3;
  vector[N] Z_3_4;
  vector[N] Z_3_5;
  vector[N] Z_3_6;
  vector[N] Z_3_7;
  vector[N] Z_3_8;
  vector[N] Z_3_9;
  int<lower=1> NC_3;  // number of group-level correlations
  // data for group-level effects of ID 4
  int<lower=1> N_4;  // number of grouping levels
  int<lower=1> M_4;  // number of coefficients per level
  int<lower=1> J_4[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_4_1;
  // data for group-level effects of ID 5
  int<lower=1> N_5;  // number of grouping levels
  int<lower=1> M_5;  // number of coefficients per level
  int<lower=1> J_5[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_5_sigma_1;
  // data for group-level effects of ID 6
  int<lower=1> N_6;  // number of grouping levels
  int<lower=1> M_6;  // number of coefficients per level
  int<lower=1> J_6[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_6_nu_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N,9] Z_1 = [Z_1_1', Z_1_2', Z_1_3', Z_1_4', Z_1_5',
                     Z_1_6', Z_1_7', Z_1_8', Z_1_9']';
  matrix[N,9] Z_2 = [Z_2_1', Z_2_2', Z_2_3', Z_2_4', Z_2_5',
                     Z_2_6', Z_2_7', Z_2_8', Z_2_9']';
  matrix[N,9] Z_3 = [Z_3_1', Z_3_2', Z_3_3', Z_3_4', Z_3_5',
                     Z_3_6', Z_3_7', Z_3_8', Z_3_9']';
  int b_index[37] = linspaced_int_array(37, 2, 38);
}
parameters {
  vector[K] b;  // population-level effects
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  matrix[M_2, N_2] z_2;  // standardized group-level effects
  cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix
  vector<lower=0>[M_3] sd_3;  // group-level standard deviations
  matrix[M_3, N_3] z_3;  // standardized group-level effects
  cholesky_factor_corr[M_3] L_3;  // cholesky factor of correlation matrix
  vector<lower=0>[M_4] sd_4;  // group-level standard deviations
  vector[N_4] z_4[M_4];  // standardized group-level effects
  vector<lower=0>[M_5] sd_5;  // group-level standard deviations
  vector[N_5] z_5[M_5];  // standardized group-level effects
  vector<lower=0>[M_6] sd_6;  // group-level standard deviations
  vector[N_6] z_6[M_6];  // standardized group-level effects
}
transformed parameters {
  // actual group-level effects
  matrix[N_1, M_1] r_1 = scale_r_cor(z_1, sd_1, L_1);
  matrix[N_2, M_2] r_2 = scale_r_cor(z_2, sd_2, L_2);
  matrix[N_3, M_3] r_3 = scale_r_cor(z_3, sd_3, L_3); 

  vector[N_4] r_4_1 = (sd_4[1] * (z_4[1]));
  vector[N_5] r_5_sigma_1 = (sd_5[1] * (z_5[1]));
  vector[N_6] r_6_nu_1 = (sd_6[1] * (z_6[1]));
}
model {
  int grainsize = 1;
  // likelihood not including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = X * b + rows_dot_product(r_1[J_1], Z_1)
                         + rows_dot_product(r_2[J_2], Z_2)
                         + rows_dot_product(r_3[J_3], Z_3)
                         + r_4_1[J_4] .* Z_4_1;
    // initialize linear predictor term
    vector[N] sigma = exp(r_5_sigma_1[J_5] .* Z_5_sigma_1);
    // initialize linear predictor term
    vector[N] nu = exp(r_6_nu_1[J_6] .* Z_6_nu_1) + 1;

    target += student_t_lpdf(Y | nu, mu, sigma);
  }
  // priors not including constants
  b[1] ~ normal(5,5);
  b[b_index] ~ normal(0,3);
  sd_1 ~ normal(1,1);
  sd_2 ~ normal(1,1);
  sd_3 ~ normal(0,0.5);
  sd_4[1] ~ normal(0,0.5);
  sd_5[1] ~ normal(0,0.1);
  sd_6[1] ~ normal(0,1.5);
  to_vector(z_1) ~ std_normal();
  to_vector(z_2) ~ std_normal();
  to_vector(z_3) ~ std_normal();
  z_4[1] ~ std_normal();
  z_5[1] ~ std_normal();
  z_6[1] ~ std_normal();
  L_1 ~ lkj_corr_cholesky(1);
  L_2 ~ lkj_corr_cholesky(1);
  L_3 ~ lkj_corr_cholesky(1);
}
generated quantities {
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // compute group-level correlations
  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1,upper=1>[NC_2] cor_2;
  // compute group-level correlations
  corr_matrix[M_3] Cor_3 = multiply_lower_tri_self_transpose(L_3);
  vector<lower=-1,upper=1>[NC_3] cor_3;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
  // extract upper diagonal of correlation matrix
  for (k in 1:M_2) {
    for (j in 1:(k - 1)) {
      cor_2[choose(k - 1, 2) + j] = Cor_2[j, k];
    }
  }
  // extract upper diagonal of correlation matrix
  for (k in 1:M_3) {
    for (j in 1:(k - 1)) {
      cor_3[choose(k - 1, 2) + j] = Cor_3[j, k];
    }
  }
}
"

# Set compile options
acc_options = list(
  stan_opencl = TRUE,
  opencl_platform_id = 0, 
  opencl_device_id = 0
)

# Compile model
CmdStan_Mod <- cmdstan_model(write_stan_file(CmdStan_Model), 
                                  compile = TRUE,
                                  cpp_options = acc_options)

CmdStan_Fit <- CmdStan_Mod$sample(
        data = CmdStan_Data, 
        iter_warmup = 1000,            
        iter_sampling = 2000,  
        chains = 4,           
        max_treedepth=12, 
        adapt_delta=0.9)

JLC · February 18, 2021, 10:10pm

It does also occur with 1 chain, but more slowly. Memory consumption climbs about .1% every second in each chain and will keep climbing until memory is full.

JLC · February 26, 2021, 2:51pm

I’m still seeing this behaviour after upgrading to the latest point release of CmdStan.

Could this be a memory leak or is this expected behaviour?

rok_cesnovar · February 26, 2021, 8:42pm

This does look like a leak, but cant really say for certaing. Added to my todo items, will get back to you, hopefully early next week.

JLC · March 6, 2021, 2:57pm

To provide another data point, I’m seeing the same behaviour using stan_opencl = TRUE with a Beta model.

stevebronder · March 6, 2021, 10:01pm

I’ll try to take a look at this during the week. @rok_cesnovar we can use something like heaptracker to try to sort out where this is going on. My guess is that we are allocating something on the stack allocator that needs to have it’s destructor called. Would the clang mem analyzer work for OpenCL stuff?

Looking at that model, where is the OpenCL stuff used at?

JLC · March 6, 2021, 11:52pm

My apologies if this isn’t helpful; I’m nowhere near being a developer.

I think dot products when target += student_t_lpdf(Y | nu, mu, sigma); is called move to the GPU.

In my most recent example, I call target += beta_lpdf(Y | mu * phi, (1 - mu) * phi);. There is no dot product calculation and the GPU usage is low, but memory consumption does the same gradual creep.

stevebronder · March 7, 2021, 12:05am

Oh yes sorry! That wasn’t calling you out I was asking Rok where the GPU stuff was being called here because I couldn’t remember which parts had access to it. This example is very helpful and useful!

rok_cesnovar · March 8, 2021, 1:49pm

Right now, in 2.26 its only used for _lpdf/_lpmf calls, so in the above case only for the student_t_lpdf. Data is transferred in the constructor, and parameters are to_matrix_cl’ed every iteration in the function call.

stevebronder · March 8, 2021, 5:37pm

Okay then I think the issue might be in the matrix_cl<> edge for operands_and_partials. I think we just need to change the line to using partials_t = arena_t<plain_type_t<Op>>;

tadej · March 9, 2021, 1:25pm

I think this could be a bug that was fixed after the release in Template from matrix cl by t4c1 · Pull Request #2345 · stan-dev/math · GitHub. @JLC Can you try to run the model with develop version of Stan Math to check if the issue persists?

JLC · March 9, 2021, 2:49pm

It looks like the issue is still there.

I am on the develop branch of cmdstan 2.26.1

and pulled the latest develop branch of stan_math:

cd cmdstan/stan/lib/stan_math
git checkout develop

Then rebuilt cmdstan

make -j8 build

Re-ran the model and still see the memory consumption ticking up steadily.

JLC · March 29, 2021, 8:39pm

Hi all,

Is there any other information I can provide or debugging I can attempt to assist with this issue?

rok_cesnovar · March 29, 2021, 8:44pm

Sorry, we havent posted here, but the leak reason was found and will be fixe shortly: Clear read/write event for the global matrix_cls · Issue #857 · stan-dev/stanc3 · GitHub

Thanks for your report!

Topic		Replies	Views
CmdStan OpenCL GPU problems and wiki page Developers	59	1967	January 29, 2020
Memory issues with custom model Modeling windows	5	687	March 10, 2021
OpenCL & threading supported at the same time? Developers	9	1375	March 11, 2022
Running Stan on the GPU with OpenCL on WSL: Seeking Assistance CmdStan linux , techniques , gpu	18	830	February 23, 2025
Issue using external GPU with cmdstanr and OpenCL Modeling fitting-issues	5	468	October 28, 2023

Managing memory with OpenCL

Related topics