Managing memory with OpenCL

After some fantastic guidance from @andrjohns, I have been experimenting with re-fitting some previous models using OpenCL and/or reduce_sum.

When I use OpenCL (stan_opencl = TRUE), I have routinely been running out of memory (64gb available) resulting in a system crash.

The models I have been fitting aren’t particularly large (4 chains, 1000 warmup, 2000 sampling). Is this just a consequence of transferring between CPU and GPU?

Is there a way to structure my workflow to permit GPU use or do I simply need more RAM?

Hi,

what is the error you are seing? Is the memory running out on the CPU or is it reported by OpenCL?

It’s not a reported error, per se.

I’m on a Ubuntu system and have a system monitor running. As the model progresses, I see my memory use gradually increase to 100% then swap space begins to fill and the system freezes due to no available memory.

Hm, are you able to share an example with the model and fake data? The combination of reduce_sum and OpenCL has not really been used much.

Sure!

This is without reduce_sum. Just using stan_opencl = TRUE, 4 chains and 4 cores.

Does the same thing happen with 1 chain?

Here’s a reprex:
Original data:
Sub_Data.csv (339.3 KB)

Stan data:
Data_dput.txt (2.2 MB)

# Load data
CmdStan_Data <- dget(file = "Data_dput.txt")
# Define model
CmdStan_Model <- 
  "functions {
  /* compute correlated group-level effects
  * Args: 
    *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
    *   matrix of scaled group-level effects
  */ 
    matrix scale_r_cor(matrix z, vector SD, matrix L) {
      // r is stored in another dimension order than z
      return transpose(diag_pre_multiply(SD, L) * z);
    }
}
data {
  int<lower=1> N;  // total number of observations
  real Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  vector[N] Z_1_3;
  vector[N] Z_1_4;
  vector[N] Z_1_5;
  vector[N] Z_1_6;
  vector[N] Z_1_7;
  vector[N] Z_1_8;
  vector[N] Z_1_9;
  int<lower=1> NC_1;  // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_1;
  vector[N] Z_2_2;
  vector[N] Z_2_3;
  vector[N] Z_2_4;
  vector[N] Z_2_5;
  vector[N] Z_2_6;
  vector[N] Z_2_7;
  vector[N] Z_2_8;
  vector[N] Z_2_9;
  int<lower=1> NC_2;  // number of group-level correlations
  // data for group-level effects of ID 3
  int<lower=1> N_3;  // number of grouping levels
  int<lower=1> M_3;  // number of coefficients per level
  int<lower=1> J_3[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_3_1;
  vector[N] Z_3_2;
  vector[N] Z_3_3;
  vector[N] Z_3_4;
  vector[N] Z_3_5;
  vector[N] Z_3_6;
  vector[N] Z_3_7;
  vector[N] Z_3_8;
  vector[N] Z_3_9;
  int<lower=1> NC_3;  // number of group-level correlations
  // data for group-level effects of ID 4
  int<lower=1> N_4;  // number of grouping levels
  int<lower=1> M_4;  // number of coefficients per level
  int<lower=1> J_4[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_4_1;
  // data for group-level effects of ID 5
  int<lower=1> N_5;  // number of grouping levels
  int<lower=1> M_5;  // number of coefficients per level
  int<lower=1> J_5[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_5_sigma_1;
  // data for group-level effects of ID 6
  int<lower=1> N_6;  // number of grouping levels
  int<lower=1> M_6;  // number of coefficients per level
  int<lower=1> J_6[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_6_nu_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  matrix[N,9] Z_1 = [Z_1_1', Z_1_2', Z_1_3', Z_1_4', Z_1_5',
                     Z_1_6', Z_1_7', Z_1_8', Z_1_9']';
  matrix[N,9] Z_2 = [Z_2_1', Z_2_2', Z_2_3', Z_2_4', Z_2_5',
                     Z_2_6', Z_2_7', Z_2_8', Z_2_9']';
  matrix[N,9] Z_3 = [Z_3_1', Z_3_2', Z_3_3', Z_3_4', Z_3_5',
                     Z_3_6', Z_3_7', Z_3_8', Z_3_9']';
  int b_index[37] = linspaced_int_array(37, 2, 38);
}
parameters {
  vector[K] b;  // population-level effects
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  matrix[M_1, N_1] z_1;  // standardized group-level effects
  cholesky_factor_corr[M_1] L_1;  // cholesky factor of correlation matrix
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  matrix[M_2, N_2] z_2;  // standardized group-level effects
  cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix
  vector<lower=0>[M_3] sd_3;  // group-level standard deviations
  matrix[M_3, N_3] z_3;  // standardized group-level effects
  cholesky_factor_corr[M_3] L_3;  // cholesky factor of correlation matrix
  vector<lower=0>[M_4] sd_4;  // group-level standard deviations
  vector[N_4] z_4[M_4];  // standardized group-level effects
  vector<lower=0>[M_5] sd_5;  // group-level standard deviations
  vector[N_5] z_5[M_5];  // standardized group-level effects
  vector<lower=0>[M_6] sd_6;  // group-level standard deviations
  vector[N_6] z_6[M_6];  // standardized group-level effects
}
transformed parameters {
  // actual group-level effects
  matrix[N_1, M_1] r_1 = scale_r_cor(z_1, sd_1, L_1);
  matrix[N_2, M_2] r_2 = scale_r_cor(z_2, sd_2, L_2);
  matrix[N_3, M_3] r_3 = scale_r_cor(z_3, sd_3, L_3); 

  vector[N_4] r_4_1 = (sd_4[1] * (z_4[1]));
  vector[N_5] r_5_sigma_1 = (sd_5[1] * (z_5[1]));
  vector[N_6] r_6_nu_1 = (sd_6[1] * (z_6[1]));
}
model {
  int grainsize = 1;
  // likelihood not including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = X * b + rows_dot_product(r_1[J_1], Z_1)
                         + rows_dot_product(r_2[J_2], Z_2)
                         + rows_dot_product(r_3[J_3], Z_3)
                         + r_4_1[J_4] .* Z_4_1;
    // initialize linear predictor term
    vector[N] sigma = exp(r_5_sigma_1[J_5] .* Z_5_sigma_1);
    // initialize linear predictor term
    vector[N] nu = exp(r_6_nu_1[J_6] .* Z_6_nu_1) + 1;

    target += student_t_lpdf(Y | nu, mu, sigma);
  }
  // priors not including constants
  b[1] ~ normal(5,5);
  b[b_index] ~ normal(0,3);
  sd_1 ~ normal(1,1);
  sd_2 ~ normal(1,1);
  sd_3 ~ normal(0,0.5);
  sd_4[1] ~ normal(0,0.5);
  sd_5[1] ~ normal(0,0.1);
  sd_6[1] ~ normal(0,1.5);
  to_vector(z_1) ~ std_normal();
  to_vector(z_2) ~ std_normal();
  to_vector(z_3) ~ std_normal();
  z_4[1] ~ std_normal();
  z_5[1] ~ std_normal();
  z_6[1] ~ std_normal();
  L_1 ~ lkj_corr_cholesky(1);
  L_2 ~ lkj_corr_cholesky(1);
  L_3 ~ lkj_corr_cholesky(1);
}
generated quantities {
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1,upper=1>[NC_1] cor_1;
  // compute group-level correlations
  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1,upper=1>[NC_2] cor_2;
  // compute group-level correlations
  corr_matrix[M_3] Cor_3 = multiply_lower_tri_self_transpose(L_3);
  vector<lower=-1,upper=1>[NC_3] cor_3;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_1) {
    for (j in 1:(k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
  // extract upper diagonal of correlation matrix
  for (k in 1:M_2) {
    for (j in 1:(k - 1)) {
      cor_2[choose(k - 1, 2) + j] = Cor_2[j, k];
    }
  }
  // extract upper diagonal of correlation matrix
  for (k in 1:M_3) {
    for (j in 1:(k - 1)) {
      cor_3[choose(k - 1, 2) + j] = Cor_3[j, k];
    }
  }
}
"
# Set compile options
acc_options = list(
  stan_opencl = TRUE,
  opencl_platform_id = 0, 
  opencl_device_id = 0
)

# Compile model
CmdStan_Mod <- cmdstan_model(write_stan_file(CmdStan_Model), 
                                  compile = TRUE,
                                  cpp_options = acc_options)
CmdStan_Fit <- CmdStan_Mod$sample(
        data = CmdStan_Data, 
        iter_warmup = 1000,            
        iter_sampling = 2000,  
        chains = 4,           
        max_treedepth=12, 
        adapt_delta=0.9)

It does also occur with 1 chain, but more slowly. Memory consumption climbs about .1% every second in each chain and will keep climbing until memory is full.

I’m still seeing this behaviour after upgrading to the latest point release of CmdStan.

Could this be a memory leak or is this expected behaviour?

This does look like a leak, but cant really say for certaing. Added to my todo items, will get back to you, hopefully early next week.

1 Like

To provide another data point, I’m seeing the same behaviour using stan_opencl = TRUE with a Beta model.

I’ll try to take a look at this during the week. @rok_cesnovar we can use something like heaptracker to try to sort out where this is going on. My guess is that we are allocating something on the stack allocator that needs to have it’s destructor called. Would the clang mem analyzer work for OpenCL stuff?

Looking at that model, where is the OpenCL stuff used at?

1 Like

My apologies if this isn’t helpful; I’m nowhere near being a developer.

I think dot products when target += student_t_lpdf(Y | nu, mu, sigma); is called move to the GPU.

In my most recent example, I call target += beta_lpdf(Y | mu * phi, (1 - mu) * phi);. There is no dot product calculation and the GPU usage is low, but memory consumption does the same gradual creep.

Oh yes sorry! That wasn’t calling you out I was asking Rok where the GPU stuff was being called here because I couldn’t remember which parts had access to it. This example is very helpful and useful!

Right now, in 2.26 its only used for _lpdf/_lpmf calls, so in the above case only for the student_t_lpdf. Data is transferred in the constructor, and parameters are to_matrix_cl’ed every iteration in the function call.

Okay then I think the issue might be in the matrix_cl<> edge for operands_and_partials. I think we just need to change the line to using partials_t = arena_t<plain_type_t<Op>>;

I think this could be a bug that was fixed after the release in Template from matrix cl by t4c1 · Pull Request #2345 · stan-dev/math · GitHub. @JLC Can you try to run the model with develop version of Stan Math to check if the issue persists?

It looks like the issue is still there.

I am on the develop branch of cmdstan 2.26.1

and pulled the latest develop branch of stan_math:

cd cmdstan/stan/lib/stan_math
git checkout develop

Then rebuilt cmdstan

make -j8 build

Re-ran the model and still see the memory consumption ticking up steadily.

Hi all,

Is there any other information I can provide or debugging I can attempt to assist with this issue?

Sorry, we havent posted here, but the leak reason was found and will be fixe shortly: Clear read/write event for the global matrix_cls · Issue #857 · stan-dev/stanc3 · GitHub

Thanks for your report!