Parallelization

Hello everyone,
I’m trying to parallelize my Stan code. I used reduce_sum.

Q1: When running multi-chain, multi-threaded sampling with OpenCL, I see no activity on the GPU… Did I miss something?
Compilation:

model <- cmdstan_model(
  stan_file = './stan/fit.stan',
  force_recompile = TRUE,
  cpp_options = list(stan_opencl = TRUE, stan_threads = TRUE),
  stanc_options = list("O1")
)

Sampling:

fit <- model$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 2,
  opencl_ids = c(0, 0),
  iter_warmup = 20,
  iter_sampling = 20
)

There is no error; I simply don’t see any activity on the GPUs. I tried c(0, 1) and c(0, 2) (as I was not completely sure which processor was actually used by OpenCL), with the same result.

Q2: Is it possible to use MPI with reduce_sum, or is map_rect required (as old posts suggest)? I just wanted to be sure…

  • Operating System: MacBook Pro, macOS Sequoia, AMD Radeon Pro 5500M 4 GB and Intel UHD Graphics 630 1536 MB
  • CmdStan Version: 2.36.0 (with cmdstanr)

Many thanks!


I’m not sure how O1 and use-opencl interact in the Stan compiler (@stevebronder might know better), but I would try removing O1 as a first test. It’s also possible your model isn’t using functions for which we have OpenCL definitions, or isn’t doing enough work per evaluation to meaningfully show up.
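A minimal sketch of that first test, reusing your compile call with O1 dropped (and, assuming a recent cmdstanr, exe_info() to confirm the binary was actually built with threading and OpenCL):

model <- cmdstan_model(
  stan_file = './stan/fit.stan',
  force_recompile = TRUE,
  cpp_options = list(stan_opencl = TRUE, stan_threads = TRUE)
)

# Report the flags the executable was compiled with;
# stan_opencl and stan_threads should both be TRUE
model$exe_info()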

Thank you very much for the answer. I use categorical_lpmf. I tried again without the O1 option, but the result is the same: I see about 200% activity on the CPUs, and nothing on the GPUs.

It appears we do not have any OpenCL code for this function. We do for categorical_logit_glm_lpmf.

Thank you very much for the clarification. And do you know if reduce_sum is compatible with MPI?

I believe our only MPI support is still for map_rect.
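For reference, a rough sketch of how an MPI build of a map_rect model is requested through cmdstanr (option names follow the sample_mpi() documentation; the model file name is a hypothetical map_rect rewrite of your model):

# MPI needs an MPI toolchain (e.g. OpenMPI) and a map_rect-based model
mpi_options <- list(STAN_MPI = TRUE, CXX = "mpicxx", TBB_CXX_TYPE = "gcc")
model_mpi <- cmdstan_model(
  stan_file = './stan/fit_map_rect.stan',  # hypothetical map_rect version
  cpp_options = mpi_options
)

# Launch under mpiexec with 4 processes
fit_mpi <- model_mpi$sample_mpi(data = data_list, mpi_args = list("n" = 4))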

I adapted the code to use categorical_logit_glm_lpmf. Same result. Below is the code. I tried with Device #1: Intel(R) UHD Graphics 630 and Device #2: AMD Radeon Pro 5500M Compute Engine (by the way, which one should I use?), and in both cases GPU usage was 0%. I certainly did something wrong, but cannot figure out what. Any clue would be greatly appreciated!

functions {
  real partial_sum(array[] int slice_n, 
                   int start, int end,
                   array[] int Tsubj, 
                   array[,] int sample, 
                   array[,] int choice,
                   array[,,] int color, 
                   array[,,] real proba, 
                   matrix params, 
                   int T_max, 
                   int I_max) {
    
    real log_prob = 0;
    
    // Loop over the slice of data
    for (n in start:end) {
      int trials_for_subject = Tsubj[n];  // Extract number of trials for subject n
      
      for (t in 1:trials_for_subject) {  // Loop through trials for this subject
        // Skip if sample data is missing/invalid
        if (sample[n,t] < 0) continue;
        
        vector[2] evidence = rep_vector(0.0, 2);
        for (i in 1:sample[n,t]) {  // Loop through samples
          // Skip if color or proba data is missing (mirrors the guard in
          // generated quantities)
          if (color[n, t, i] < 0 || proba[n, t, i] < 0) continue;
          
          real p = proba[n, t, i];  // Probability for the current sample
          real l = log(p / (1 - p));  // Log odds of the current sample
          real log_odds = params[n,1] * l + params[n, 2];
          evidence[color[n, t, i]] += exp(params[n,3]*(i-sample[n,t])) * log_odds;  // Discounted evidence
        }
        
        // Skip if choice data is missing/invalid
        if (choice[n, t] < 0) continue;
        log_prob += categorical_logit_glm_lpmf(
          { choice[n, t] } |        // Outcome array of length 1
          rep_matrix(1, 1, 1),      // Predictor matrix (1x1)
          rep_vector(0, 2),         // Intercept vector of length 2
          to_matrix(evidence')      // Coefficients: logits as a 1x2 matrix
        );

      }
    }
    return log_prob;
  }
}

data {
  int<lower=1> N;                 // Number of subjects
  int<lower=1> T_max;             // Maximum number of trials across all subjects
  int<lower=1> I_max;             // Maximum number of samples across all trials
  array[N] int<lower=1> Tsubj;    // Number of trials per subject
  array[N, T_max] int<lower=-1> sample;   // Number of samples per subj and trial, -1 indicates missing data
  array[N, T_max, I_max] int<lower=-1, upper=2> color; // Info item colors, -1 indicates missing data
  array[N, T_max, I_max] real<lower=-1, upper=1> proba; // Probabilities for info items, -1 indicates missing data
  array[N, T_max] int<lower=-1, upper=2> choice;   // Choices for each subject, -1 indicates missing data
}

parameters {
  vector[3] mu_pr;                // Group-level means
  vector<lower=0>[3] sigma_pr;    // Group-level SDs
  matrix[N, 3] param_raw;         // Subject-level raw parameters
}

transformed parameters {
  matrix[N,3] params;     // Subject-level parameters (only 3 columns are defined below)
  
  // Transform subject-level parameters
  for (n in 1:N) {
    params[n, 1] = Phi_approx(mu_pr[1] + sigma_pr[1] * param_raw[n, 1]) * 10;  // alpha (slope)
    params[n, 2] = mu_pr[2] + sigma_pr[2] * param_raw[n, 2];                   // beta (intercept)
    params[n, 3] = Phi_approx(mu_pr[3] + sigma_pr[3] * param_raw[n, 3]) * 10;  // lambda (discount)
  }
}

model {
  // Priors
  int grainsize = 1;     // grainsize = 1 lets the TBB scheduler partition the slices automatically
  mu_pr ~ std_normal();
  sigma_pr ~ std_normal();
  to_vector(param_raw) ~ std_normal();
  
  // Create an array with indices that will be sliced
  array[N] int indices;
  for (n in 1:N) {
    indices[n] = n;
  }
  
  // Likelihood using reduce_sum with the correct signature
  target += reduce_sum(partial_sum, indices, grainsize,
                       Tsubj, sample, choice, color, proba, params,
                       T_max, I_max);
}

generated quantities {
  array[N, T_max] real log_lik;
  
  // Group-level transformed parameters
  real mu_alpha = Phi_approx(mu_pr[1]) * 10;
  real mu_beta = mu_pr[2];
  real mu_lambda = Phi_approx(mu_pr[3])*10;
  
  // Log-likelihood computation
  for (n in 1:N) {
    for (t in 1:Tsubj[n]) {
      // Initialize to missing value indicator
      log_lik[n, t] = -999; 
      
      // Skip if sample data is missing
      if (sample[n,t] < 0) continue;
      
      // Skip if choice data is missing
      if (choice[n, t] < 0) continue;
      
      vector[2] evidence = rep_vector(0.0, 2);
      for (i in 1:sample[n,t]) {
        // Skip if color or proba data is missing
        if (color[n, t, i] < 0 || proba[n, t, i] < 0) continue;
        
        real p = proba[n, t, i];
        real l = log(p / (1 - p));
        real log_odds = params[n,1] * l + params[n, 2];
        evidence[color[n, t, i]] += exp(params[n,3]*(i-sample[n,t])) * log_odds;
      }
      log_lik[n, t] = categorical_logit_lpmf(choice[n, t] | evidence);
    }
  }
}

Oops. A categorical_logit_lpmf was still in the generated quantities part. I removed it, used Device #2: AMD Radeon Pro 5500M Compute Engine, and eventually got about 10% GPU activity per chain, which I guess suggests that it finally works. Many thanks for the help!
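For anyone landing here later, a sketch of the final sampling call, assuming the AMD card corresponds to platform 0, device 2 in the numbering above (opencl_ids is c(platform, device); the right indices are machine-specific):

fit <- model$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 2,
  opencl_ids = c(0, 2),   # assumed: platform 0, device 2 = AMD Radeon Pro 5500M
  iter_warmup = 20,
  iter_sampling = 20
)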