Parallelization

Hello everyone,
I’m trying to parallelize my Stan code. I used reduce_sum.

Q1: When running multi-chain, multi-threaded sampling with OpenCL, I see no activity on the GPU… Did I miss something?
Compilation:

model <- cmdstan_model(
  stan_file = './stan/fit.stan',
  force_recompile = TRUE,
  cpp_options = list(stan_opencl = TRUE, stan_threads = TRUE),
  stanc_options = list("O1")
)

Sampling:

fit <- model$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 2,
  opencl_ids = c(0, 0),
  iter_warmup = 20,
  iter_sampling = 20
)

There is no error; I simply don’t see any activity on the GPUs. I tried c(0, 1) and c(0, 2) (as I was not completely sure which processor was actually used by OpenCL), with the same result.

Q2: Is it possible to use MPI with reduce_sum, or is map_rect required (as old posts suggest)? I just wanted to be sure…

  • Operating System: MacBook Pro, macOS Sequoia, AMD Radeon Pro 5500M 4 GB and Intel UHD Graphics 630 1536 MB
  • CmdStan Version: 2.36.0 (with cmdstanr)

Many thanks!


I’m not sure how O1 and use-opencl interact in the Stan compiler (@stevebronder might know better), but I would try removing O1 as a first test. It’s also possible your model isn’t using functions for which we have OpenCL definitions, or isn’t doing enough work per evaluation to meaningfully show up.
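A minimal sketch of that first test, reusing your compile call with O1 dropped (and, assuming a recent cmdstanr, exe_info() to confirm the binary was actually built with threading and OpenCL):

model <- cmdstan_model(
  stan_file = './stan/fit.stan',
  force_recompile = TRUE,
  cpp_options = list(stan_opencl = TRUE, stan_threads = TRUE)
)

# Report the flags the executable was compiled with;
# stan_opencl and stan_threads should both be TRUE
model$exe_info()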

Thank you very much for the answer. I use categorical_lpmf. I tried again without the O1 option, but the result is the same: I see about 200% activity on the CPUs, and nothing on the GPUs.

It appears we do not have any OpenCL code for this function. We do for categorical_logit_glm_lpmf.

Thank you very much for the clarification. And do you know if reduce_sum is compatible with MPI?

I believe our only MPI support is still for map_rect.
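For reference, a rough sketch of how an MPI build of a map_rect model is requested through cmdstanr (option names follow the sample_mpi() documentation; the model file name is a hypothetical map_rect rewrite of your model):

# MPI needs an MPI toolchain (e.g. OpenMPI) and a map_rect-based model
mpi_options <- list(STAN_MPI = TRUE, CXX = "mpicxx", TBB_CXX_TYPE = "gcc")
model_mpi <- cmdstan_model(
  stan_file = './stan/fit_map_rect.stan',  # hypothetical map_rect version
  cpp_options = mpi_options
)

# Launch under mpiexec with 4 processes
fit_mpi <- model_mpi$sample_mpi(data = data_list, mpi_args = list("n" = 4))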

I adapted the code to use categorical_logit_glm_lpmf. Same result. Below is the code. I tried with Device #1: Intel(R) UHD Graphics 630 and Device #2: AMD Radeon Pro 5500M Compute Engine (by the way, which one should I use?), and in both cases GPU usage was 0%. I certainly did something wrong, but cannot figure out what. Any clue would be greatly appreciated!

functions {
  real partial_sum(array[] int slice_n, 
                   int start, int end,
                   array[] int Tsubj, 
                   array[,] int sample, 
                   array[,] int choice,
                   array[,,] int color, 
                   array[,,] real proba, 
                   matrix params, 
                   int T_max, 
                   int I_max) {
    
    real log_prob = 0;
    
    // Loop over the slice of data
    for (n in start:end) {
      int trials_for_subject = Tsubj[n];  // Extract number of trials for subject n
      
      for (t in 1:trials_for_subject) {  // Loop through trials for this subject
        // Skip if sample data is missing/invalid
        if (sample[n,t] < 0) continue;
        
        vector[2] evidence = rep_vector(0.0, 2);
        for (i in 1:sample[n,t]) {  // Loop through samples
          // Skip if color or proba data is missing (mirrors the guard in
          // generated quantities)
          if (color[n, t, i] < 0 || proba[n, t, i] < 0) continue;
          
          real p = proba[n, t, i];  // Probability for the current sample
          real l = log(p / (1 - p));  // Log odds of the current sample
          real log_odds = params[n,1] * l + params[n, 2];
          evidence[color[n, t, i]] += exp(params[n,3]*(i-sample[n,t])) * log_odds;  // Discounted evidence
        }
        
        // Skip if choice data is missing/invalid
        if (choice[n, t] < 0) continue;
        log_prob += categorical_logit_glm_lpmf(
          { choice[n, t] } |        // Outcome array of length 1
          rep_matrix(1, 1, 1),      // Predictor matrix (1x1)
          rep_vector(0, 2),         // Intercept vector of length 2
          to_matrix(evidence')      // Coefficients: logits as a 1x2 matrix
        );

      }
    }
    return log_prob;
  }
}

data {
  int<lower=1> N;                 // Number of subjects
  int<lower=1> T_max;             // Maximum number of trials across all subjects
  int<lower=1> I_max;             // Maximum number of samples across all trials
  array[N] int<lower=1> Tsubj;    // Number of trials per subject
  array[N, T_max] int<lower=-1> sample;   // Number of samples per subj and trial, -1 indicates missing data
  array[N, T_max, I_max] int<lower=-1, upper=2> color; // Info item colors, -1 indicates missing data
  array[N, T_max, I_max] real<lower=-1, upper=1> proba; // Probabilities for info items, -1 indicates missing data
  array[N, T_max] int<lower=-1, upper=2> choice;   // Choices for each subject, -1 indicates missing data
}

parameters {
  vector[3] mu_pr;                // Group-level means
  vector<lower=0>[3] sigma_pr;    // Group-level SDs
  matrix[N, 3] param_raw;         // Subject-level raw parameters
}

transformed parameters {
  matrix[N,3] params;     // Subject-level parameters (only 3 columns are defined below)
  
  // Transform subject-level parameters
  for (n in 1:N) {
    params[n, 1] = Phi_approx(mu_pr[1] + sigma_pr[1] * param_raw[n, 1]) * 10;  // alpha (slope)
    params[n, 2] = mu_pr[2] + sigma_pr[2] * param_raw[n, 2];                   // beta (intercept)
    params[n, 3] = Phi_approx(mu_pr[3] + sigma_pr[3] * param_raw[n, 3]) * 10;  // lambda (discount)
  }
}

model {
  // Priors
  int grainsize = 1;     // grainsize = 1 lets the TBB scheduler partition the slices automatically
  mu_pr ~ std_normal();
  sigma_pr ~ std_normal();
  to_vector(param_raw) ~ std_normal();
  
  // Create an array with indices that will be sliced
  array[N] int indices;
  for (n in 1:N) {
    indices[n] = n;
  }
  
  // Likelihood using reduce_sum with the correct signature
  target += reduce_sum(partial_sum, indices, grainsize,
                       Tsubj, sample, choice, color, proba, params,
                       T_max, I_max);
}

generated quantities {
  array[N, T_max] real log_lik;
  
  // Group-level transformed parameters
  real mu_alpha = Phi_approx(mu_pr[1]) * 10;
  real mu_beta = mu_pr[2];
  real mu_lambda = Phi_approx(mu_pr[3])*10;
  
  // Log-likelihood computation
  for (n in 1:N) {
    for (t in 1:Tsubj[n]) {
      // Initialize to missing value indicator
      log_lik[n, t] = -999; 
      
      // Skip if sample data is missing
      if (sample[n,t] < 0) continue;
      
      // Skip if choice data is missing
      if (choice[n, t] < 0) continue;
      
      vector[2] evidence = rep_vector(0.0, 2);
      for (i in 1:sample[n,t]) {
        // Skip if color or proba data is missing
        if (color[n, t, i] < 0 || proba[n, t, i] < 0) continue;
        
        real p = proba[n, t, i];
        real l = log(p / (1 - p));
        real log_odds = params[n,1] * l + params[n, 2];
        evidence[color[n, t, i]] += exp(params[n,3]*(i-sample[n,t])) * log_odds;
      }
      log_lik[n, t] = categorical_logit_lpmf(choice[n, t] | evidence);
    }
  }
}

Oops. A categorical_logit_lpmf was still in the generated quantities part. I removed it, used Device #2: AMD Radeon Pro 5500M Compute Engine, and eventually got about 10% GPU activity per chain, which I guess suggests that it finally works. Many thanks for the help!
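For anyone landing here later, a sketch of the final sampling call, assuming the AMD card corresponds to platform 0, device 2 in the numbering above (opencl_ids is c(platform, device); the right indices are machine-specific):

fit <- model$sample(
  data = data_list,
  seed = 123,
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 2,
  opencl_ids = c(0, 2),   # assumed: platform 0, device 2 = AMD Radeon Pro 5500M
  iter_warmup = 20,
  iter_sampling = 20
)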