How to use reduce_sum in for loops?

Hi there. I’m a beginner with Stan, and I’ve run into a problem that I can’t figure out.

I have a model that looks like this:

data{
  int<lower=0> N;//number of participants
  int<lower=0> T;//number of trials
  array[N,T] int<lower=0,upper=1> Z; //participants' choices
  array[N,T] real<lower=0> S; //conditions
  array[N,T] real<lower=0> O; //conditions
  array[N,T] real<lower=0> P; //conditions
}
model {
// priors and other unimportant parts omitted

for (i in 1:N){
    real X;
    real Y;
    
    for (t in 1:T){
      X = alpha[i] * S[i,t] + beta[i] * O[i,t] - gamma[i] * P[i,t];
      Y = 0;
      
      Z[i,t] ~ bernoulli_logit(IT[i] * (X-Y));
    }
  }
}

I want to rewrite this code using reduce_sum, but I have no idea how to handle the nested loops. Could anyone give me some advice? Thanks a lot~

Update: Here is the code I rewrote for multithreading; unfortunately, the output is different from the single-threaded code.

Here is my code with reduce_sum:

functions {
  real partial_sum_lpmf(array[,] int Z_slice, int start, int end,
                   vector alpha, vector beta, vector gamma, vector IT,
                   array[,] real S, array[,] real O, array[,] real P
                   ) {
    real partial_log_lik = 0;

    int N;
    for (n in 1:N) {
      real X;
      real Y;
      
      for (t in 1:end-start+1) {
        X = alpha[n] * S[n, start+t-1] + beta[n] * O[n, start+t-1] - gamma[n] * P[n, start+t-1];
        Y = 0;
        partial_log_lik += bernoulli_logit_lupmf(Z_slice[n,t] | IT[n] * (X - Y));
      }
    }
    return partial_log_lik;
    
  }
}

model {
// priors and other unimportant parts omitted

target += reduce_sum(partial_sum_lpmf, Z, 1, alpha, beta, gamma, IT, S, O, P);
}

I used summarise_draws() from the posterior package; here are the results:

variable        mean    median      sd    mad      q5    q95  rhat ess_bulk
   <chr>          <num>     <num>   <num>  <num>   <num>  <num> <num>    <num>
 1 lp__       -154.     -107.     131.    12.9   -524.   -89.0   1.39     17.9
 2 mu_p[1]      -0.0226    0.0627   0.786  0.622   -1.31   1.27  1.42     43.2
 3 mu_p[2]      -0.0594   -0.113    0.700  0.216   -1.25   1.28  1.51    224. 
 4 mu_p[3]       0.160    -0.0182   0.918  0.599   -1.24   1.83  1.60     23.2
 5 mu_p[4]      -0.208    -0.135    1.27   1.16    -2.42   1.87  1.39     18.0
 6 sigma[1]      1.59      0.0781  20.6    2.83   -11.9   13.3   1.30     58.8
 7 sigma[2]      1.40     -0.862   22.9    0.687  -11.7   15.8   1.64    140. 
 8 sigma[3]     -0.454    -1.68    37.0    3.15   -16.2   15.5   1.55     64.6
 9 sigma[4]     -2.67     -0.222   25.8    3.42   -18.1   14.0   1.39     73.6
10 alpha_p[1]   -0.0108   -0.0174   0.996  1.20    -1.69   1.39  1.32     19.7

The rhat values are larger than 1.01, which means the convergence check doesn’t pass.

P.S. The sampling code I used in cmdstanr:

fit_mcmc <- fit_output_reduce_sum$sample(
    data = dataList,
    seed = 123,
    chains = 4,
    parallel_chains = 4,
    init = 1,
    iter_sampling = 500,
    threads_per_chain = 4,
    iter_warmup = 500
)

P.P.S. The results from Stan, with and without reduce_sum:

# These are the results using reduce_sum
   variable    mean  median    sd   mad      q5    q95 rhat ess_bulk ess_tail
 lp__       -107.20 -106.79 10.56 10.26 -125.17 -90.32 1.01      620      708
 mu_p[1]      -0.02    0.01  1.02  1.02   -1.70   1.63 1.00     3252     1423
 mu_p[2]       0.01    0.02  0.98  0.97   -1.60   1.66 1.00     3641     1523
 mu_p[3]      -0.01   -0.02  0.96  0.94   -1.60   1.61 1.00     3595     1667
 mu_p[4]       0.01    0.03  1.00  0.94   -1.64   1.71 1.00     3943     1383
 sigma[1]      3.18    0.08 28.96  6.32  -20.45  27.50 1.03      271       67
 sigma[2]      3.68   -0.06 32.18  6.23  -22.85  41.77 1.01      886      291
 sigma[3]      0.65   -0.31 52.27  7.58  -27.59  29.86 1.01      518      293
 sigma[4]     -4.22   -0.25 36.33  7.83  -33.41  23.59 1.02      440      128
 alpha_p[1]    0.00   -0.01  0.96  0.95   -1.54   1.57 1.00     3009     1214
# These are the results without reduce_sum
  variable    mean  median     sd   mad      q5    q95 rhat ess_bulk ess_tail
 lp__       -200.75 -106.65 172.62 16.79 -553.88 -88.16 1.70        6       10
 mu_p[1]      -0.03    0.08   0.45  0.36   -0.85   0.50 3.28        4       14
 mu_p[2]      -0.13   -0.12   0.09  0.08   -0.32   0.00 1.36        9       12
 mu_p[3]       0.33   -0.02   0.83  0.22   -0.50   1.88 2.17        5       12
 mu_p[4]      -0.43   -0.29   1.46  1.56   -2.48   2.17 1.90        5       15
 sigma[1]      0.00    0.07   1.48  1.89   -2.23   1.83 3.24        4       14
 sigma[2]     -0.88   -0.94   0.26  0.29   -1.26  -0.41 2.02        5       16
 sigma[3]     -1.55   -1.84   1.56  1.49   -3.50   0.97 2.65        4       11
 sigma[4]     -1.13   -0.21   2.41  1.25   -5.91   1.65 1.74        6       11
 alpha_p[1]   -0.02   -0.03   1.03  1.44   -1.79   1.24 1.98        5       41

If you are looking for a speed-up, I would transform your S, O, and P data matrices into one large matrix of size N * T by 3 (flattening each of them into a vector and combining the three into a matrix), and change the alpha, beta, and gamma parameters (which would be defined in your parameters block) into a matrix of N participants by 3.

Then you can use the more efficient bernoulli_logit_glm; see 15.3 Bernoulli-logit generalized linear model (Logistic Regression) in the Stan Functions Reference. (The alpha argument of the function is just a vector of zeros, since you didn’t define your alpha as an intercept.)
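Since your alpha, beta, and gamma differ across participants, one shared coefficient vector won’t cover the whole flattened matrix, so a rough sketch of one way to use the GLM function here is to call it once per participant. The X data structure below is just my illustrative choice (a T-by-3 matrix per participant built from your S, O, and P), and the priors are omitted:

data {
  int<lower=0> N;
  int<lower=0> T;
  array[N, T] int<lower=0, upper=1> Z;
  array[N] matrix[T, 3] X;  // X[i] has columns S[i,], O[i,], P[i,]
}
parameters {
  vector[N] alpha;
  vector[N] beta;
  vector[N] gamma;
  vector[N] IT;
}
model {
  // priors omitted
  for (i in 1:N) {
    // per-participant coefficients, scaled by IT[i]
    vector[3] b = IT[i] * [alpha[i], beta[i], -gamma[i]]';
    Z[i] ~ bernoulli_logit_glm(X[i], 0, b);  // 0 = no intercept term
  }
}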

If that still isn’t fast enough, then you can slice over the parameters by participants using reduce_sum.

Thanks for your advice. I want to use reduce_sum to handle it, splitting over participants so that I parallelize over N. Could you give me some advice? Thanks~

Sure. OK, so based on your initial code structure (I haven’t had time to read through your other updates), you basically move everything inside the outer for loop (i in 1:N) into a partial sum function, and slice over the index i in 1:N.

Sort of like an apply in R.

You’ll need to follow the signature requirements for the partial sum function described in the docs.
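Roughly, using the names from your first post, the partial sum function and the call would look something like the sketch below (since Y is always 0 I’ve dropped it, and the grainsize of 1 is just a placeholder):

functions {
  // Z_slice holds rows start..end of Z, i.e. one row per participant in this chunk
  real partial_sum(array[,] int Z_slice, int start, int end,
                   array[,] real S, array[,] real O, array[,] real P,
                   vector alpha, vector beta, vector gamma, vector IT, int T) {
    real lp = 0;
    for (n in 1:(end - start + 1)) {
      int i = start + n - 1;  // index into the full data and parameter arrays
      for (t in 1:T) {
        real X = alpha[i] * S[i, t] + beta[i] * O[i, t] - gamma[i] * P[i, t];
        lp += bernoulli_logit_lpmf(Z_slice[n, t] | IT[i] * X);
      }
    }
    return lp;
  }
}
model {
  // priors omitted
  target += reduce_sum(partial_sum, Z, 1, S, O, P, alpha, beta, gamma, IT, T);
}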

Thanks for your time and your reply! I tried your suggestion and reduce_sum worked. However, the speed-up was only slight (perhaps 10%, on an AMD 7950 with 16 cores and 32 threads).

Below is my full code from before the modification. I’d appreciate any suggestions to speed this code up!

It’s code for an ultimatum game (UG) task, modified from the hBayesDM code: hBayesDM/commons/stan_files/ug_delta.stan at develop · CCS-Lab/hBayesDM · GitHub

data {
  int<lower=0> N; //number of participants
  int<lower=0> T; // number of trials for each participant
  array[N,T] int<lower=0,upper=1> Choice;
  array[N,T] real<lower=0> benefitS;
  array[N,T] real<lower=0> benefitO;
  array[N,T] real<lower=0> lost;
}


parameters {
  vector[4] mu_p;
  vector[4] sigma;
  
  vector[N] alpha;
  vector[N] beta;
  vector[N] gamma;  
  vector[N] IT;
}


model {
  mu_p ~ normal(0,1);
  sigma ~ cauchy(0,5);
  
  alpha_p ~ normal(0,1);
  beta_p ~ normal(0,1);
  gamma_p ~ normal(0,1);
  IT_p ~ normal(0,1);
  
  for (i in 1:N){
    real accept;
    real reject;
    
    for (t in 1:T){
      accept = alpha[i]*benefitS[i,t] + beta[i]*benefitO[i,t] - gamma[i]*lost[i,t];
      reject = 0;
      
      Choice[i,t] ~ bernoulli_logit(IT[i]*(accept-reject));
    }
  }
}


generated quantities{
  real<lower=-10,upper=10> mu_alpha; 
  real<lower=-10,upper=10> mu_beta; 
  real<lower=-10,upper=10> mu_gamma; 
  real<lower=0,upper=10> mu_IT;
  
  array[N,T] real accept;
  
  array[N] real log_lik;
  
  real<lower=0, upper=1> prob;
  array[N,T] int BG_pred;
  
  mu_alpha = Phi_approx(mu_p[1])*20-10;
  mu_beta = Phi_approx(mu_p[2])*20-10;
  mu_gamma = Phi_approx(mu_p[3])*20-10;
  mu_IT = Phi_approx(mu_p[4])*10;  
  
  
  accept = rep_array(-9999.0, N, T);
  BG_pred = rep_array(-999, N, T);
  
  
  {
    for (i in 1:N){
      log_lik[i] = 0;
      for (t in 1:T){
        accept[i,t] = alpha[i]*benefitS[i,t] + beta[i]*benefitO[i,t] - gamma[i]*lost[i,t];
      
        log_lik[i] = log_lik[i] + bernoulli_logit_lpmf(Choice[i,t] | IT[i]*(accept[i,t]-0));
        
        prob = inv_logit(IT[i]*(accept[i,t]-0));
        
        BG_pred[i,t] = bernoulli_rng(prob);
      }
    }
  }
  
}

I’m a little confused about which code you are referring to at this point, as your latest post does not seem to include the modifications you mentioned.

For speed-up, my earlier suggestion should give you more; then, if you want to go further, reduce_sum may make more sense.

Also, the speed-up from reduce_sum depends on several factors, including the way you code it, the way you parameterize the slicing over cores (e.g. the grainsize), the number of cores, etc. While your machine has 16 cores, it will make a large difference whether those are physical or virtual cores.
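For example, you could pass the grainsize in as data so you can experiment with it without recompiling. This is only a sketch, assuming a partial sum function over participants like the one sketched earlier, but using the variable names from your full model:

data {
  // ... your existing data ...
  int<lower=1> grainsize;  // with reduce_sum, 1 leaves chunking to the scheduler; larger values force bigger chunks
}
model {
  // ... priors ...
  target += reduce_sum(partial_sum, Choice, grainsize,
                       benefitS, benefitO, lost,
                       alpha, beta, gamma, IT, T);
}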

Oh! Sorry, I forgot to post my modified code:

functions {
    real partial_sum(array[,] int Choice,
                     int start, int end,
                     array[,] real benefitS, array[,] real benefitO, array[,] real lost,
                     vector alpha, vector beta, vector gamma, vector IT, int T) {

            real partial_log_lik  = 0;
            
            for (n in 1:(end - start + 1)) {
                int i = start + n - 1;  // index into the full participant arrays
                real accept;
                real reject;
                for (t in 1:T) {
                    accept = alpha[i]*benefitS[i,t] + beta[i]*benefitO[i,t] - gamma[i]*lost[i,t];
                    reject = 0;
                    partial_log_lik += bernoulli_logit_lpmf(Choice[n,t] | IT[i] * (accept - reject));
                }
            }
            return partial_log_lik;           
    }
}

// [data and parameters blocks same as above, skipped]
model {
  mu_p ~ normal(0,1);
  sigma ~ cauchy(0,5);
  
  alpha_p ~ normal(0,1);
  beta_p ~ normal(0,1);
  gamma_p ~ normal(0,1);
  IT_p ~ normal(0,1);
  

  target += reduce_sum_static(partial_sum, Choice, 1,
                              benefitS, benefitO, lost,
                              alpha, beta, gamma, IT, T);
}

Apologies for my carelessness, and thank you for your patience!