Can't get reduce_sum to help with model runtime

Hi,

This is my first attempt at using reduce_sum to improve my model’s run time. I have pasted the model code below in case anyone is kind enough to take a look, and I’d be happy to provide more context on the model itself if needed. The reduce_sum model seems to be working correctly (results are identical to the non-reduce_sum version, which I implemented first), but I am not seeing any improvement in run time. I read through several posts here on the forum and I wonder whether the way I have structured the model is creating too much overhead, but unfortunately I am not experienced enough to figure it out. I wonder if someone with much more experience than me could spot the culprit just by eyeballing the code (if indeed there is a culprit; it’s also possible that this model simply does not lend itself to the use of reduce_sum?). Note that I am currently slicing over 30 years of data and I am using two chains, 4 cores per chain, and a grainsize of 1. Grainsizes of 2-4 did not make a difference. With so few cores I am not expecting huge performance improvements, but I was expecting at least a little. I could get access to more cores, but I would first like to make sure that I am not doing something fundamentally clumsy in how I wrote the code, given my lack of experience. Any insight would be greatly appreciated, and again I’d be happy to provide more details.
Thanks!

My current setup:
Windows
Rstan 2.32.3
Stan 2.26.1
Cmdstanr 0.7.0
CmdStan 2.33.1

functions {

//1. Function that calculates model predictions (lyhat) for one year

   array[] vector year_model(
                       matrix pet_ac_mat,   //For each year, this has size 79000*20 
                       int nnode,           //Equal to 86798
                       vector pcp_dir,      //For each year, this has size 79000     
                       vector param,         //Vector of 20 shared parameters
                       int ncat,          //Equal to 79000
                       int numobs,        //Number of observations for the dependent variable in this year
                       array[] int from_cd, //This vector has size 79000
                       array[] int to_cd,   //This vector has size 79000
                       vector depvar,           //Dependent variable column for this year; size 79000, with non-zero entries for observed values
                       array[] real Dfac,       //This vector has size 79000 
                       array[] real Id,         //This vector has size 79000
                       array[] int iftr,      //This vector has size 79000   
                       array[] real transf) {   //This vector has size 79000
 
     array[nnode] real pnode;
     vector[numobs] lyhat;     
     vector[ncat] incremental;        
     int fnode;
     int tnode;
     int ie;
     real cat_total; 
     real cat_total_scaled; 
     vector[numobs] ldepvar;

     ie = 0;

     pnode = rep_array(0.0, nnode);

    incremental = pcp_dir - pet_ac_mat * param;

    for (i in 1:ncat) {  //ncat = 79000
         
       fnode = from_cd[i];                                                          
       tnode = to_cd[i];                           
         
       cat_total = incremental[i] + transf[i] * pnode[fnode];  

       if(cat_total<0) {
          cat_total = 0.0;
         }

       if(depvar[i] != 0.0 ) {

          if(Id[i] == 1){ 
             cat_total_scaled = incremental[i] * Dfac[i];
          } else {
            if(Id[i] == 2){ 
               cat_total_scaled = rchpld * Dfac[i];
               } else { 
                 cat_total_scaled = (cat_total - incremental[i]) * Dfac[i];
                 }
             }

          if(i == 79826){ 
             cat_total_scaled = cat_total_scaled + pnode[2006];
            }

          ie = ie + 1; 
          if (cat_total_scaled > 0) {
            lyhat[ie] = log(cat_total_scaled);    
          } else {
            lyhat[ie] = log(1.0);    
          }

          ldepvar[ie] = log(depvar[i]);

          } // end depvar if

       pnode[tnode] = pnode[tnode] + iftr[i] * cat_total;   
       
     }  // end reach loop

     return({ lyhat, ldepvar });

  } //End of year_model function


//2. Partial_sum function that calculates the likelihood contribution for years m = start ... end
//   by using function above

real year_lpdf(array[] matrix pet_ac_mat_slice,   //the sliced argument: one 79000*20 matrix for each year in start:end
                    int start, int end,  
                    int nnode,
                    matrix pcp_dir,      
                    vector param, 
                    int ncat,
                    array[] int numobs,        
                    array[] int from_cd,
                    array[] int to_cd,
                    matrix depvar,       
                    array[] real Dfac,      
                    array[] real Id,      
                    array[] int iftr,    
                    array[] real transf,
                    real bsigma
                    ) {
    
    real log_lik = 0.0;                     
    
    for (m in start:end) {            	
         int m_slice = m - start + 1; 	//index within the slice (re-starts at 1 for each new slice)
         array[2] vector[numobs[m]] local = year_model( 
          pet_ac_mat_slice[m_slice],
          nnode,      
          col(pcp_dir, m),   
          param,
          ncat,
          numobs[m],
          from_cd,
          to_cd,
          col(depvar, m),   
          Dfac,
          Id,
          iftr,
          transf);  
      
     vector[numobs[m]] lyhat;
     vector[numobs[m]] ldepvar;
     lyhat = local[1];
     ldepvar = local[2];

      log_lik += normal_lpdf(ldepvar | lyhat, bsigma);
    }

    return(log_lik);
}

} //End of functions block


   data { 
     int<lower=0> ncat;         	//79000
     int<lower=0> nnode;          	//86798 
     int nyears;                        //30
     int<lower=0> numobs_max;           //448
     array[nyears] int numobs;       	//number of observations in each year
     array[ncat] int iftr;      	//size of array = 79000
     matrix[ncat, nyears] pcp_dir; 	//size of matrix = 79000*30
     array[nyears] matrix[ncat, 20] pet_ac_mat;	//array of 30 matrices, each with size = 79000*20
     array[ncat] int from_cd; 	//size of array = 79000
     array[ncat] int to_cd;   	//size of array = 79000
     array[ncat] real transf;    	//size of array = 79000
     matrix[ncat, nyears] depvar; 	//size of matrix = 79000*30	
     array[ncat] real Dfac;           //size of array = 79000
     array[ncat] real Id;             //size of array = 79000
   }

   parameters {
     vector<lower=0, upper=10>[20] param;
     real<lower=0, upper=5> bsigma;
   }

   model {
     param ~ normal(0,5);
     bsigma ~ normal(0,5); 
      
     target += reduce_sum(
      year_lpdf, 
      pet_ac_mat, 
      1,
      nnode,
      pcp_dir,
      param,
      ncat,
      numobs,
      from_cd,
      to_cd,
      depvar,
      Dfac,
      Id,
      iftr,
      transf,
      bsigma);   

} // End of model statement

Did you compile the model with threading enabled, i.e. with the command make STAN_THREADS=TRUE my_folder/my_model?

I am compiling the model using the following command in R:
mod <- cmdstan_model("path_to_model", cpp_options = list(stan_threads = TRUE))
I also manually added the line:
STAN_THREADS=true
to the .cmdstan\cmdstan-2.33.1\make\local file

Right, sorry, you’re using Rstan on Windows. I’m having some trouble parsing your model code, which can be particularly hard to read when there is a partial sum. But I can say that when programming my own parallelized code, I kept all of the work of the function inside of the likelihood function, and used the partial sum only to hand dynamic slices of data to the likelihood function. Here is my code:

functions {

 //Custom quantile function for lognormal distribution
 matrix log_normal_qf(vector x, vector m, vector s) {

  //takes a vector sampling the x-axis in terms of cumulative density, plus vectors of mu and sigma parameters
  return exp(rep_matrix(s' * sqrt(2), size(x)) .* rep_matrix(inv_erfc(-2 * x + 2), size(m)) + rep_matrix(m', size(x)));

 }

 //Likelihood model for CASANDRE
 real casandre_log(array[] matrix resps, vector guess, matrix sds, matrix sm, vector nconf, vector pconf) {

  //declare and initialize log likelihood variable
  real llhC;
  llhC = 0;

  //loop over participants to calculate log likelihood
  for (n in 1:size(resps)) {

   //declare the index-sizing variable
   array[rows(resps[n])] int m;

   //build the logical vector of non-zero trials
   for (tt in 1:size(m)) {
      m[tt] = sum(resps[n][tt]) == 1;
   }

   //declare the likelihood variable
   matrix[rows(sm),4] lhC;

   //check if subject has any non-zero data before running
   if (sum(resps[n]) != 0) {

    //declare incrementing and index variables
    int t;
    array[sum(m)] int ind;

    //initialize incrementing variable
    t = 1;

    //declare and calculate mu parameters for normal pdfs
    matrix[rows(sm),rows(sds)] avgs;
    avgs = rep_matrix(sm[,n],rows(sds)).*rep_matrix(sds[,n]',rows(sm));
  
    //loop over trials
    for (tr in 1:rows(sm)){

     //check if trial has any non-zero data before running
     if (sum(resps[n][tr]) != 0) {

      //declare sample vector
      matrix[3,rows(sds)] raws;

      //sample the cumulative density of normals along the transformed x-axis for each confidence bin
      for (rws in 1:rows(sds)) {
       raws[1,rws] = normal_cdf(-nconf[n],avgs[tr,rws],sds[rws,n]);
      }
      for (rws in 1:rows(sds)) {
       raws[2,rws] = normal_cdf(0,avgs[tr,rws],sds[rws,n]);
      }
      for (rws in 1:rows(sds)) {
       raws[3,rws] = normal_cdf(pconf[n],avgs[tr,rws],sds[rws,n]);
      }

      //declare the cumulative likelihood variable
      vector[3] ratiodist;

      //take the mean of the sampled densities
      ratiodist[1] = mean(raws[1]);
      ratiodist[2] = mean(raws[2]);
      ratiodist[3] = mean(raws[3]);

      //implement a lapse rate parameter to absorb unmodeled noise, and calculate likelihoods of each choice possibility in the trial
      lhC[tr,1] = ((guess[n]/4) + ((1-guess[n])*ratiodist[1]));
      lhC[tr,2] = ((guess[n]/4) + ((1-guess[n])*(ratiodist[2]-ratiodist[1])));
      lhC[tr,3] = ((guess[n]/4) + ((1-guess[n])*(ratiodist[3]-ratiodist[2])));
      lhC[tr,4] = ((guess[n]/4) + ((1-guess[n])*(1-ratiodist[3])));

      //add this trial to the index
      ind[t] = tr;

      //increment the index
      t = t + 1;

     }

    }

    //multiply the choice matrix by the log of the likelihood matrix
    llhC += sum(columns_dot_product(resps[n][ind], log(lhC[ind])));

   }
  
  }

  //return the log likelihood for all data given the sampled parameters
  return llhC;
  
 }

 //Partial sum function
 real partial_sum_casandre_log(array[] matrix slice_n_resps, int start, int end, vector guess, matrix sds, matrix sm, vector nconf, vector pconf) {

 //dynamically slice data into the log likelihood function
 return casandre_log(slice_n_resps, guess[start:end], sds[:,start:end], sm[:,start:end], nconf[start:end], pconf[start:end]);

 }
  
}
data {

 //declare the number of conditions, number of contrast levels, number of subjects, and the number of trials in the largest unpadded data set
 int ncond;
 int ncont;
 int ns;
 int nt;

 //declare the response matrix (one subject per array)
 array[ncond,ncont,ns] matrix[nt,4] resps;

 //declare the orientation matrix (ever subject in each array)
 array[ncond,ncont] matrix[nt,ns] oris;

 //declare the number of x-axis samples and the vector sampling the x-axis in terms of cumulative density
 int sampn;
 vector[sampn] sampx;

}
parameters {

 //Parameters
 vector<lower=0,upper=1>[ns] guess; //lapse rate
 matrix<lower=0>[ncont,ns] sens; //stimulus sensitivities
 matrix[ncont,ns] crit; //decision criteria
 matrix<lower=0>[ncond,ns] meta; //meta-uncertainties
 matrix<lower=0>[ncond,ns] nconf; //negative confidence criteria
 matrix<lower=0>[ncond,ns] pconf; //positive confidence criteria

 //Hyperparameters
 vector[ncont] ssm; //mu hyperparameters of each contrast level's stimulus sensitivity lognormal prior
 vector<lower=0>[ncont] sss; //sigma hyperparameters of each contrast level's stimulus sensitivity lognormal prior
 vector[ncont] scm; //mu hyperparameters of each contrast level's decision criterion normal prior
 vector<lower=0>[ncont] scs; //sigma hyperparameters of each contrast level's s decision criterion normal prior
 vector[ncond] mum; //mu hyperparameters of each condition's meta-uncertainty lognormal prior
 vector<lower=0>[ncond] mus; //sigma hyperparameters of each condition's meta-uncertainty lognormal prior
 vector[ncond] nccm; //mu hyperparameters of each condition's negative confidence criterion lognormal prior
 vector<lower=0>[ncond] nccs; //sigma hyperparameters of each condition's negative confidence criterion lognormal prior
 vector[ncond] pccm; //mu hyperparameters of each condition's positive confidence criterion lognormal prior
 vector<lower=0>[ncond] pccs; //sigma hyperparameters of each condition's positive confidence criterion lognormal prior

}
model {

 //Hyperpriors
 ssm  ~ normal(0,1);
 sss  ~ lognormal(0,1);
 scm  ~ normal(0,1);
 scs  ~ lognormal(0,1);
 mum  ~ normal(0,1);
 mus  ~ lognormal(0,1);
 nccm  ~ normal(0,1);
 nccs  ~ lognormal(0,1);
 pccm ~ normal(0,1);
 pccs ~ lognormal(0,1);

 //Priors
 guess  ~ beta(1,197.0/3.0);

 for (cond in 1:ncond) {
  meta[cond]   ~ lognormal(mum[cond],mus[cond]);
  nconf[cond]   ~ lognormal(nccm[cond],nccs[cond]);
  pconf[cond] ~ lognormal(pccm[cond],pccs[cond]);
 }

 for (cont in 1:ncont) {
  sens[cont] ~ lognormal(ssm[cont],sss[cont]);
  crit[cont] ~ normal(scm[cont],scs[cont]);
 }

 //Likelihood model (with local variable calculations)
 for (cond in 1:ncond) {

  //declare the transformed x-axis matrices
  array[ncond] matrix[sampn,ns] xtrans;
  array[ncond] matrix[sampn,ns] sds;

  //calculate the transformed x-axis matrices
  xtrans[cond] = log_normal_qf(sampx,-0.5*log1p(meta[cond]'.^2),sqrt(log1p(meta[cond]'.^2)));
  sds[cond] = 1./xtrans[cond];

  for (cont in 1:ncont) {

   //check if condition and contrast level combination has non-zero data
   if (sum(abs(oris[cond][cont])) != 0) {

    //declare matrices for calculating each contrast level's transformed data
    array[ncont] matrix[nt,ns] om;
    array[ncont] row_vector[ns] cm;
    array[ncont] matrix[nt,ns] sm;

    //orientations and decision criteria in terms of standard deviations
    om[cont] = oris[cond][cont].*rep_matrix(sens[cont],nt);
    cm[cont] = crit[cont].*sens[cont];
    
    //transform the data
    sm[cont] = om[cont]-rep_matrix(cm[cont],nt);

    //hand the data and parameters to the reduce sum wrap function for dynamic within-chain parallelization
    target += reduce_sum(partial_sum_casandre_log, resps[cond][cont], 1, guess, sds[cond], sm[cont], nconf[cond]', pconf[cond]');

   }

  }

 }

}

I mostly just want to draw your attention to this:

 //Partial sum function
 real partial_sum_casandre_log(array[] matrix slice_n_resps, int start, int end, vector guess, matrix sds, matrix sm, vector nconf, vector pconf) {

 //dynamically slice data into the log likelihood function
 return casandre_log(slice_n_resps, guess[start:end], sds[:,start:end], sm[:,start:end], nconf[start:end], pconf[start:end]);

 }

Also, it’s kind of hard for me to tell, but it looks like you might be using hard-coded indices inside your reduce_sum function, which may not work if the indices don’t match the sizes of the data slices being read in. Where the data being sliced could have a size mismatch with your indices, you can recode the indices to be flexible using the size, rows, or cols functions (see my likelihood function above for examples).
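A minimal, generic sketch of that idea (hypothetical names, not the model from this thread): take the loop bound from the slice itself with size(), and use start when indexing into shared containers, so the absolute positions stay aligned with whatever slice the scheduler hands over.

functions {
  // Hypothetical partial-sum function: y_slice is the sliced argument,
  // mu is a shared parameter vector indexed by absolute position.
  real partial_sum_example(array[] vector y_slice, int start, int end,
                           vector mu, real sigma) {
    real lp = 0;
    for (i in 1:size(y_slice)) {  // the bound follows the slice, not the full data size
      lp += normal_lpdf(y_slice[i] | mu[start + i - 1], sigma);
    }
    return lp;
  }
}

In the model block this would be called as target += reduce_sum(partial_sum_example, y, grainsize, mu, sigma), with y declared as an array of vectors in the data block, so the sizes always match whatever reduce_sum decides to slice.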

Is a grainsize of just 1 really the right choice here? The grainsize controls how large (roughly) the partial sums should be. As a starting point I’d pick it as N / (2 * cores), with N being the number of units you sum your likelihood over.

… and you should look into profiling things, obviously…the Stan blog has some bits on this.
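For reference, here is a minimal sketch of Stan’s built-in profiling (profile blocks, available since Stan 2.26) applied to the model block posted above; the section labels are arbitrary, and the timings can be read back afterwards, e.g. with fit$profiles() in cmdstanr.

model {
  // label the prior and likelihood work separately so the per-section
  // timings show where the run time actually goes
  profile("priors") {
    param ~ normal(0, 5);
    bsigma ~ normal(0, 5);
  }
  profile("likelihood") {
    target += reduce_sum(year_lpdf, pet_ac_mat, 1, nnode, pcp_dir, param,
                         ncat, numobs, from_cd, to_cd, depvar,
                         Dfac, Id, iftr, transf, bsigma);
  }
}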

Thanks for your response. As far as I can tell I used a similar approach to what you show in your code (thanks for sharing!!), in that I kept all of the calculations in a function that I called year_model and then simply used the partial sum function (which I called year_lpdf) to pass slices of data to year_model. But maybe I misunderstood your comment?
I have been double-checking my indices inside the reduce_sum function and as far as I can tell they work correctly. One question: am I right in thinking that if there were an issue with indices inside the reduce_sum function, the model would simply not work, or not give expected results? I am asking because my model does work, and does provide expected results; there’s simply no change in performance. Thanks for your help with this!
P.S. I provided a bit more context about my model in my response to wds15, in case you are kind enough to take a look.

Thank you for responding. First I’d like to preface my message by saying that I am totally new to reduce_sum, and it is very likely that I am either doing something obviously clumsy, or I have somehow misinterpreted how reduce_sum should be used, or, finally, I should simply not expect any performance boost under my current setup.
That said, I did start by profiling my non-parallelized code, and found that the speed bottleneck was definitely a for-loop that involves looping over 30 years of data and, for each year, performing some matrix multiplications with matrices and vectors that have sizes of approximately 79000x20. I first did my best to optimize the for-loop, e.g. by taking things out of it and vectorizing what I could, but maybe there’s still room for improvement there. After hitting a plateau with that strategy, I thought I would pack the for-loop calculations into a function (called year_model in the code I posted) that performs those calculations for one year’s worth of data. I hoped that passing that function to the partial sum and slicing over the 30 years of data would provide me with some performance boost (but maybe that’s too naïve of me?). As a result, my N is 30, and with 8 cores N/(2x8) would give a grainsize of approximately 2, which I tried.
Just as an aside, being completely new to this, I built my code following an example that was presented at the 2020 StanCon, where the author had 20 countries, built a function that would perform country-specific calculations, and then sliced over those 20 countries. I obviously don’t expect anyone to go and dig into that case study; I just thought I would mention it in case anyone was already familiar with it, since it was presented at the 2020 StanCon.
One other thing I’d like to mention is that my model has 20 parameters and they are shared, so not sliced over. Could that be creating substantial overhead? Based on what I read on the forum, shared parameters typically result in substantial overhead when they number in the tens of thousands, but these parameters get multiplied by large matrices of data inside the partial_sum function, so I am not sure whether that is an issue.

Lastly, I could get access to more cores, but I really suspect that I am simply doing something wrong here that someone with experience would be able to point out to me very easily, before I take this any further with more cores.

Thanks

What I was suggesting is having everything you’re putting in your partial sum function in its own function, and having partial sum’s only job be to slice data into that function. My code shows an example of this. This makes it easy to see what the partial sum is doing, so that you can tell exactly what each slice of data should look like. For example, in my code, each slice of data is some collection of arrays, with the indices of the array matched to the indices of the containers being read in as partial vectors or matrices, depending on the parameter/transformed parameter. If the dynamic scheduler from the reduce_sum wrap function grabs slice_n_resps arrays 1:3, then along with it comes guess[1:3], sds[:,1:3], sm[:,1:3], nconf[1:3], and pconf[1:3]. From there it’s just the usual calculations by the likelihood function. Since your likelihood function is working fine, it might be of benefit to restructure your code so that it’s very clear what the partial sum is doing.