Measuring and comparing computational performance in Stan with different compilation alternatives: why does using reduce_sum not bring any advantage?

I am doing some performance checks on Stan with different compilation schemes and want to share my impressions to get some feedback. The code I am running is a Bayesian Neural Network where the likelihood is evaluated over N = 50000 observations. In this way I want to check the computational gain of using the reduce_sum parallelization provided by Stan, and also how OpenMPI behaves here.

The code I use is the following:


// ** Juan Maroñas **
// An example of Stan to perform inference in a Bayesian Neural Network

functions {
  // parallelizes the evaluation of the categorical cross entropy  
  real partial_sum(int [] y_slice, int start, int end, matrix p )
  {
    real lp     = 0;  // partial log density accumulated over this slice
    int counter = 1;  // y_slice is indexed locally (starting at 1), p globally
    for (n in start:end)
    {
      lp      += categorical_logit_lpmf( y_slice[counter] | p[n]');
      counter += 1;
    }

    return lp;
  }
}

data {


      int<lower=0> N;     // number of observations
      int<lower=2> C;     // number of logits, i.e. number of classes and input dimension (assumed equal for simplicity)
      matrix[N,C]  x;     // iid observations: x is a matrix of shape N,C containing the inputs
      int<lower=1> y[N];  // iid observations: y is an array of length N containing the class labels

      // hyperparameters of the prior p(w | mu_w, std_w), passed as data
      real          mu_w; 
      real<lower=0> std_w;

}

parameters {
      matrix[C,C] W; // the parameter we want to infer
}

model {
      // data declaration
      matrix[N,C] p;
      int grainsize = 1;

      // prior over the weights p(w)
      profile("Prior"){ 
      to_vector(W) ~ normal(mu_w,std_w); 
      }
      
      // likelihood p(t|NNet_w(x)) 
      profile("Matrix product"){
      p = x*W; // get parameters from Categorical Likelihood
      }

      // parallelize the computation
      profile("Likelihood evaluation"){
      target += reduce_sum(partial_sum, y, grainsize, p);
      }
}

I shall mention that the profiler (as expected) shows much more time spent on the likelihood evaluation, as that is a reduction over N = 50000 terms. All runs use OPENCL, so matrix computations are executed on a GTX 1080 GPU. My CPU is an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz with 4 threads (not the best one, for sure). The comparisons I make are the following (sampling just one MCMC chain):

  1. Compile with the OPENMPI and STAN_THREADS flags, so that the likelihood evaluation can be parallelized. Then I sample with threads_per_chain set to 4 vs 1, so that I can compare the effect of multithreading on the reduction operator.

  2. Same as 1) but without the OPENMPI flag.

  3. Same as 1) but using 2 threads.

What I observe is that when using 4 threads, the version compiled with OPENMPI is much faster than the one without OPENMPI (as expected). The running times are 6031.62 vs 8433.97 seconds using 10 leapfrog steps and 1000 transitions. This is not surprising.

My second observation is that using just 1 thread gives much better performance than using 4 threads (with OPENMPI obviously making no difference here). In this case it takes 2980.96 seconds.

My third observation is that using 2 threads and OPENMPI gives 2591.7 seconds, which means that using the maximum number of threads is not always best. I know that using more threads than available can harm performance due to thread scheduling, but I wasn't expecting this to happen when using the same number of threads as available CPU cores on such a huge reduction operation (~50K points).

How is it possible that using multithreading for the reduction operator takes two to three times longer than one thread? (Kind of weird.) My guesses are:

a) The implementation provided above is wrong (I think it is okay, but I am quite new to Stan so not 100% sure).

b) I have observed that the number of processes running on the GPU is just one when using 4 threads. I was expecting each thread to run its own kernels on the GPU, so perhaps the bottleneck comes from here? When I sample N chains instead of 1 chain, I can observe that each thread sampling a chain has its own CUDA process open.

c) Using fewer threads than CPU cores works better, but I was expecting that using the same number of threads as available CPU cores would be best on such a huge reduction operation (around 50K training points). Actually the profiler shows that the likelihood evaluation on a single thread takes two orders of magnitude more time than the prior evaluation or the matrix product in my Stan code, so we should expect a boost in performance from this parallelization.

I am planning to set up the environment on another machine with more than 4 threads to make another comparison, but I wanted to open this thread first to hear about your experience.

Any thoughts? Thank you

Can you slice over the rows of p here (converting it first to an array of row_vectors)? That is much more efficient for reduce_sum to handle.

The other thing to consider: How large is W as compared to N in your program?


Thank you.

I actually had that conversion to an array of row_vectors. But then I read that a matrix should be indexed column-wise for the most efficient access. Why is it then more efficient to reshape first to an array of row_vectors rather than indexing columns of the matrix directly? (I have realized my matrix is not indexed by columns, but I could change that.)

From here it seems that indexing columns of the matrix should be the choice: Stan User’s Guide, but happy to hear from you.

Regarding your second question, C = 10 in my experiments, so W is C \times C, i.e. much smaller than the reduction over N = 50000.

Maybe I forgot to say it explicitly… but you should slice over the array of row_vectors (i.e. put that in the first argument of reduce_sum).

Oh, but seeing your second point… then by all means do this first:

Pass the matrix W into reduce_sum directly, along with everything needed to calculate p inside the partial sum function.

You need to limit the number of parameters which you pass into reduce_sum as shared arguments. Data as a shared argument does not hurt performance at all.
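
A rough, untested sketch of what I mean, reusing the names from your program (the only shared parameter is then the small C x C matrix W, while x and y are data):

functions {
  // W is a shared argument and p is computed inside the partial sum,
  // only for the rows of x that belong to this slice
  real partial_sum(int [] y_slice, int start, int end, matrix x, matrix W)
  {
    real lp = 0;
    matrix[end - start + 1, cols(W)] p = x[start:end] * W;
    for (n in 1:(end - start + 1))
      lp += categorical_logit_lpmf( y_slice[n] | p[n]');
    return lp;
  }
}

and in the model block the call becomes target += reduce_sum(partial_sum, y, grainsize, x, W);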

If I understand your point correctly, if I pass W plus whatever is needed to get p as shared arguments, then I will do several matrix products between W and the elements needed to recover p for that partial sum, something like p = x[start:end]*W. However, isn't it more efficient to just do one matrix product and then access the necessary elements of the resulting matrix? I think that if p = x*W is accessed column-wise, that should not hurt performance?

So I am wondering whether the following modification would be best:

functions {
  real partial_sum(int [] y_slice, int start, int end, matrix p )
  {
    real lp     = 0;
    int counter = 1;
    for (n in start:end)
    {
      // now index by columns as stated here:
      // https://mc-stan.org/docs/2_18/stan-users-guide/indexing-efficiency-section.html
      lp      += categorical_logit_lpmf( y_slice[counter] | p[:,n]);
      counter += 1;
    }
    return lp;
  }
}


model {
      // data declaration
      matrix[C,N] p;      // C x N so that it can be indexed column-wise
      int grainsize = 1;

      // prior over the weights p(w)
      profile("Prior"){ 
      to_vector(W) ~ normal(mu_w,std_w); 
      }

      // likelihood p(t|NNet_w(x)) 
      profile("Matrix product"){
      p = (x*W)'; // get parameters from the Categorical likelihood; transpose so that it can be indexed column-wise
      }

      // parallelize the computation
      profile("Likelihood evaluation"){
      target += reduce_sum(partial_sum, y, grainsize, p);
      }
}

No. The communication cost is too heavy for that, I think.

Okay thanks, I will go through that video and post back here.

Okay, after watching the video I think I have the answer. Minute 18 shows a slide which is quite insightful.

Shared data is accessed by reference, whereas shared arguments that are parameters (like W in my example) need to be copied. So beyond the fact that a single matrix multiplication might be faster than several smaller ones, passing W to the reduce_sum function requires additional copies, which can hurt performance. However, passing p directly will only pass a reference, hence the cost is only that of accessing p. Because p is a matrix, it should then be accessed by columns. I am going to do some performance checks along these lines and post my results here. Thanks @wds15 for the video!

Data is not relevant for the performance assessment. You can usually pass data at will without hurting performance.

But for parameters it is critical what you do. Your p matrix is the product of a parameter matrix and data… which makes it a parameter.


I was about to ask whether p counts as a parameter because W is a parameter, and I just saw your comment. In that case what you say makes sense. I guess another alternative would be to slice over p rather than y, and pass y as a shared argument?
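
Something like this, maybe (just a rough, untested sketch to illustrate the idea; partial_sum_p is only an illustrative name):

functions {
  // slice over p, stored as an array of vectors (one per observation);
  // y is data, so passing it as a shared argument should be cheap
  real partial_sum_p(vector[] p_slice, int start, int end, int [] y)
  {
    real lp = 0;
    for (n in start:end)
      lp += categorical_logit_lpmf( y[n] | p_slice[n - start + 1]);
    return lp;
  }
}

model {
      vector[C] p[N];
      matrix[N,C] logits;
      int grainsize = 1;

      to_vector(W) ~ normal(mu_w, std_w);

      logits = x*W;           // single matrix product, which can still go to the GPU
      for (n in 1:N)
        p[n] = logits[n]';    // copy each row into the array that gets sliced

      target += reduce_sum(partial_sum_p, p, grainsize, y);
}

Although I guess the loop converting the matrix into an array of vectors adds its own overhead, so I am not sure it pays off.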

Given the size differences you describe, you should really pass in W and compute p (the bits needed) inside the partial sum function.

But yeah, slicing parameters instead of data is always much better.

Yeah, I will definitely try it and compare the time performance. I really want to avoid doing many small matrix products, since I am using a GPU for the matrix product, so I think I can benefit on that side from doing a single matrix product on the GPU.

Uh… then probably stop using reduce_sum and instead use the categorical GLM function that exists (which you should use anyway). Don't bother with reduce_sum then… but matrix products are usually cheap.


Oh nice, I did not see that yesterday when looking at the reference manual. I just looked for the categorical likelihood, but this is actually what I need: 13.6 Categorical logit generalized linear model (softmax regression) | Stan Functions Reference

I guess that even in the case that I don't use a GPU, this will be faster than reduce_sum, because the matrix products in Eigen are heavily optimized, making the product really fast.

Anyway, I will run the different time comparisons over the different possibilities. Thanks so much!
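
For reference, here is a minimal, untested sketch of what the model block would reduce to with that function. I am assuming the categorical_logit_glm_lpmf(y | x, alpha, beta) signature from the Functions Reference, and since my model has no intercept I just pass a zero vector for alpha:

model {
      // prior p(w)
      to_vector(W) ~ normal(mu_w, std_w);

      // softmax regression likelihood in a single call, no reduce_sum needed;
      // rep_vector(0, C) stands in for the intercept my model does not have
      target += categorical_logit_glm_lpmf(y | x, rep_vector(0, C), W);
}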

Nice. Then maybe a comparison of

  1. just one CPU only
  2. reduce_sum
  3. GPU only
  4. combined reduce_sum & GPU

would be interesting.

Yep, I will post that plus the different Stan programs used. I will also investigate cache effects from column vs row indexing, and slicing over p rather than y.

Okay, so I finally had some time to do this.

I have tested 13 different Stan implementations of the same thing (some of which I knew were going to be inefficient based on our discussion, but I wanted to know by how much).

I have tested multithreading with and without OPENMPI, and 1 thread vs 10 threads. The comparison also includes CPU vs GPU.

Surprisingly (and I guess it is due to memory transfer) the ranking of all the tests I did over the 13 models is as follows:

  1. Pass W and X as shared arguments to reduce_sum and slice over Y; then use the categorical GLM function, indexing columns of the matrix X and transposing, running on CPU with 10 threads and OPENMPI.

  2. Same as above but indexing rows of X directly.

  3./4. Same as 1 and 2 but without OPENMPI (the difference between using and not using OPENMPI was not very large).

  5. Running a single thread (i.e. no OPENMPI and no STAN_THREADS) and using GPU computation over the categorical GLM directly (as suggested by you).

  6. Same as 1 and 2 but the computation is run on the GPU, i.e. each thread from reduce_sum evaluates the categorical GLM on the GPU.

The dataset had C = 10 classes and 10 features, and a total of 55K datapoints was used to evaluate the likelihood. I used an Nvidia RTX 2080 Ti GPU. The times were computed by running the program 5 times and averaging the results.

Not sure whether it would be interesting to have all the code and the Excel files with the runtimes on GitHub, or somewhere else where Stan users could check them?

What is the result?

How about a ranked list? Fastest to slowest, with a short description of each configuration.

Having all code and output somewhere accessible would be nice, sure.

Yes, sorry. 1) is the fastest and 6) is the slowest.

I will upload everything to GitHub when I have a bit of time. For the moment I am attaching the CSV file with the time comparisons: time_comparison.csv (3.8 KB)