I am doing some performance checks on Stan with different compilation schemes and want to share my impressions for some feedback. The code I am running is a Bayesian neural network where the likelihood is evaluated over N = 50000 observations. In this way I want to check the computational gain of using the reduce_sum parallelization provided by Stan, and also how OpenMPI works here.
The code I use is the following:
// ** Juan Maroñas **
// An example of Stan to perform inference in a Bayesian Neural Network
functions {
// parallelizes the evaluation of the categorical cross entropy
  real partial_sum(int[] y_slice, int start, int end, matrix p)
  {
    real lp = 0;     // accumulated log density over this slice
    int counter = 1; // local index into y_slice (which starts at 1 for every slice)
    for (n in start:end)
    {
      lp += categorical_logit_lpmf(y_slice[counter] | p[n]');
      counter += 1;
    }
    return lp;
  }
}
data {
  int<lower=0> N;    // number of draws from the model
  int<lower=2> C;    // number of logits, i.e. number of classes and input dimension (assumed equal for simplicity)
  matrix[N,C] x;     // iid observations: X is a matrix of shape N,C containing the inputs
  int<lower=1> y[N]; // iid observations: Y is an array of length N containing the class labels
  // definition of the prior p(w|0,I), passed as arguments
  real mu_w;
  real<lower=0> std_w;
}
parameters {
  matrix[C,C] W; // the parameter we want to infer
}
model {
  // local declarations
  matrix[N,C] p;
  int grainsize = 1;
  // prior over the weights p(w)
  profile("Prior") {
    to_vector(W) ~ normal(mu_w, std_w);
  }
  // likelihood p(t|NNet_w(x))
  profile("Matrix product") {
    p = x * W; // get the parameters of the categorical likelihood
  }
  // parallelize the computation
  profile("Likelihood evaluation") {
    target += reduce_sum(partial_sum, y, grainsize, p);
  }
}
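Since the slicing convention confused me at first, here is a small plain-Python mock (my own sketch, not Stan's actual implementation) of what reduce_sum does with partial_sum: y is cut into contiguous slices, each call receives the slice plus its global start/end indices, and y_slice is indexed locally (hence the counter) while p is indexed globally.

```python
import math

def categorical_logit_lpmf(y, logits):
    # log softmax probability of class y (1-indexed, as in Stan)
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[y - 1] - log_norm

def partial_sum(y_slice, start, end, p):
    # mirrors the Stan partial_sum: y_slice is indexed locally,
    # while p is indexed with the global observation index n
    total = 0.0
    counter = 0  # Python is 0-indexed; the Stan counter starts at 1
    for n in range(start, end + 1):
        total += categorical_logit_lpmf(y_slice[counter], p[n - 1])
        counter += 1
    return total

def reduce_sum(f, y, grainsize, p):
    # naive *serial* stand-in for Stan's reduce_sum: split y into
    # contiguous slices of roughly `grainsize` and sum the partial results
    total = 0.0
    for start in range(1, len(y) + 1, grainsize):
        end = min(start + grainsize - 1, len(y))
        total += f(y[start - 1:end], start, end, p)
    return total
```

Whatever grainsize is used, the summed result matches the full log-likelihood, which is the invariant the Stan code above must satisfy; in the real reduce_sum the slices are evaluated in parallel, and grainsize = 1 leaves the chunking to the scheduler.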
I should mention that the profiler (as expected) shows much more time spent on the likelihood evaluation, as that is a reduction over a vector of N = 50000 samples. All runs use OpenCL, so computations are run on a GTX 1080 GPU. My CPU is an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz with 4 threads (2 cores / 4 hardware threads; not the best one, for sure), and the comparisons I do are the following (sampling just one MCMC chain):

1. Compile with the OPENMPI and STAN_THREADS flags, so that the likelihood evaluation can be parallelized. Then sample with the option threads_per_chain set to 4 vs 1, so that I can compare the effect of multithreading on the reduction operator.
2. Same as 1) but not compiling with OPENMPI.
3. Same as 1) but using 2 threads.
What I observe is that, when using 4 threads, the version compiled with OPENMPI is much faster than the one without OPENMPI (as expected). The running times are 6031.62 vs 8433.97 seconds using 10 leapfrog steps and 1000 transitions. This is not surprising.
My second observation is that using only 1 thread gives much better performance than using 4 threads (with OPENMPI making no difference, obviously). In this case it takes 2980.96 seconds.
My third observation is that using 2 threads and OPENMPI gives 2591.7 seconds, which means that using the maximum number of threads is not always the best. (I know that using more threads than available can harm performance due to thread scheduling, but I wasn't expecting this to happen when using the same number of threads as available CPU cores on such a huge reduction operation, ~50K points.)
How is it possible that using multithreading for the reduction operator takes two to three times longer than one thread? (kind of weird). My guesses are:
a) The implementation provided above is wrong (I think it is okay, but I am quite new to Stan, so not 100% sure).
b) I have observed that only one process runs on the GPU when using 4 threads. I was expecting each thread to run its own kernels on the GPU, so perhaps the bottleneck comes from here? When I sample N chains instead of 1 chain, I can observe that each thread sampling a chain has its own CUDA process open.
c) Using fewer threads than CPU cores works better, but I was expecting that using the same number of threads as available CPU cores should be best on such a huge reduction operation (around 50K training points). Actually, the profiler shows that the likelihood evaluation on a single thread takes two orders of magnitude more time than the prior evaluation or the matrix product in my Stan code, so we should expect a boost in performance from this parallelization.
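One way I can picture the slowdown is with a back-of-the-envelope cost model (purely illustrative numbers, not measurements, and not Stan's actual scheduler): if each slice pays a fixed dispatch/bookkeeping overhead on top of its per-observation work, tiny slices make the overhead dominate no matter how many threads there are.

```python
import math

def parallel_time(n, grainsize, threads, per_item_cost, per_slice_overhead):
    # crude model: the work is split into ceil(n / grainsize) slices;
    # each slice costs its items plus a fixed per-slice overhead,
    # and slices are spread evenly over the threads
    slices = math.ceil(n / grainsize)
    slice_cost = grainsize * per_item_cost + per_slice_overhead
    return math.ceil(slices / threads) * slice_cost

# made-up costs: 1 microsecond per observation, 50 microseconds of
# overhead per slice (hypothetical values, chosen only for illustration)
tiny   = parallel_time(50_000, grainsize=1,    threads=4,
                       per_item_cost=1e-6, per_slice_overhead=50e-6)
large  = parallel_time(50_000, grainsize=1000, threads=4,
                       per_item_cost=1e-6, per_slice_overhead=50e-6)
serial = 50_000 * 1e-6
# with literal 1-element slices the 4-thread run is slower than serial,
# while 1000-element slices are much faster than serial
```

To be clear, in Stan grainsize = 1 asks the scheduler to pick the chunk sizes automatically rather than literally using 1-element slices, so this toy does not describe reduce_sum's real behavior; it only shows why the chunk-size vs. overhead trade-off matters and why explicitly tuning grainsize might be worth trying here.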
I am planning to set up the environment on another machine with more than 4 threads to make another comparison, but I wanted to open the discussion here first to hear back from your experience.
Any thoughts? Thank you