I have a large spatiotemporal hierarchical model that takes hours to run, so I'm considering using reduce_sum to speed things up. However, in my first attempt, the model with reduce_sum was actually slower. To understand things better, I tested reduce_sum on a simple example: I generate 50,000 samples from a GEV distribution and then infer the GEV parameters from the simulated data. In this example, the model with reduce_sum is three times faster (1 min vs. 3 min). However, I have about 144 "CPUs" available for parallel threads: 4 sockets × 18 cores per socket × 2 threads per core = 144. With grainsize = 1, Stan appears to use about 60 of them while sampling with reduce_sum. A 3× speedup is significant, but I was expecting more given that 60 threads were active. Is this expected? I just want to make sure I'm doing things right before going through the effort of introducing reduce_sum into my larger model.
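For context, here is how I set up threading (a sketch assuming CmdStan is used directly; the paths, model name, and thread count are illustrative, and the details differ for cmdstanr/cmdstanpy):

```shell
# Compile-time: enable threading support before building the model.
echo "STAN_THREADS=true" >> make/local
make path/to/gev_model

# Run-time: threads available to each chain (60 is illustrative).
export STAN_NUM_THREADS=60
./path/to/gev_model sample data file=gev_data.json
```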

Stan code below:

```
functions {
  // GEV log density for an array of observations; n is the array length.
  real gev_lpdf(real[] y, real mu, real sigma, real xi, int n) {
    vector[n] z;
    for (i in 1:n) {
      // requires 1 + xi * (y[i] - mu) / sigma > 0 (the GEV support)
      z[i] = (1 + xi * (y[i] - mu) / sigma)^(-1 / xi);
    }
    // log f(y) = -log(sigma) + (1 + xi) * log(z) - z, summed over observations
    return -sum(log(sigma) - (1 + xi) * log(z) + z);
  }

  // Partial sum over the slice ys = y[is:ie], in the form reduce_sum expects.
  real partial_sum(real[] ys, int is, int ie, real mu, real sigma, real xi) {
    int ns = ie - is + 1; // length of this slice
    return gev_lpdf(ys | mu, sigma, xi, ns);
  }
}
data {
  int<lower=0> n;
  real y[n];
}
transformed data {
  int grainsize = 1; // let the scheduler choose the slice sizes
}
parameters {
  real mu;
  real<lower=0> sigma;
  real xi;
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 1);
  xi ~ normal(0, 1);
  // y ~ gev(mu, sigma, xi, n); // serial version, no threading
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma, xi);
}
```
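As a sanity check that the hand-coded GEV log density matches a reference implementation, here is a small Python sketch (not part of the Stan model) comparing it against scipy's `genextreme`; note that scipy's shape parameter is c = -xi relative to the parameterization above, and the parameter values are illustrative:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(1)
mu, sigma, xi = 0.5, 1.2, 0.2  # illustrative parameter values

# Simulate GEV data (scipy uses shape c = -xi in this parameterization).
y = genextreme.rvs(-xi, loc=mu, scale=sigma, size=1000, random_state=rng)

# Hand-coded log density, mirroring gev_lpdf in the Stan model:
# z = (1 + xi*(y - mu)/sigma)^(-1/xi), log f = -log(sigma) + (1+xi)*log(z) - z
z = (1 + xi * (y - mu) / sigma) ** (-1 / xi)
lp_manual = np.sum(-np.log(sigma) + (1 + xi) * np.log(z) - z)

# Reference log density from scipy.
lp_scipy = genextreme.logpdf(y, -xi, loc=mu, scale=sigma).sum()

print(np.isclose(lp_manual, lp_scipy))  # the two should agree
```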