Reduce_sum performance

I have a large spatiotemporal hierarchical model that takes hours to run, so I am now considering the use of reduce_sum to try to speed things up a bit. However, in my first attempt, the model that uses reduce_sum is actually slower. To try to understand things better, I’ve tested reduce_sum on a simple example. Basically, I generate 50,000 samples from a GEV distribution and then try to infer the GEV parameters from the simulated data. In this example, the model that uses reduce_sum is three times faster (1 min vs 3 min). However, note that I have 144 ‘CPUs’ available for parallel threads: 4 sockets x 18 cores per socket x 2 threads per core = 144 CPUs. I set grainsize = 1, and with this Stan seems to use about 60 CPUs when running with reduce_sum. A 3x speedup is significant, but I was expecting more given that Stan was using 60 CPUs. Is this expected? I just want to make sure I am doing things right before I go through the effort of introducing reduce_sum into my larger model.
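
For reference, such samples can be drawn with the GEV quantile function, y = mu + sigma * ((-log u)^(-xi) - 1) / xi with u ~ uniform(0,1). A minimal sketch of a generator (the gev_rng helper is just illustrative, not part of the model below, and assumes xi != 0):

functions{
    real gev_rng(real mu, real sigma, real xi){
        // inverse-CDF sampling: invert F(y) = exp(-(1 + xi*(y - mu)/sigma)^(-1/xi))
        real u = uniform_rng(0, 1);
        return mu + sigma * (pow(-log(u), -xi) - 1) / xi;
    }
}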

Stan code below:

functions{
    // GEV log-density for n observations; assumes 1 + xi * (y[i] - mu) / sigma > 0 for all i
    real gev_lpdf(real[] y, real mu, real sigma, real xi, int n){
        vector[n] z;

        for(i in 1:n){
            z[i] = (1 + xi * (y[i] - mu) / sigma)^(-1 / xi);
        }
        return -sum(log(sigma) - (1 + xi) * log(z) + z);
    }

    // partial sum over the slice ys = y[is:ie], with the signature reduce_sum expects
    real partial_sum(real[] ys, int is, int ie, real mu, real sigma, real xi){
        int ns = ie - is + 1;
        return gev_lpdf(ys | mu, sigma, xi, ns);
    }
}
data{
    int<lower=0> n;
    real y[n];
}
transformed data{
    int grainsize = 1;
}
parameters{
    real mu;
    real<lower=0> sigma;
    real xi;
}
model{
    mu ~ normal(0,2);
    sigma ~ normal(0,1);
    xi ~ normal(0,1);
        
    //y ~ gev(mu,sigma,xi,n); // no threading
    target += reduce_sum(partial_sum,y,grainsize,mu,sigma,xi);
}

What a machine! First, hyperthreading rarely buys you anything, since Stan needs a lot of cache. The fact that things are spread across many sockets is also not so good, I think.

You will probably get the same 3x speedup with far fewer CPUs. You should also try to replace the large for loop with a single vectorized expression, if you can.

A grainsize of 1 is too small for your problem. I would try 500 or so as a start.
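
To make experimenting easier, you can also pass grainsize in as data rather than fixing it in transformed data, so you can try different values without recompiling. A minimal sketch of the data block (the transformed data block then goes away):

data{
    int<lower=0> n;
    real y[n];
    int<lower=1> grainsize; // e.g. 500; set from the interface, no recompile needed
}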

Also, have a read of Amdahl’s law on Wikipedia. In some problems you can throw 1000 cores at the job and only get a 20x speedup.
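
To make that concrete: if a fraction p of the model’s work parallelizes, Amdahl’s law caps the speedup on N cores at S(N) = 1 / ((1 - p) + p/N). Even with p = 0.95, N = 1000 gives S ≈ 1 / (0.05 + 0.00095) ≈ 20.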

Thanks for your reply, wds15. I see now that choosing a sensible grainsize is important. I tried several values for the grainsize but, in this case, I can’t do better than ~1 min. And thanks also for the reference to Amdahl’s law, which makes a lot of sense.

There is no vectorized version of pow() in Stan, so I am afraid that the loop is unavoidable.

There’s a vectorized log; that loop is equivalent to

real gev_lpdf(real[] y, real mu, real sigma, real xi, int n){
    // log_z = log(z), computed directly with vectorized log1p
    vector[n] log_z = log1p(xi * (to_vector(y) - mu) / sigma) * (-1/xi);
    return -n * log(sigma) + (1 + xi) * sum(log_z) - sum(exp(log_z));
}

1 min is already very fast… you would need a larger problem to see larger speedups. Getting runs of multiple hours or days down to just a few hours or less is where I think this makes sense.

The suggestion from @nhuurre sounds very good.

Thanks, nhuurre, for the suggestion. That’s interesting, because I had already tried the exp(n*log(x)) trick in the bigger version of my model, as I discussed in this thread, and there it seemed to make the code slower than looping. Surprisingly, when I try the trick in the simple model above, the model runs in only 12 s! It’s amazing what a large effect small changes can have, and the effect can differ from model to model.