Reduce_sum performance

I have a large spatiotemporal hierarchical model that takes hours to run, so I am now considering the use of reduce_sum to try to speed things up a bit. However, in my first attempt, the model that uses reduce_sum is actually slower. To try to understand things better, I’ve tested reduce_sum on a simple example. Basically, I generate 50,000 samples from a GEV distribution and then try to infer the GEV parameters from the simulated data. In this example, the model that uses reduce_sum is three times faster (1 min vs 3 min). However, note that I have 144 ‘CPUs’ available for parallel threads: 4 sockets x 18 cores per socket x 2 threads per core = 144 CPUs. I set grainsize = 1, and with this Stan seems to use about 60 CPUs when running with reduce_sum. A 3x speedup is significant, but I was expecting more given that Stan was using 60 CPUs. Is this expected? I just want to make sure I am doing things right before I go through the effort of introducing reduce_sum into my larger model.
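
For reference, such samples can be drawn with the GEV quantile function, y = mu + sigma * ((-log u)^(-xi) - 1) / xi with u ~ uniform(0,1). A minimal sketch of a generator (the gev_rng helper is just illustrative, not part of the model below, and assumes xi != 0):

functions{
    real gev_rng(real mu, real sigma, real xi){
        // inverse-CDF sampling: invert F(y) = exp(-(1 + xi*(y - mu)/sigma)^(-1/xi))
        real u = uniform_rng(0, 1);
        return mu + sigma * (pow(-log(u), -xi) - 1) / xi;
    }
}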

Stan code below:

functions{
    // GEV log-density for n observations; assumes 1 + xi * (y[i] - mu) / sigma > 0 for all i
    real gev_lpdf(real[] y, real mu, real sigma, real xi, int n){
        vector[n] z;

        for(i in 1:n){
            z[i] = (1 + xi * (y[i] - mu) / sigma)^(-1 / xi);
        }
        return -sum(log(sigma) - (1 + xi) * log(z) + z);
    }

    // partial sum over the slice ys = y[is:ie], with the signature reduce_sum expects
    real partial_sum(real[] ys, int is, int ie, real mu, real sigma, real xi){
        int ns = ie - is + 1;
        return gev_lpdf(ys | mu, sigma, xi, ns);
    }
}
data{
    int<lower=0> n;
    real y[n];
}
transformed data{
    int grainsize = 1;
}
parameters{
    real mu;
    real<lower=0> sigma;
    real xi;
}
model{
    mu ~ normal(0,2);
    sigma ~ normal(0,1);
    xi ~ normal(0,1);
        
    //y ~ gev(mu,sigma,xi,n); // no threading
    target += reduce_sum(partial_sum,y,grainsize,mu,sigma,xi);
}

What a machine! First, hyperthreading rarely buys you anything, since Stan needs a lot of cache. The fact that things are spread across many sockets is also not so good, I think.

You will probably get the same 3x speedup with far fewer CPUs. You should also try to replace the large for loop with a single vectorized expression, if you can.

A grainsize of 1 is too small for your problem. I would try 500 or so as a start.
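
To make experimenting easier, you can also pass grainsize in as data rather than fixing it in transformed data, so you can try different values without recompiling. A minimal sketch of the data block (the transformed data block then goes away):

data{
    int<lower=0> n;
    real y[n];
    int<lower=1> grainsize; // e.g. 500; set from the interface, no recompile needed
}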

Also, have a read of Amdahl’s law on Wikipedia. In some problems you can throw 1000 cores at the job and only get a 20x speedup.
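
To make that concrete: if a fraction p of the model’s work parallelizes, Amdahl’s law caps the speedup on N cores at S(N) = 1 / ((1 - p) + p/N). Even with p = 0.95, N = 1000 gives S ≈ 1 / (0.05 + 0.00095) ≈ 20.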

Thanks for your reply, wds15. I see now that choosing a sensible grainsize is important. I tried several values for the grainsize but, in this case, I can’t do better than ~1 min. And thanks also for the reference to Amdahl’s law, which makes a lot of sense.

There is no vectorized version of pow() in Stan, so I am afraid that the loop is unavoidable.

There’s a vectorized log; that loop is equivalent to

real gev_lpdf(real[] y, real mu, real sigma, real xi, int n){
    // log_z = log(z), computed directly with vectorized log1p
    vector[n] log_z = log1p(xi * (to_vector(y) - mu) / sigma) * (-1/xi);
    return -n * log(sigma) + (1 + xi) * sum(log_z) - sum(exp(log_z));
}

1 min is already very fast… you would need a larger problem to see larger speedups. Getting runs of multiple hours or days down to just a few hours or less is where I think this makes sense.

The suggestion from @nhuurre sounds very good.

Thanks, nhuurre, for the suggestion. That’s interesting, because I had already tried the exp(n*log(x)) trick in the bigger version of my model, as I discussed in this thread, and there it seemed to make the code slower than looping. Surprisingly, when I try the trick in the simple model above, the model runs in only 12 s! It’s amazing what a large effect small changes can have, and the effect can differ from model to model.