Reduce_sum performance

Stan 2.23.0

To check the overhead of reduce_sum, I compared the runtime of a model with 1 thread per chain,

  target += reduce_sum(partial_sum, outcome, 1, NA_int,
                       thr, eVar, cScale, family, familyData, cMeanVar,
                       siteData, sScale, site, sMeanVar,
                       srelp, aScale, twin, twinData, relq, aMeanVar);

against calling partial_sum directly,

  target += partial_sum(outcome, 1, N, NA_int,
                        thr, eVar, cScale, family, familyData, cMeanVar,
                        siteData, sScale, site, sMeanVar,
                        srelp, aScale, twin, twinData, relq, aMeanVar);

The runtime increased from 176 s for partial_sum to 243 s for reduce_sum. If I increase to 2 threads per chain, the reduce_sum version takes 198 s, but that still seems pretty poor compared to partial_sum. Maybe the benefit of threads kicks in sooner for models with larger data? The N I’m testing with is only 11,525.
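For reference, a partial-sum function for reduce_sum has this general shape (a minimal sketch with made-up argument names, not the actual signature from my model):

  functions {
    // The sliced argument comes first, then the 1-based inclusive
    // start/end indices of the slice, then all shared arguments.
    real partial_sum(real[] y_slice, int start, int end,
                     vector mu, real sigma) {
      return normal_lpdf(y_slice | mu[start:end], sigma);
    }
  }

reduce_sum supplies start and end for each slice itself, which is why the direct call has to pass 1 and N in their place.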

Parallelization always implies some overhead!

It's hard to say anything without seeing a good deal of the model.

You probably need to move more of your model into the partial-sum function, or use a larger grainsize.
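For instance, the third argument of reduce_sum (the 1 right after outcome in your call) is the grainsize, so raising it just means changing that argument:

  int grainsize = 500;  // illustrative value; tune empirically
  target += reduce_sum(partial_sum, outcome, grainsize, NA_int,
                       thr, eVar, cScale, family, familyData, cMeanVar,
                       siteData, sScale, site, sMeanVar,
                       srelp, aScale, twin, twinData, relq, aMeanVar);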

The statement you make is super general as written - that's just not quite appropriate. It depends tremendously on the details.

I have seen many cases where switching from a direct partial_sum call to reduce_sum substantially reduces the runtime.

Here’s my model, b1.stan.txt (2.8 KB). Do you need to actually run it to investigate or can you eyeball it?

How does the grainsize=1 autotuning work?

Well, a normal likelihood is super cheap to calculate, so don’t expect too much here. You need a large N and a grainsize large enough to get anywhere meaningful.

However, you should really not add the normal log-likelihood to result elementwise. Instead, create one large mean vector and make a single call to normal_lpdf.
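Roughly like this (a sketch with made-up names, not your actual model):

  // Slow: one normal_lpdf call, each with its own AD bookkeeping,
  // per observation
  for (n in 1:N)
    result += normal_lpdf(y[n] | mu[n], sigma);

  // Faster: fill the mean vector mu once, then make a single
  // vectorized call
  result += normal_lpdf(y | mu, sigma);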

And you should try to avoid all those to_* calls. These create a lot of variables on the AD stack… not good.
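For example, a toy model that sidesteps the conversion by declaring the parameter in the shape it is used (nothing here is from your model):

  data {
    int<lower=1> K;
    vector[K] y;
  }
  parameters {
    // Declaring real theta[K] and then calling to_vector(theta) in the
    // model block would copy all K parameters onto the AD stack at
    // every log-density evaluation; declaring a vector avoids that.
    vector[K] theta;
  }
  model {
    theta ~ normal(0, 1);
    y ~ normal(theta, 1);
  }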

Are you saying that to_array_1d and to_vector are costly?

Yeah, but calling reduce_sum with grainsize = N and 1 thread per chain adds 30 s to the runtime compared to calling partial_sum directly. Is that much of a slowdown expected?

Yes, for parameter arrays these functions create a copy of the array’s parameters.

More generally, performance tips like those above seem to come down to knowing how different kinds of functions handle memory under the hood. How feasible would it be to expand the Stan manual’s write-ups on performance into a collection of these kinds of things to watch out for?

Not really.

Good suggestion. Looks like our huge manual does not include performance tips…right @Bob_Carpenter?

There are some performance tips in the manual. I’m not sure you want to expand the manual in this direction, because it would likely become out of date as improvements are made to the math library and code generation.

Oh… indeed

…and the tips there are far more important than starting with reduce_sum…

That change was a big win. With 1 thread per chain, the runtime dropped from 200 s to 67 s.

I have now optimized a few models with reduce_sum in mind… and each time I found a 2x, 3x, or even 4x speedup before reduce_sum was even used.

Unfortunately, it does not reflect well on Stan that you have to know these details.

I appreciate the sympathy. Now that I have fiddled more, I doubt the repeated calls to normal_lpdf were the issue. I suspect that repeatedly recomputing residVar was killing performance.
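If that is the culprit, the generic fix is to hoist the computation out of the loop (a sketch; I am making up residVar’s definition here):

  // Slow: residVar rebuilt, and its AD subexpressions duplicated,
  // once per observation (hypothetical decomposition)
  for (n in 1:N)
    target += normal_lpdf(y[n] | mu[n], sqrt(var_e + var_a));

  // Faster: compute the residual sd once and vectorize the likelihood
  real sd_resid = sqrt(var_e + var_a);
  target += normal_lpdf(y | mu, sd_resid);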

It would be great if there were some kind of profiling diagnostics so I could see where the time was being spent. It takes so much effort to refactor a model, and the benefit is often uncertain.

What about rep_array and rep_matrix? Do these add copies to the AD stack?

Is there anyplace where we could see this?