Reduce_sum performance

Stan 2.23.0

To check the overhead of reduce_sum, I compared the runtime of a model with 1 thread per chain,

  target += reduce_sum(partial_sum, outcome, 1, NA_int,
                       thr, eVar, cScale, family, familyData, cMeanVar,
                       siteData, sScale, site, sMeanVar,
                       srelp, aScale, twin, twinData, relq, aMeanVar);

against calling partial_sum directly,

  target += partial_sum(outcome, 1, N, NA_int,
                        thr, eVar, cScale, family, familyData, cMeanVar,
                        siteData, sScale, site, sMeanVar,
                        srelp, aScale, twin, twinData, relq, aMeanVar);

The runtime increased from 176 s for partial_sum to 243 s for reduce_sum. If I increase to 2 threads per chain, the reduce_sum version takes 198 s, but that still seems pretty poor compared to partial_sum. Maybe the benefit of threads kicks in sooner for models with larger data? The N I’m testing with is only 11,525.
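For reference, a partial-sum function for reduce_sum has this general shape (a minimal sketch with made-up argument names, not the actual signature from my model):

  functions {
    // The sliced argument comes first, then the 1-based inclusive
    // start/end indices of the slice, then all shared arguments.
    real partial_sum(real[] y_slice, int start, int end,
                     vector mu, real sigma) {
      return normal_lpdf(y_slice | mu[start:end], sigma);
    }
  }

reduce_sum supplies start and end for each slice itself, which is why the direct call has to pass 1 and N in their place.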

Parallelization always implies some overhead!

It's hard to say anything without seeing a good deal of the model.

You probably need to move more of your model into the partial-sum function, or use a larger grainsize.
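For instance, the third argument of reduce_sum (the 1 right after outcome in your call) is the grainsize, so raising it just means changing that argument:

  int grainsize = 500;  // illustrative value; tune empirically
  target += reduce_sum(partial_sum, outcome, grainsize, NA_int,
                       thr, eVar, cScale, family, familyData, cMeanVar,
                       siteData, sScale, site, sMeanVar,
                       srelp, aScale, twin, twinData, relq, aMeanVar);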

The statement you make is super general as written - that's just not quite appropriate. It depends tremendously on the details.

I have seen many cases where switching from a direct partial_sum call to reduce_sum substantially reduces the runtime.

Here’s my model, b1.stan.txt (2.8 KB). Do you need to actually run it to investigate or can you eyeball it?

How does the grainsize=1 autotuning work?

Well, a normal likelihood is super cheap to calculate, so don’t expect too much here. You need a large N and a grainsize large enough to get anywhere meaningful.

However, you should really not add the normal log-likelihood to result elementwise. Instead, create one large mean vector and make a single call to normal_lpdf.
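Roughly like this (a sketch with made-up names, not your actual model):

  // Slow: one normal_lpdf call, each with its own AD bookkeeping,
  // per observation
  for (n in 1:N)
    result += normal_lpdf(y[n] | mu[n], sigma);

  // Faster: fill the mean vector mu once, then make a single
  // vectorized call
  result += normal_lpdf(y | mu, sigma);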

And you should try to avoid all those to_* calls. These create a lot of variables on the AD stack… not good.
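For example, a toy model that sidesteps the conversion by declaring the parameter in the shape it is used (nothing here is from your model):

  data {
    int<lower=1> K;
    vector[K] y;
  }
  parameters {
    // Declaring real theta[K] and then calling to_vector(theta) in the
    // model block would copy all K parameters onto the AD stack at
    // every log-density evaluation; declaring a vector avoids that.
    vector[K] theta;
  }
  model {
    theta ~ normal(0, 1);
    y ~ normal(theta, 1);
  }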

Are you saying that to_array_1d and to_vector are costly?

Yeah, but calling reduce_sum with grainsize = N and 1 thread per chain adds 30 s to the runtime compared to calling partial_sum directly. Is that much of a slowdown expected?

Yes, for parameter arrays these functions create a copy of the array’s parameters.

More generally, performance tips like those above seem to come down to knowing how different kinds of functions handle memory under the hood. How feasible would it be to expand the Stan manual’s write-ups on performance into a collection of these kinds of things to watch out for?

Not really.

Good suggestion. Looks like our huge manual does not include performance tips…right @Bob_Carpenter?

There are some performance tips in the manual. I’m not sure you want to expand the manual in this direction, because it would likely become out of date as improvements are made to the math library and code generation.

Oh… indeed

…and the tips there are far more important than starting with reduce_sum…

That change was a big win. With 1 thread per chain, the runtime dropped from 200 s to 67 s.

I have now optimized a few models with reduce_sum in mind… and each time I found a 2x, 3x, or even 4x speedup before reduce_sum was even used.

Unfortunately, it does not reflect well on Stan that you have to know these details.

I appreciate the sympathy. Now that I have fiddled more, I doubt the repeated calls to normal_lpdf were the issue. I suspect that repeatedly recomputing residVar was killing performance.
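If that is the culprit, the generic fix is to hoist the computation out of the loop (a sketch; I am making up residVar’s definition here):

  // Slow: residVar rebuilt, and its AD subexpressions duplicated,
  // once per observation (hypothetical decomposition)
  for (n in 1:N)
    target += normal_lpdf(y[n] | mu[n], sqrt(var_e + var_a));

  // Faster: compute the residual sd once and vectorize the likelihood
  real sd_resid = sqrt(var_e + var_a);
  target += normal_lpdf(y | mu, sd_resid);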

It would be great if there were some kind of profiling diagnostics so I could see where the time was being spent. It takes so much effort to refactor a model, and the benefit is often uncertain.

What about rep_array and rep_matrix? Do these add copies to the AD stack?

Is there anyplace where we could see this?