When working with `reduce_sum`, any arguments that aren't sliced get copied in full to every core/thread. This copy cost is much higher for parameters than for data, since for parameters both the value and the gradient need to be copied.
When calling `array[N] real Xb = to_array_1d(X * beta); // linear predictors`, all N values of `Xb` become parameters. It will probably be more efficient to pass `X` and `beta` to `reduce_sum` as separate arguments, so that only the M entries of `beta` are parameters.
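As a rough sketch of what that looks like (names like `partial_sum` and `grainsize` are placeholders, and I'm assuming a plain normal likelihood here): `X` is data and `beta` is the only parameter argument, so each thread copies just M parameter values plus `sigma`, not N linear predictors.

```stan
functions {
  // y_slice is the sliced argument; X and beta are passed separately,
  // so only beta (M values) and sigma are copied as parameters
  real partial_sum(array[] real y_slice, int start, int end,
                   matrix X, vector beta, real sigma) {
    return normal_lpdf(y_slice | X[start:end] * beta, sigma);
  }
}
...
model {
  target += reduce_sum(partial_sum, y, grainsize, X, beta, sigma);
}
```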
And to second @jsocolar's point, `reduce_sum` needs the complexity of the likelihood to outweigh the overhead of parallelism. If you have access to a GPU, you'll likely get much better performance by using `normal_id_glm` with OpenCL.
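For reference, the GLM version replaces the whole `X * beta` construction with a single fused call (assuming an intercept `alpha` in the model; drop it or set it to 0 if you don't have one). With a cmdstan build compiled with `STAN_OPENCL`, this call can be offloaded to the GPU:

```stan
model {
  // fused linear predictor + normal likelihood, OpenCL-capable
  y ~ normal_id_glm(X, alpha, beta, sigma);
}
```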