Stan 2.23.0
To check the overhead of reduce_sum, I compared the runtime of a model with 1 thread per chain,
target += reduce_sum(partial_sum, outcome, 1, NA_int,
thr, eVar, cScale, family, familyData, cMeanVar,
siteData, sScale, site, sMeanVar,
srelp, aScale, twin, twinData, relq, aMeanVar);
against calling partial_sum directly,
target += partial_sum(outcome, 1, N, NA_int,
thr, eVar, cScale, family, familyData, cMeanVar,
siteData, sScale, site, sMeanVar,
srelp, aScale, twin, twinData, relq, aMeanVar);
The runtime increased from 176 s for partial_sum to 243 s for reduce_sum. If I increase to 2 threads per chain, the reduce_sum version takes only 198 s. However, that still seems pretty poor compared to partial_sum. Maybe the benefit of threads kicks in sooner for models with larger data? The N I'm testing with is only 11525.
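To put those timings in relative terms, here is a quick sketch in Python. It is just arithmetic on the three numbers above. The variable names are mine, and "efficiency" here simply means speedup divided by thread count, which is an assumption about how to summarize this, not something Stan reports.

```python
# Wall-clock timings reported above, in seconds.
direct_s = 176.0      # partial_sum called directly, 1 thread per chain
reduce_1t_s = 243.0   # reduce_sum, 1 thread per chain
reduce_2t_s = 198.0   # reduce_sum, 2 threads per chain

# Extra runtime of reduce_sum relative to the direct call.
overhead_1t = reduce_1t_s / direct_s - 1.0
overhead_2t = reduce_2t_s / direct_s - 1.0

# Speedup of reduce_sum from going from 1 to 2 threads, and the
# fraction of the ideal 2x that this achieves (assumed definition).
speedup_2t = reduce_1t_s / reduce_2t_s
efficiency_2t = speedup_2t / 2

print(f"reduce_sum overhead, 1 thread:  {overhead_1t:.1%}")
print(f"reduce_sum overhead, 2 threads: {overhead_2t:.1%}")
print(f"speedup from 2 threads:         {speedup_2t:.2f}x")
print(f"parallel efficiency, 2 threads: {efficiency_2t:.0%}")
```

So reduce_sum costs about 38% extra on a single thread, and with 2 threads it is still about 12.5% slower than the plain partial_sum call, since the second thread only gives roughly a 1.23x speedup.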