Stan 2.23.0

To check the overhead of `reduce_sum`, I compared the runtime of a model with 1 thread per chain,

```
target += reduce_sum(partial_sum, outcome, 1, NA_int,
thr, eVar, cScale, family, familyData, cMeanVar,
siteData, sScale, site, sMeanVar,
srelp, aScale, twin, twinData, relq, aMeanVar);
```

against calling `partial_sum` directly,

```
target += partial_sum(outcome, 1, N, NA_int,
thr, eVar, cScale, family, familyData, cMeanVar,
siteData, sScale, site, sMeanVar,
srelp, aScale, twin, twinData, relq, aMeanVar);
```

The runtime increased from 176 s for `partial_sum` to 243 s for `reduce_sum`. If I increase to 2 threads per chain, then the `reduce_sum` version only takes 198 s. However, that still seems pretty poor compared to `partial_sum`. Maybe the benefit of threads kicks in sooner for models with larger data? The N I’m testing with is only 11525.
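
To put those timings in relative terms (these are just ratios of the three numbers above, from single runs, not new measurements):

$$
\frac{243}{176} \approx 1.38, \qquad \frac{243}{198} \approx 1.23, \qquad \frac{198}{176} = 1.125
$$

So with 1 thread `reduce_sum` costs roughly 38% extra, the second thread gives about a 1.23x speedup over the 1-thread `reduce_sum` run, and the 2-thread run is still about 12.5% slower than calling `partial_sum` directly.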