I’m making a new thread following up on: Parallel autodiff v3

In that thread we designed and implemented a `reduce_sum` function with the signature:

```
real reduce_sum(F func, T[] sliced_arg, int grainsize, T1 arg1, T2 arg2, ...)
```

where `func` is a function with the signature:

```
real func(int start, int end, T[] sliced_arg, T1 arg1, T2 arg2, ...)
```

`reduce_sum`

implements the functionality defined by:

```
int N = size(sliced_arg);
real sum = func(1, N, sliced_arg, arg1, arg2, ...);
```

in parallel, by assuming that the work done by `func` can be equivalently broken up into pieces:

```
int N = size(sliced_arg);
real sum = func(1, M, sliced_arg[1:M], arg1, arg2, ...) +
           func(M + 1, N, sliced_arg[M + 1:N], arg1, arg2, ...);
```

where `1 <= M < N` (and this splitting can be applied recursively).
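To make the splitting rule concrete, here is a toy sequential sketch in plain Python. The names `reduce_sum`, `func`, and `grainsize` mirror the Stan signature above; the real implementation is the C++ in the math branch, which evaluates the pieces in parallel rather than recursing sequentially like this:

```python
# Toy illustration of the reduce_sum splitting rule (sequential, not the
# actual C++ implementation, which runs the halves in parallel).
def reduce_sum(func, sliced_arg, grainsize, *args):
    def recurse(start, end, chunk):
        n = end - start + 1
        if n <= grainsize:
            # Base case: evaluate the partial sum directly.
            return func(start, end, chunk, *args)
        m = n // 2  # split point M, with 1 <= M < N
        return (recurse(start, start + m - 1, chunk[:m])
                + recurse(start + m, end, chunk[m:]))
    return recurse(1, len(sliced_arg), sliced_arg)

# Any func satisfying the additivity assumption gives the same answer as
# the single serial call; here func just sums its slice of the data.
def partial_sum(start, end, sliced, *args):
    return sum(sliced)

data = list(range(1, 11))
assert reduce_sum(partial_sum, data, 3) == partial_sum(1, 10, data)  # 55
```

The point of the additivity assumption is exactly that the recursion above can pick any split point (and split recursively) and still get the same sum as the single serial call.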

There’s currently a math branch that implements this in C++: https://github.com/stan-dev/math/pull/1616

And there's a branch of stanc3 that handles the language bits (the Linux binary is available here: https://github.com/stan-dev/stanc3/pull/451#issuecomment-584347523).

Edit: @rok_cesnovar made a branch of cmdstan that has the Linux binaries included: https://github.com/stan-dev/cmdstan/tree/parallel_reduce_sum (Parallel autodiff v4)

Alright, the actual thing I wanted to get to: I have a model that scales linearly when I run separate processes in parallel, but with threading I only get about a 2x speedup.

Here is an example model with data:

nbme1.stan (1.1 KB) nbme_small.data.R (2.6 MB)

You can run this model with:

```
./nbme1 sample data file=nbme_small.data.R
```

Here is a threaded version for use with the new Math branch and the new stanc3 compiler: nbme6.stan (1.7 KB)

You can run this model with:

```
STAN_NUM_THREADS=4 ./nbme6 sample data file=nbme_small.data.R
```

where `STAN_NUM_THREADS` sets the number of threads.

The non-reduce_sum version of the model takes about 18 seconds to run; the single-threaded reduce_sum version takes about 22 seconds.

That doesn’t bother me so much, but the scaling for the threaded version seems quite bad.

With four threads, the reduce_sum model takes about 11 seconds, so four chains run one after another would take around 44 seconds. However, if I run four chains of the plain model in parallel I get linear speedup (they still finish in ~18 seconds).
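For concreteness, the back-of-the-envelope arithmetic behind those numbers (all timings approximate):

```python
# Scaling arithmetic from the approximate timings above.
serial_s = 22.0    # reduce_sum model, 1 thread
threaded_s = 11.0  # reduce_sum model, 4 threads
threads = 4

speedup = serial_s / threaded_s        # 2.0x, versus the ideal 4x
efficiency = speedup / threads         # 0.5, i.e. 50% parallel efficiency
four_chains_threaded = 4 * threaded_s  # ~44 s if chains run back to back
four_chains_process = 18.0             # ~18 s: four plain chains in parallel
```

So the threaded version is leaving roughly half of the available speedup on the table relative to the process-parallel baseline.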

You can compare against running four chains with a simple run script; copy-paste this into something like run.sh:

```
./nbme1 sample data file=nbme_small.data.R output file=output1.csv
./nbme1 sample data file=nbme_small.data.R output file=output2.csv
./nbme1 sample data file=nbme_small.data.R output file=output3.csv
./nbme1 sample data file=nbme_small.data.R output file=output4.csv
```

And then do:

```
cat run.sh | parallel -j 4
```

To run four chains in parallel.

You can make this model harder or easier by editing the `N_subset` variable in nbme_small.data.R. As I uploaded it, it only uses 2000 data points; you can push that up to 40000.