Hi!

I have been working on a parallel map and parallel reduce functions in Stan. This is based on the Intel TBB (so no more sharding) and the parallel reduce allows to combine vectorization with parallelization - and I am seeing very promising results with my prototype I have, so I think it’s about time to start to think about how this could look like function signature wise.

Currently I am envisioning for the reduce (summation reduction only) a C++ signature:

```
template <class InputIt, class T, class BinaryFunction>
constexpr T parallel_reduce_sum(InputIt first, InputIt last, T init, BinaryFunction f);
```

So you give it an arbitrary input iterator sequence, the initial element to which the sum is initialized to and a binary function f. The binary function will be called with sub-slices `f(*slice_start, *slice_end)`

and the function is then expected to return the sum reduction in the sub-range as represented by the iterators.

The idea is to glue things together using C++ lambda’s. So this is how it could look like being using in C++ code:

```
typedef boost::counting_iterator<std::size_t> count_iter;
std::vector<int> data(10000, 4);
var lambda = 6.0;
var poisson_lpdf = stan::math::parallel_reduce_sum(
count_iter(0), count_iter(10000), var(0.0),
[&](std::size_t start, std::size_t end) {
const std::vector<int> partial_data(data.begin() + start,
data.begin() + end);
return stan::math::poisson_lpmf(partial_data, lambda);
});
```

I think that is very neat and useful. The user can make references to any objects in the environment and plugging things together is very easy.

The question now is how can we expose this to the Stan language and do we have to wait for Stan3 for this type of stuff.

One possibility could be

```
functions {
real reducer(int start, int end, real[] data, real lambda) {
return poisson_lpmf(data[start:end] | lambda);
}
}
...
model {
target += parallel_reduce(reducer, data, lambda);
}
```

So the Stan function parallel_reduce would take as first argument the function doing the reduction. The reducer function then gets passed in as first to arguments the starting and end index for the slice to be reduced. Followed by that would be an arbitrary number of additional arguments which get passed on to the reducer function (as const refs, of course).

Does that make sense to people? Can we pull this off with the current parser or should we wait for Stan3? I would hope that we can get this in without too much effort in the current parser. Though, I have to say that getting this to work will require a bit of refactoring at the backend. However, I have coded up a prototype which reduces 10^4 data items of a poisson likelihood and this approach can double the evaluation speed of such a loop! So modestly large models get speed up by this already and large models will get a scalable program as we combine vectorization with parallelization.

Needless to say that all of this will be some work to pull off… and I am happily taking any hand I can take here.

Cheers,

Sebastian

For completeness, the parallel map would look like this in C++:

```
template <class InputIt, class UnaryFunction>
constexpr auto parallel_map(InputIt first, InputIt last, UnaryFunction f);
```

The function then returns a `std::vector`

and each element of that sequence stores a function evaluation of `f(*elem)`

where `elem`

is an iterator between `first`

and `last`

. With the logic above this basically enables a much easier to use variant of `map_rect`

, because no more data packing or unpacking is needed at all. This should allow a much bigger user-base to use parallelization as it avoids index wars for the user.