Hi!

I have put up a parallel design for review. Along with it there is a working prototype, and I created a toy example of a large Poisson log-likelihood as a bigger reduce problem, which showed promising results. Specifically, I repeatedly summed 10^4 terms: on 2 cores the speed doubled, on 3 cores it improved only slightly, and on 4 cores it did not improve at all (it even got somewhat worse). Increasing the number of terms should have improved the scaling, but it didn't!

So I started to turn my prototype inside out and improved a few things here and there until I finally found the culprit: the numerical effort is, of course, dominated by the log-Gamma function... and that function from the standard C++ library is NOT thread-safe. Instead it uses a mutex internally for the sign of the output - ouch! So I switched to the Boost implementation of the log-Gamma function, which works without a mutex and is thus lock-free.

Now, with 10 repetitions of 10^8 terms, the performance scaling is almost perfect:

```
cores = 1
[ OK ] benchmark.parallel_reduce_sum_speed (101036 ms)
cores = 2
[ OK ] benchmark.parallel_reduce_sum_speed (50921 ms)
cores = 3
[ OK ] benchmark.parallel_reduce_sum_speed (34702 ms)
cores = 4
[ OK ] benchmark.parallel_reduce_sum_speed (26710 ms)
cores = 5
[ OK ] benchmark.parallel_reduce_sum_speed (21591 ms)
cores = 6
[ OK ] benchmark.parallel_reduce_sum_speed (19170 ms)
```

Of course, the problem must scale nicely, since the likelihood is so huge, with so many terms. Over the next days I will experiment to find out how small the problem can be while still getting good scaling. All in all, I am now really optimistic that we are on the right track here.

I am sorry, I am a little too tired now for plotting... next time.

Best,

Sebastian

BTW: Can we **please** switch back to the Boost lgamma??? The std lgamma *destroys* any parallel program's performance.

FYI... the C++ functor doing the work is really lean (so this approach can turn our probability distributions into parallel super nodes easily; one could even think about automating this):

```
template <typename T>
struct count_lpdf {
  const std::vector<int>& data_;
  const T& lambda_;

  count_lpdf(const std::vector<int>& data, const T& lambda)
      : data_(data), lambda_(lambda) {}

  // Evaluate the Poisson log-pmf over the inclusive slice [start, end].
  inline T operator()(std::size_t start, std::size_t end) const {
    std::vector<int> partial_data(data_.begin() + start,
                                  data_.begin() + end + 1);
    return stan::math::poisson_lpmf(partial_data, lambda_);
  }
};
```