I still haven’t been able to get this to be efficient on any of the problems that I coded up even as I make the parallel work very large.
I thought there might be a performance problem with the thread-local variables, but that is not the case: I did some tests and convinced myself that isn't the issue.
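For reference, here is a minimal sketch of the kind of micro-benchmark that can rule out thread-local overhead: time repeated increments of a `thread_local` counter against a plain global. The names and structure here are illustrative (not from stan-math); `volatile` keeps the compiler from collapsing the loops.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Counters to increment; volatile forces a real load/store per iteration
// so the loop isn't folded into a single addition.
thread_local volatile std::uint64_t tl_counter = 0;
volatile std::uint64_t plain_counter = 0;

// Time iters increments of a counter and return nanoseconds per increment.
template <typename Counter>
double ns_per_increment(Counter& c, std::size_t iters) {
  auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iters; ++i) {
    c = c + 1;
  }
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::nano>(stop - start).count()
         / static_cast<double>(iters);
}
```

On mainstream compilers both loops come out to roughly a load, add, and store per iteration (thread-local access adds at most a TLS-base lookup), which is consistent with thread-local variables not being the bottleneck here.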
I made a little change to how the deep copies work. There was some overhead in allocating std::vectors that was a bit weird (https://github.com/stan-dev/math/pull/1616/commits/7a1a27b589debf44d42e6144bb614d0a97bd286b).
I coded up a benchmark in test but couldn't really get any weird behavior out of it. The idea is that there are three parameters to the calculation:
- The total number of parallel calculations
- The grainsize
- The size of the work
And I tried to characterize performance with this benchmark.
I think I need to add another: the number of parameters. Maybe that's what I'll do next. Just wanted to post an update and say I'm still not sure what's going on.
#include <stan/math/prim/core.hpp>
#include <stan/math.hpp>
#include <gtest/gtest.h>
#include <omp.h>  // omp_get_wtime; compile with OpenMP enabled
#include <algorithm>
#include <iostream>  // std::cout
#include <sstream>
#include <tuple>
#include <vector>
std::ostream* msgs = nullptr;
template <typename T>
struct count_lpdf {
  count_lpdf() {}

  // Per-element work: each element of sub_slice contributes
  // e * (lambda + lambda^2 + ... + lambda^N), so N controls the
  // amount of (autodiff) work per element.
  inline T operator()(std::size_t start, std::size_t end,
                      const std::vector<int>& sub_slice, std::ostream* msgs,
                      const T& lambda, int N) const {
    using stan::math::var;
    var sum = 0.0;
    for (std::size_t j = start; j < end; ++j) {
      var lambda_mult = sub_slice[j - start] * lambda;
      for (int i = 0; i < N; ++i) {
        sum += lambda_mult;
        lambda_mult *= lambda;
      }
    }
    return sum;
  }
};
TEST(v3_reduce_sum_benchmarks, reduce_sum_small) {
  using stan::math::var;
  stan::math::init_threadpool_tbb();

  std::vector<int> datasizes = {1024, 4096, 16384};
  std::vector<size_t> grainsizes = {8, 16, 32, 64, 128, 256};
  std::vector<int> worksizes = {8, 16, 32, 64, 128, 256};

  std::cout << "which_parallel, datasize, grainsize, worksize, time" << std::endl;
  for (auto datasize : datasizes) {
    for (auto grainsize : grainsizes) {
      for (auto worksize : worksizes) {
        std::vector<int> data(datasize, 1);
        var lambda_v = 0.5;

        double time = omp_get_wtime();
        var poisson_lpdf = 0.0;
        for (int i = 0; i < 100; i++) {
          poisson_lpdf += stan::math::reduce_sum<count_lpdf<var>>(
              data, grainsize, msgs, lambda_v, worksize);
        }
        std::cout << "reduce_sum, " << datasize << ", " << grainsize << ", "
                  << worksize << ", " << omp_get_wtime() - time << std::endl;
      }
    }
  }
  stan::math::recover_memory();
}
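As a sanity check on what the functor actually computes, here is a plain-double version of the `count_lpdf` arithmetic with no autodiff (`count_sum` is my name for it, not anything in stan-math). Each data element `e` contributes the geometric series `e * (lambda + lambda^2 + ... + lambda^N)`, so for `data` of 1024 ones, `lambda = 0.5`, and `N = 8` the total is `1024 * (1 - 0.5^8) = 1020`:

```cpp
#include <cstddef>
#include <vector>

// Plain-double version of the count_lpdf work loop (no autodiff):
// sums e * (lambda + lambda^2 + ... + lambda^N) over all elements e.
double count_sum(const std::vector<int>& data, double lambda, int N) {
  double sum = 0.0;
  for (std::size_t j = 0; j < data.size(); ++j) {
    double lambda_mult = data[j] * lambda;
    for (int i = 0; i < N; ++i) {
      sum += lambda_mult;
      lambda_mult *= lambda;
    }
  }
  return sum;
}
```

This makes the benchmark knobs concrete: `datasize` sets the outer loop length (the amount of parallelizable work), and `worksize` (`N`) sets the inner loop length (the work per element), independent of `grainsize`.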