Hi!

I have put up a parallel design for review. Along with it there is a working prototype, and I created a toy example of a large Poisson log-likelihood as a bigger reduce problem, which showed promising results. Specifically, I repeatedly summed 10^4 terms: on 2 cores the speed doubled, on 3 cores it improved only slightly, and on 4 cores it did not improve at all (it even got somewhat worse). Increasing the number of terms should have improved the scaling, but it didn't!

So I started to turn my prototype inside out and improved a few things here and there until I finally found the culprit: the numerical effort is, of course, dominated by the log-Gamma function... and that function from the standard C++ library is NOT thread-safe. Instead it uses a mutex internally for the sign of the output - ouch! So I switched to the Boost implementation of the log-Gamma function, which works without a mutex and is thus lock-free.

Now, with 10 repetitions of 10^8 terms, the performance scaling is almost perfect:

```
cores = 1
[ OK ] benchmark.parallel_reduce_sum_speed (101036 ms)
cores = 2
[ OK ] benchmark.parallel_reduce_sum_speed (50921 ms)
cores = 3
[ OK ] benchmark.parallel_reduce_sum_speed (34702 ms)
cores = 4
[ OK ] benchmark.parallel_reduce_sum_speed (26710 ms)
cores = 5
[ OK ] benchmark.parallel_reduce_sum_speed (21591 ms)
cores = 6
[ OK ] benchmark.parallel_reduce_sum_speed (19170 ms)
```

Of course, the problem must scale nicely, since the likelihood is so huge, with so many terms. Over the next days I will experiment to find out how small the problem can be while still getting good scaling. All in all, I am now really optimistic that we are on the right track here.

I am sorry, I am a little too tired now for plotting... next time.

Best,

Sebastian

BTW: Can we **please** switch back to the Boost lgamma??? The std lgamma *destroys* any parallel program's performance.

FYI... the C++ functor doing the work is really lean (so this approach can turn our probability distributions into parallel super nodes easily; one could even think about automating this):

```
template <typename T>
struct count_lpdf {
  const std::vector<int>& data_;
  const T& lambda_;

  count_lpdf(const std::vector<int>& data, const T& lambda)
      : data_(data), lambda_(lambda) {}

  // Evaluate the Poisson log-pmf over the inclusive slice [start, end].
  inline T operator()(std::size_t start, std::size_t end) const {
    std::vector<int> partial_data(data_.begin() + start,
                                  data_.begin() + end + 1);
    return stan::math::poisson_lpmf(partial_data, lambda_);
  }
};
```