Parallel autodiff v4

That could work, but I was thinking more in the sense of a custom Jenkins page (similar to what we have for performance testing) in which you would input the stan-dev/stanc3 branch and it would produce the three binaries and comment on the PR with a link to them. Jenkins already has all the capabilities to build the three binaries, as it does that for the releases. But let’s not derail this thread with this CI discussion.

I am really interested in where this difference between Ben’s models and the ones Sebastian was running is coming from. Seems really weird.

I am really not concerned by this. These results still make sense to me. First, the problem size is still small if you can calculate this in just 170 s on 1 core (by the way, the reduce_sum thing is faster on 1 core for large sizes, which is nice). You should try 2 cores and 4 cores. From what I have seen, getting a 2x speedup on 2 cores is often achievable - and that’s a huge gain! Getting good speedup on 4 cores is often very, very hard, and this has to do with the amount of non-parallelizable workload we have. If the fraction of non-parallel work is sizeable, then you simply cannot speed it up; refer to Amdahl’s law:
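For reference, Amdahl’s law says that if a fraction p of the total work runs in parallel on n cores, the best achievable speedup is

S(n) = 1 / ((1 - p) + p / n)

so the serial fraction (1 - p) alone caps the speedup at 1 / (1 - p), no matter how many cores you add.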

When the parallel code block makes up 75% of the overall workload, you get only about a 1.6x speedup on 2 cores and only about 3x on 8 cores! With 95% of the workload in the parallel region, 8 cores get you about 6x; with 90% parallel workload, just 5x or so.

So we are simply limited by the serial proportion of the program, and that’s fine - we are hitting theoretical boundaries here.

Thus, we can be happy with what we have, really!


The Poisson distribution could be easily optimized internally, but how about testing something more complex:

functions {
  // Log density of the bivariate Poisson-Laguerre polynomial model
  // (section 2.5 of the SAS paper linked below).
  real BIVARIATE_POISSON_LAGUERRE_POLYNOMIAL_log(real y1, real y2,
                                                 real alpha1, real alpha2,
                                                 real rho11, real mu1, real mu2) {
    real la1 = exp(mu1);
    real la2 = exp(mu2);
    real a1a2 = alpha1 * alpha2;
    real rhosq = square(rho11);
    real rhoinv = 1. / (1. + rhosq);
    real lambda1 = (alpha1 + rhosq * (alpha1 + 2)) * rhoinv;
    real lambda2 = (alpha2 + rhosq * (alpha2 + 2)) * rhoinv;

    real etaxsi1 = 1. / (1. + la1 / lambda1) * (1. / alpha1);
    real etaxsi2 = 1. / (1. + la2 / lambda2) * (1. / alpha2);
    real xsi1 = (y1 + 1. + alpha1) * etaxsi1;
    real eta1 = (y1 + alpha1) * etaxsi1;
    real xsi2 = (y2 + 1. + alpha2) * etaxsi2;
    real eta2 = (y2 + alpha2) * etaxsi2;

    return y1 * mu1 - lgamma(1 + y1)
           + y2 * mu2 - lgamma(1 + y2)
           + lgamma(y1 + alpha1) - lgamma(alpha1)
           + lgamma(y2 + alpha2) - lgamma(alpha2)
           + alpha1 * log(lambda1) - (alpha1 + y1) * log(lambda1 + la1)
           + alpha2 * log(lambda2) - (alpha2 + y2) * log(lambda2 + la2)
           + log(rhoinv)
           + log(1
                 + 2 * rho11 * sqrt(a1a2) * (1 - eta1) * (1 - eta2)
                 + rhosq * a1a2
                   * (1 - 2 * eta1 + eta1 * xsi1)
                   * (1 - 2 * eta2 + eta2 * xsi2));
  }
}

From section 2.5, “Bivariate Poisson-Laguerre Polynomial Model”, in:
https://support.sas.com/resources/papers/proceedings11/355-2011.pdf

I thought the model in question so far has a bernoulli_logit_lpmf, no?

Anyway - the bivariate Poisson-Laguerre polynomial you suggest is a lot better suited for parallelization, since it calls the log-gamma function a few times, which is relatively expensive to compute. The bernoulli_logit_lpmf is computationally very cheap, so I would expect much better speedups with something calling a ton of log-gammas.

I thought we should at least have a model that not only needs lots of CPU cycles but also does some autodiff, i.e. memory operations, and see how that scales.

@wds15 Well the experiment was that I multiplied the amount of total work and the grainsize by 10 and the scaling stayed the same between 1 and 8 threads.

So it’s like I hugely increased the amount of parallel work but things didn’t actually scale any better.

@andre.pfeuffer yeah it would be good to have a model that does more than just do a matrix multiply + add on random effects. Do you have one of these models with simulated data or something you can share here?

bivariatepoissonlaguerre.stan (1.6 KB) bivariatepoissonlaguerre.R (662 Bytes) dumpdata.R (8.2 KB)

Created a small test model. It just fits the distribution. I wanted to keep the dependencies low.

(It’s useless to run timing tests on my 9-year-old notebook.)


@andre.pfeuffer thanks for the model. I wrote a parallel version of it and ran a few benchmarks so we can add that to the collection here.

bpl.data.R (6.1 KB) bpl.stan (1.8 KB) bpl_parallel.stan (2.0 KB)

The base model has 1000 data points. To make the model slower I added a ‘rep’ argument to the data file; rep = 2 means the data is replicated twice, doubling its length.

rep  grainsize  8 threads parallel  1 thread parallel  1 thread serial
10   1250       88 s                179 s              214 s
10   125        60 s                171 s              214 s
1    125        7.7 s               18 s               18 s

This looks nice to me.

I would not test only 8 threads as the parallel configuration. I would use 1, 2, and 4 cores, and maybe 8 as well. Doubling the speed with 2 cores is often easily doable - and I like to verify that - and what you get with 4 cores is very interesting to me, as that is where it starts to show how well things scale for a given problem. Getting efficient scaling with 8 cores would require a very high proportion of parallel workload relative to the overall workload, as Amdahl’s law shows.
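As a rough back-of-the-envelope check (treating the 1-thread parallel time as the serial baseline and ignoring parallelization overhead, so take it with a grain of salt), the rep = 10, grainsize = 125 numbers above give

S(8) = 171 / 60 ≈ 2.85
1 / ((1 - p) + p / 8) = 2.85  =>  p ≈ 0.74
S(2) ≈ 1 / (0.26 + 0.74 / 2) ≈ 1.6
S(4) ≈ 1 / (0.26 + 0.74 / 4) ≈ 2.2

i.e. an implied parallel fraction of roughly 75%, which would predict about 1.6x on 2 cores and 2.2x on 4 cores for that configuration.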

I still haven’t been able to get this to be efficient on any of the problems that I coded up even as I make the parallel work very large.

I thought there might be a performance problem with the threadlocal variables, but that is not the case; I did some tests and convinced myself of that.

I made a little change to how the deep copies work. There was some overhead in allocating std::vectors that was a bit weird (https://github.com/stan-dev/math/pull/1616/commits/7a1a27b589debf44d42e6144bb614d0a97bd286b).

I coded up a benchmark in the test code but couldn’t really get any weird behavior out of it. The idea was that there are three parameters to a calculation:

  1. The total number of parallel calculations
  2. The grainsize
  3. The size of the work

And I tried to characterize performance with this benchmark.

I think I need to add another, which is the number of parameters. Maybe that’s what I’ll do next. Just wanted to post an update and say I’m still not sure what’s going on.

#include <stan/math/prim/core.hpp>
#include <stan/math.hpp>
#include <gtest/gtest.h>
#include <omp.h>  // for omp_get_wtime, used below for wall-clock timing
#include <algorithm>
#include <sstream>
#include <tuple>
#include <vector>

std::ostream* msgs = nullptr;

// reduce_sum functor: for each element of the slice it accumulates N
// dependent multiply-adds, so N controls the per-element work and the
// amount of autodiff tape generated.
template <typename T>
struct count_lpdf {
  count_lpdf() {}

  inline T operator()(std::size_t start, std::size_t end,
                      const std::vector<int>& sub_slice, std::ostream* msgs,
                      const T& lambda, int N) const {
    using stan::math::var;

    var sum = 0.0;
    for (std::size_t j = start; j < end; ++j) {
      var lambda_mult = sub_slice[j - start] * lambda;
      for (int i = 0; i < N; ++i) {
        sum += lambda_mult;
        lambda_mult *= lambda;
      }
    }

    return sum;
  }
};

TEST(v3_reduce_sum_benchmarks, reduce_sum_small) {
  using stan::math::var;

  stan::math::init_threadpool_tbb();

  // Sweep the three benchmark parameters (data size, grainsize, per-element
  // work) and time 100 reduce_sum calls for each combination.
  std::vector<int> datasizes = { 1024, 4096, 16384 };
  std::vector<size_t> grainsizes = { 8, 16, 32, 64, 128, 256 };
  std::vector<int> worksizes = { 8, 16, 32, 64, 128, 256 };

  std::cout << "which_parallel, datasize, grainsize, worksize, time" << std::endl;
  for(auto datasize : datasizes) {
    for(auto grainsize : grainsizes) {
      for(auto worksize : worksizes) {
        std::vector<int> data(datasize, 1);

        var lambda_v = 0.5;

        double time = omp_get_wtime();
        var poisson_lpdf = 0.0;
        for(int i = 0; i < 100; i++) {
          poisson_lpdf += stan::math::reduce_sum<count_lpdf<var>>(data, grainsize, msgs, lambda_v, worksize);
        }
        std::cout << "reduce_sum, " << datasize << ", " << grainsize << ", " << worksize << ", " << omp_get_wtime() - time << std::endl;
      }
    }
  }

  stan::math::recover_memory();
}

Did you have a look at this?

https://software.intel.com/en-us/node/506060

You should try more compute-intensive things, like calculating log-gamma, for example. Also, I would start with deterministic scheduling, to avoid the additional complexity of the automatic scheduler.

Edit: the automatic scheduler is interesting for tasks with varying computational complexity or nested parallelism, both of which I would avoid when starting out.
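To make the deterministic-scheduling suggestion concrete, here is a minimal raw-TBB sketch (plain TBB for illustration, not the reduce_sum implementation): a parallel_reduce over log-gamma terms where a simple_partitioner together with an explicit grainsize on the blocked_range fixes the chunking deterministically, instead of leaving it to the default auto_partitioner.

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <tbb/partitioner.h>
#include <cmath>
#include <functional>
#include <vector>

// Sum of lgamma over a vector with a fixed chunk size: simple_partitioner
// splits the range into chunks of roughly `grainsize` elements and never
// adapts the chunking at runtime, unlike the default auto_partitioner.
double sum_lgamma(const std::vector<double>& x, std::size_t grainsize) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, x.size(), grainsize), 0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double partial) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          partial += std::lgamma(x[i]);
        return partial;
      },
      std::plus<double>(), tbb::simple_partitioner());
}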

I specifically went for things that would stress the autodiff stack because I was scared of threadlocal performance. With that test code, though, it’s pretty easy to get 7x speedups.

I guess two other types of models we could test would be:

  1. Models with lots of computation with little autodiff stack action (any of the linear algebra stuff with custom reverse mode, or any ODE example).

  2. Models that randomly access small pools of memory (like a regression with random effects does).

I tried the static scheduler once and this didn’t seem to be an issue.

Did you take into consideration what the Intel folks advise:

A rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute. For example, if a single iteration takes 100 clocks, then the grainsize needs to be at least 1000 iterations. When in doubt, do the following experiment:

  1. Set the grainsize parameter higher than necessary. The grainsize is specified in units of loop iterations. If you have no idea of how many clock cycles an iteration might take, start with grainsize=100,000. The rationale is that each iteration normally requires at least one clock per iteration. In most cases, step 3 will guide you to a much smaller value.
  2. Run your algorithm.
  3. Iteratively halve the grainsize parameter and see how much the algorithm slows down or speeds up as the value decreases.

Each scheduled task must not be too small; equivalently, the scheduling overhead needs to be offset by choosing the grainsize as described.
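Put as a formula, the quoted rule of thumb amounts to roughly

grainsize_min ≈ 100,000 clock cycles / (clock cycles per iteration of operator())

so the cheaper each iteration is, the larger the grainsize has to be before the per-task scheduling overhead stops dominating.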

My prediction is that ODE-solving things will scale great (for example, a hierarchical ODE problem).

I think so too. It would be nice to see it in action.

Yeah I tried all sorts of grainsize stuff. I couldn’t get an actual problem to run fast (no matter the grainsize).

The thing that’s a little bit different for us is that if we’re doing a bunch of primitive, non-vector ops, we’re allocating a lot of memory on the threadlocal stacks. I’m wondering if, as we make the grainsize larger (but before it gets so large that the TBB can no longer run things in parallel), we start to get limited by some sort of memory-speed issue.

Of course, that’s just a guess and should be tested.

Maybe - but for a number of applications which involve heavy computations we already see nice speedups, I think.

From my perspective, we should gear up to merge what we have into develop (meaning to make the code ready with all the required stuff around testing + doc + review).

What are the things on your mind before we can proceed? To me the fundamental design makes total sense in terms of what is user-facing. The inner implementation is also good for a number of important applications (and can still be improved). Thus, I would really like to see this going into one of the next releases (maybe even 2.23).

No. _lpdf computes the normalizing constants, ~ does not. This really hurts the performance of my map_reduce code.
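For context, a minimal C++-level sketch of that difference, using the propto template flag that the Stan Math _lpdf functions expose: the ~ statement corresponds to the proportional form that drops additive constants, while target += ..._lpdf(...) keeps them.

#include <stan/math.hpp>
#include <iostream>

int main() {
  using stan::math::var;

  var mu = 0.5;
  var sigma = 1.2;
  double y = 0.3;

  // target += normal_lpdf(y | mu, sigma): all normalizing constants kept.
  var full = stan::math::normal_lpdf<false>(y, mu, sigma);

  // y ~ normal(mu, sigma): additive terms that do not depend on any
  // autodiff variable are dropped (here the -0.5 * log(2 * pi) constant).
  var prop = stan::math::normal_lpdf<true>(y, mu, sigma);

  std::cout << "full: " << full.val() << ", propto: " << prop.val() << std::endl;

  stan::math::recover_memory();
  return 0;
}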

I like the language. I’m happy with that too. It was really easy to write the three example models in this thread.

As far as preparing for release, we need to clean up all the C++ and document it and such. Also, the language bit needs to be finished up, documented, and tested.

I don’t want to put this in until we have a good explanation for why these models are slow though. We need to write docs for how people use this. It’s okay if this sucks for some problems, but we should be able to say for this type of problem, expect great speedups, and for this other problem, you’re going to be super limited in the level of parallelization you can do.

Right now I don’t know why these models are slow. Writing tests and docs is straightforward. The hard problem is figuring out why these models are slow and which models are fast.

For me this is already too useful to hold it back any longer. Many problems will benefit greatly from it - and we know that already now. I also don’t think that we need to be able to make perfect predictions for models which are fast or slow with this. I mean we know it’s useful for quite a number of cases.

In any case, getting better understanding would be good, of course. How about we collect the examples in an easy to execute manner and also have a wiki page on it?

My hunch is that for the slow models the computational cost per unit of work is not adequate (around 10^5 clock cycles is recommended per task). What complicates things on top of that is the additional overhead from the parallelization itself (copying the shared arguments).

I think you are right in that this is a problem, but please do start a separate thread for this.

Bob, we do this (target += … _lpdf) a lot when we move things into map_rect functions (unrelated to this specific parallel reduce thread). Is there a way to do this both inside map_rect (which I think means we can’t call ~) and not compute the normalizing constants, so that we get the best performance characteristics? Thank you!