What's Parallelizable in the Math Library

Hello,

After a first pass/skim through the math library, here are some quick notes. Comments are appreciated, but I’m mostly talking to myself. I’m talking about multithreading, not just running different processes on different cores.

In general, anywhere we have a sum, it’s parallelizable over different partitions of the sum: each thread sums its own partition and the partial results are combined at the end. I’m not sure how much speed this would add, but for anything that’s used heavily it could be helpful.
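A minimal sketch of that partitioned-sum idea in plain C++ (the function name and the use of std::async are illustrative, not how the library would necessarily do it):

```cpp
#include <future>
#include <numeric>
#include <vector>

// Hypothetical sketch: split the sum into n_threads partitions,
// sum each partition concurrently, then combine the partial results.
double parallel_sum(const std::vector<double>& x, int n_threads) {
  std::vector<std::future<double>> partials;
  const std::size_t chunk = x.size() / n_threads;
  for (int t = 0; t < n_threads; ++t) {
    auto first = x.begin() + t * chunk;
    auto last = (t == n_threads - 1) ? x.end() : first + chunk;
    partials.push_back(std::async(std::launch::async, [first, last] {
      return std::accumulate(first, last, 0.0);
    }));
  }
  double total = 0.0;
  for (auto& p : partials) total += p.get();
  return total;
}
```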

Some functions call the gamma function three or four times, and those calls could be handled by separate threads; in the Bessel function of the first kind, for example, once the threads are joined you can compute the fraction and return.

Getting ahead of myself: since some of this parallelization is “non-trivial”, from a user-facing perspective it might be good to abstract the number of threads away from the user and expose a simple on/off toggle for threading. For example, if we’re dispatching multiple threads to compute the gamma function, we could send three at once in the beta function, but the user wouldn’t be able to specify the thread count a priori.
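For concreteness, here’s a hedged sketch of that beta example, using the identity B(a, b) = Γ(a)Γ(b) / Γ(a + b) with plain std::async. This is an illustration, not the library’s implementation (for three small gamma calls the threading overhead may well swamp the gain):

```cpp
#include <cmath>
#include <future>

// Illustrative only: beta(a, b) = Γ(a) Γ(b) / Γ(a + b),
// with the three gamma evaluations launched concurrently.
double beta_parallel(double a, double b) {
  auto ga  = std::async(std::launch::async, [a] { return std::tgamma(a); });
  auto gb  = std::async(std::launch::async, [b] { return std::tgamma(b); });
  auto gab = std::async(std::launch::async, [a, b] { return std::tgamma(a + b); });
  return ga.get() * gb.get() / gab.get();
}
```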

Where to start?

We’re using a discrete Fourier transform that internally calls an FFT, which is a classic example of a parallelizable algorithm. It looks like we’re using something unsupported from Boost, but this could be a good starting point since we have a reference.

But then, after thinking about it: we’ve overloaded the addition operator (i.e., add), and if that’s used extensively downstream, couldn’t multithreading it yield bigger performance gains than multithreading something that’s sparsely used?

Ignoring some trivial functions, like addition, here are some unabridged notes on what can be multithreaded within math/prim (corrections appreciated):

  • bessel_function_first_kind - parallelize the calls to gamma, then join and compute the fraction?
  • beta - parallelize the calls to gamma (sketched above)
  • DFT - we’re already calling the FFT, which is a parallelized algorithm, but if we’re
    feeling ambitious and want to use multithreading instead, we could parallelize the calls to the DFT
    using TBB and see whether we get performance gains. I’m not sure how it’s implemented in Boost
  • block - interesting case. For block-diagonal matrices, since each block is “conditionally” independent,
    we can apply multithreading across blocks; for example, the Cholesky decomposition of a
    block-diagonal matrix is just the Cholesky decomposition of each of its blocks (see the sketch after this list)
  • divide_columns - trivial, useful?
  • dot_product - possible, useful?
  • falling_factorial - thread the top and bottom gammas
  • matrix exponential - it’s a series whose terms can be computed disjointly and summed afterward
  • matrix power - I guess you could split the Nth power into sub-powers of roughly N/nThreads each,
    compute those in parallel, and multiply the results afterward; this should give a speed-up
  • rep_* - rep_vector, etc. might be parallelizable; not sure there are any serious gains, it’s an O(1) operation, copying a vector
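As promised in the block item above, here’s a hedged sketch of the block-diagonal Cholesky idea using TBB and Eigen; the function name and the representation of the blocks are hypothetical, not library code:

```cpp
#include <Eigen/Dense>
#include <tbb/parallel_for.h>
#include <vector>

// Illustrative sketch: the Cholesky factor of a block-diagonal matrix is the
// block-diagonal matrix of its blocks' factors, so each block's decomposition
// is independent and can run on its own thread.
std::vector<Eigen::MatrixXd> block_diag_cholesky(
    const std::vector<Eigen::MatrixXd>& blocks) {
  std::vector<Eigen::MatrixXd> factors(blocks.size());
  tbb::parallel_for(std::size_t(0), blocks.size(), [&](std::size_t i) {
    factors[i] = blocks[i].llt().matrixL();  // per-block Cholesky, no sharing
  });
  return factors;
}
```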

~ Regards

Sums of N numbers can be done in \mathcal{O}(\log N) time with N/2 processors, via a binary-tree reduction, but that’s a lot of processors. The main issue you’re going to run into is that if you’re running multiple chains, you’re going to need a lot of cores to see the advantage of parallel evaluation within a chain.

Within a chain, there are two different kinds of parallelization. The first we’re already getting with a lot of our operations: AVX/SSE-type vectorization, where the CPU evaluates 4 or 8 floating point operations in parallel in a lot of cases. Eigen, in particular, is very good at exploiting this.
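As a toy illustration (not library code), an Eigen expression like this compiles down to packed SIMD instructions without any explicit threading on our part:

```cpp
#include <Eigen/Dense>

// Eigen fuses this into a single vectorized loop; with AVX enabled it
// processes 4 doubles per packet instruction.
Eigen::VectorXd axpy(double a, const Eigen::VectorXd& x,
                     const Eigen::VectorXd& y) {
  return a * x + y;
}
```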

Once you get to large numbers of short-lived threads, you want to start thinking about something like the TBB library to manage them. It’s very inefficient to just create too many threads and have them contend, and thread creation and destruction are expensive in general, hence the use of thread pools in high-performance applications.
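For instance, the partitioned sum from earlier could be written against TBB so its worker pool and scheduler handle chunking and thread reuse; again a sketch, not a drop-in for the library:

```cpp
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// TBB picks the chunk sizes and reuses pooled worker threads, so we never
// pay per-call thread creation costs.
double tbb_sum(const std::vector<double>& x) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, x.size()), 0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double acc) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
          acc += x[i];
        return acc;
      },
      std::plus<double>());
}
```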

I think the better approach is moving to JAX. There you get the advantage of being able to parallelize massively on the GPU as well as across cores.

The FFT we’re using comes from Eigen, and it’s much slower than the standard code-generated approach, which is under too restrictive a license. But yes, FFTs are something that can be parallelized, though the FFT is already an efficient \mathcal{O}(N \log N) algorithm for an input of size N.

Pretty much any matrix operation can be parallelized. Eigen’s already getting the vectorization right.

The big wins are for any of our vectorized operations, like elementwise application of exp and things like that. This all goes through higher-level functors in the math library.
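One way multithreading those could look, sketched against a generic apply function rather than the library’s actual functor machinery (names hypothetical):

```cpp
#include <cmath>
#include <vector>
#include <tbb/parallel_for.h>

// Hypothetical elementwise map: apply a unary functor f to every element
// of x, with TBB distributing index ranges across its worker pool.
template <typename F>
std::vector<double> parallel_apply(const std::vector<double>& x, F f) {
  std::vector<double> y(x.size());
  tbb::parallel_for(std::size_t(0), x.size(),
                    [&](std::size_t i) { y[i] = f(x[i]); });
  return y;
}

// e.g. auto y = parallel_apply(x, [](double v) { return std::exp(v); });
```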

I’d recommend concentrating on things that get used a lot.

Copying a vector is \mathcal{O}(N) for a size N vector. Moving a vector is \mathcal{O}(1), to put it in C++ lingo, but it destroys the source vector.
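In code, the distinction looks like this (a toy illustration):

```cpp
#include <utility>
#include <vector>

int main() {
  std::vector<double> a(1000000, 1.0);
  std::vector<double> b = a;             // copy: O(N), every element duplicated
  std::vector<double> c = std::move(a);  // move: O(1), c steals a's buffer and
                                         // a is left valid but empty
}
```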