Hello,
After a first pass/skim through the math library, here are some quick notes. Comments are appreciated, but I'm mostly thinking out loud. By parallelization I mean multithreading within a single process, not just running different processes on different cores.
In general, anywhere we have a sum, it can be parallelized by partitioning the terms across threads and reducing. I'm not sure how much speed this would buy, but for anything used heavily, it could be worthwhile.
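To make the partitioned-sum idea concrete, here's a minimal sketch using `std::async` from the standard library (not any existing API in the math library; the function name and chunking scheme are my own):

```cpp
#include <future>
#include <numeric>
#include <vector>
#include <cstddef>

// Hedged sketch: split a sum into k partitions, accumulate each
// partition on its own task, then reduce the partial sums.
double parallel_sum(const std::vector<double>& x, int k) {
  std::vector<std::future<double>> parts;
  const std::size_t chunk = x.size() / k;
  for (int i = 0; i < k; ++i) {
    auto begin = x.begin() + i * chunk;
    auto end = (i == k - 1) ? x.end() : begin + chunk;
    parts.push_back(std::async(std::launch::async,
        [begin, end] { return std::accumulate(begin, end, 0.0); }));
  }
  double total = 0.0;
  for (auto& f : parts) total += f.get();  // join and reduce
  return total;
}
```

Note that floating-point addition is not associative, so a partitioned sum may differ from the serial one in the last bits; that may or may not matter for us.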
Some functions call the gamma function three or four times; those calls could each run on their own thread. In Bessel first kind, for example, once the threads are joined you compute the fraction and return.
Getting ahead of myself: since some of this parallelization is non-trivial, from a user-facing perspective it might be best to abstract the thread count away and expose only a toggle switch for threading. If we're sending multiple threads to compute the gamma function, say three at once in the beta function, the user wouldn't be able to specify the thread count a priori.
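As a sketch of the "three gammas at once" idea for beta, here's what it could look like with `std::async` and `std::lgamma` (the function name `beta_async` is mine, and this is not how the library currently implements beta):

```cpp
#include <cmath>
#include <future>

// Hedged sketch: beta(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b).
// Launch the three lgamma evaluations concurrently, join, combine.
double beta_async(double a, double b) {
  auto ga  = std::async(std::launch::async, [a] { return std::lgamma(a); });
  auto gb  = std::async(std::launch::async, [b] { return std::lgamma(b); });
  auto gab = std::async(std::launch::async,
                        [a, b] { return std::lgamma(a + b); });
  return std::exp(ga.get() + gb.get() - gab.get());
}
```

Whether this wins in practice depends on the cost of a single lgamma call versus thread launch overhead; for scalar arguments the overhead probably dominates, so this likely only pays off inside vectorized loops.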
Where to start?
We're using a discrete Fourier transform that internally calls an FFT, which is a classic example of a parallelizable algorithm. It looks like we're using something from Boost that's unsupported, but this could be a good starting point since we have a reference implementation.
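For intuition on why the transform parallelizes: each output bin of a DFT is an independent sum over the input, so even the naive O(N^2) version splits cleanly into one task per bin. A minimal sketch (my own toy code, not the Boost or library implementation):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <future>
#include <vector>

// Hedged sketch: naive O(N^2) DFT with one task per output bin.
// A real implementation would use a parallel FFT instead.
std::vector<std::complex<double>>
dft_parallel(const std::vector<std::complex<double>>& x) {
  const double pi = 3.14159265358979323846;
  const std::size_t n = x.size();
  std::vector<std::future<std::complex<double>>> bins;
  for (std::size_t k = 0; k < n; ++k) {
    bins.push_back(std::async(std::launch::async, [&x, n, k, pi] {
      std::complex<double> sum{0.0, 0.0};
      for (std::size_t j = 0; j < n; ++j) {
        const double angle = -2.0 * pi * static_cast<double>(k * j) / n;
        sum += x[j] * std::complex<double>(std::cos(angle), std::sin(angle));
      }
      return sum;  // bin k is an independent sum -- no shared state
    }));
  }
  std::vector<std::complex<double>> out;
  for (auto& f : bins) out.push_back(f.get());
  return out;
}
```

One task per bin is far too fine-grained for real use; chunking bins per thread (or just using TBB's `parallel_for`) would be the sane version.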
But then, thinking about it further: if we've overloaded the addition operator, i.e. add, and it's used extensively downstream, couldn't multithreading it yield bigger performance gains than parallelizing something sparsely used?
Ignoring some trivial functions, like addition, here are some unabridged notes on what could be multithreaded within math/prim (corrections appreciated):
- bessel_function_first_kind - parallelize the calls to gamma, then reduce and compute the fraction?
- beta - parallelize the calls to gamma
- DFT - we're already calling the FFT, which is a parallelizable algorithm, but if we're feeling it and want to use multithreading instead, we could parallelize the DFT calls using TBB and see if we get performance gains. I'm not sure how it's implemented in Boost.
- block - interesting case. For block-diagonal matrices, since each block is "conditionally" independent, we can apply multithreading to matrix algorithms; for example, the Cholesky decomposition of a block-diagonal matrix is the Cholesky decomposition of each of its blocks.
- divide_columns - trivial, useful?
- dot_product - possible, useful?
- falling_factorial - thread the top and bottom gammas
- matrix exponential - it's a series whose terms can be computed disjointly and summed afterward
- matrix power - I guess you can compute sub-powers across nThreads threads and then multiply afterward; this should give a speed-up
- rep_* - rep_vector etc. might be parallelizable, but I doubt there are serious gains; it's essentially a vector fill/copy
~ Regards