General-purpose parallel_for_each with the Intel TBB?

math

#1

Hi there!

I have always been intrigued by the idea of having a general-purpose parallel_for_each construct in the Stan language. With the Intel TBB in reach I thought: let’s try it. I now have a prototype up and running and would appreciate a critical look from our core AD stack engineers. Let me first explain a bit.

  • The AD tape is written onto the global AD stack along with the function evaluation; with STAN_THREADS defined we end up with one AD tape per thread.
  • Because the Intel TBB uses a thread pool, we end up with many AD stacks which are persistent in memory.
  • Executing things per thread with access to the thread-local AD stack is completely safe, since each thread accesses only its own local AD stack.

The above is what makes our thread-based map_rect work (the threaded version does per-thread execution and then injects the results into the main thread). However, nothing stops us from referencing operands in the main thread from within the worker threads! As long as the threads stay around, we can grow their AD stacks as we like and link them as we like. Once we have executed our parallel operation, the AD stack will end up scattered across multiple AD tapes (one per thread). Thus, when we set adjoints to zero or loop over the stack to propagate the chain rule, we only have to ensure that we loop over all of the scattered AD stacks (and not just the main thread’s one).

All of the above can be expressed easily with the TBB. The TBB offers containers which hand out thread-local instances and let you iterate over every per-thread instance afterwards.

The prototype I have built has these key bits:

I hope I haven’t messed this up! How to actually roll this out to the language in a safe way still needs a bit of thinking. I am not sure whether nested parallelism would work (probably not with this scheme), but what I can imagine is running things in parallel and then combining the different thread-specific AD tapes (like map_rect). Still, I think this is some progress toward eventually making a parallel_for_each loop possible.

Best,
Sebastian


#2

Just to recap a little: we have talked over the Intel TBB a couple of times in the meeting, and I think right now we’re looking for answers to these questions (this is my attempt to write up the general questions we’d like to ask ourselves before adding a dependency to the Math library):

I think we should separate the Math library discussion from the Stan language design discussion (the latter should probably eventually be formed into a design doc). For the Math library, I’m personally really excited about the efficiency gains from thread pooling and about structuring code so that users no longer need to shard manually (which sounds like a Stan language issue, but if we no longer need to chunk up work because the TBB work-stealing algorithm handles this much more efficiently, then users can already begin writing easier code using the existing map_rect). @wds15 do you know if this last bit is right?

There are also some autodiff hygiene things to consider here, though perhaps nested autodiff and parallelism both get easier with this PR introducing a global autodiff instance for each thread (and you could imagine swapping that out when you start doing nested autodiff even without threaded parallelism going on).


#3

Yes, this sounds right to me. Users will still need to think a bit about what they define as a job, but that matters far less than it does now. Parallelizing a super-efficient vectorized expression is still not a smart thing to do (unless we are talking about gigabytes of data); the user just needs to define sufficiently large blocks of work to warrant meaningful parallelization. One can now err on the more granular side and in most cases simply let the Intel TBB do its work. The work stealing should automatically queue the work in a near-optimal way… but the Intel TBB gives us even more: it also lets us take care of cache locality. We can tell the Intel TBB which work and data belong together and it will align the work and the caches around that. Really nice!

Yup, we had the same thought about swapping AD tapes in and out, since the tape is now an object which we merely point to and which can be exchanged. I hope the PR you refer to can go in soon. I have a follow-up which I will file after that: a new design for nested AD which makes nesting no longer a global operation but a more local one, which is easier to deal with in a threaded environment.


#4

Gotcha, so most users will still want to shard… Do you think we could (someday) change the existing implementation of map_rect to do this vectorization automatically for the user? I.e., automatically chunking while assuming their code is written in a vectorized way? Or do we need a new map signature?


#5

OMG… I just realized… YES, the fully automatic thing is actually extremely easy with the TBB! When you convert a for loop into a TBB parallel_for loop, you have to code up a functor which no longer processes a single element of the loop, but rather a blocked_range. Thus, we can easily write a map_vec or so, which calls a user function that is expected to operate on a sub-slice of the input container. Given that slicing containers is easy, such a function is very simple to write.

Here is the basic example from the TBB parallel_for documentation:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

struct Average {
    const float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.f);
    }
};

// Note: Reads input[0..n] and writes output[1..n-1]. 
void ParallelAverage( float* output, const float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    parallel_for( blocked_range<int>( 1, n ), avg );
}

This means we can combine vectorized statements in the user functor (which are super fast) with automatic chunking of the work. Now I see what this means… and this is really big! We can then throw tons of data at these constructs and they will just scale up as needed.

Combining parallelization with the efficiency of vectorization will once again make Stan another animal!

EDIT: Once we have this map_vec (or should we call it map_sliced?) we should absolutely also add a map_vec_reduce, or at least a map_vec_lpdf, because such a blocked map will be enormously fast and the reduce operation should then also be parallel.