General-purpose parallel_for_each with the Intel TBB?

math

#1

Hi there!

I have always been intrigued by the idea of having a general-purpose parallel_for_each construct in the Stan language. With the Intel TBB in reach I thought: let’s try it. I now have a prototype up and running and would appreciate a critical look from our core AD stack engineers. Let me first explain a bit.

  • The AD tape is written onto the global AD stack along with the function evaluation; with STAN_THREADS defined we end up with one AD tape per thread.
  • Because the Intel TBB uses a thread pool, we end up with many AD stacks which are persistent in memory.
  • Executing things per thread with access to the thread-local AD stack is completely safe, since each thread accesses only its own local AD stack.

The above is what makes our thread-based map_rect work (the threaded version does per-thread execution and then injects the results into the main thread). However, nothing stops us from referencing operands in the main thread from within the worker threads! As long as the threads stay around, we can grow their AD stacks as we like and link them as we like. Once we have executed our parallel operation, the AD stack will end up scattered across multiple AD tapes (one per thread). Thus, when we set adjoints to zero or loop over the stack to propagate the chain rule, we only have to ensure that we loop over all of the scattered AD stacks (and not just the main thread’s one).

All of the above can be expressed easily with the TBB. The TBB offers containers which hand out thread-local instances and let you iterate over every per-thread instance afterwards.

The prototype I have built has these key bits:

I hope I haven’t messed this up! How to actually roll this out to the language in a safe way still needs a bit of thinking. I am not sure whether nested parallelism would work (probably not with this scheme), but what I can imagine is running things in parallel and then combining the different thread-specific AD tapes (like map_rect). Still, I think this is some progress toward eventually making a parallel_for_each loop possible.

Best,
Sebastian


#2

Just to recap a little: we have talked over the Intel TBB a couple of times in the meeting, and I think right now we’re looking for answers to these questions (this is my attempt to write up the general questions we’d like to ask ourselves before adding a dependency to the Math library):

I think we should separate the Math library discussion from the Stan language design discussion (the latter should probably eventually be formed into a design doc). For the Math library, I’m personally really excited about the efficiency gains from thread pooling and about structuring code so that users no longer need to shard manually (which sounds like a Stan language issue, but if we no longer need to chunk up work because the TBB work-stealing algorithm handles this much more efficiently, then users can already begin writing easier code using the existing map_rect). @wds15 do you know if this last bit is right?

There are also some autodiff hygiene things to consider here, though perhaps nested autodiff and parallelism both get easier with this PR introducing a global autodiff instance for each thread (and you could imagine swapping that out when you start doing nested autodiff even without threaded parallelism going on).


#3

Yes, this sounds right to me. Users will still need to think a bit about what they define as a job, but that matters far less than it does now. Parallelizing a super-efficient vectorized expression is still not a smart thing to do (unless we are talking about gigabytes of data); the user just needs to define sufficiently large blocks of work to warrant meaningful parallelization. One can now err on the more granular side and in most cases simply let the Intel TBB do its work. The work stealing should automatically queue the work in a near-optimal way… but the Intel TBB gives us even more: it also lets us take care of cache locality. We can tell the Intel TBB which work and data belong together and it will align the work and the caches around that. Really nice!

Yup, we had the same thought about swapping AD tapes in and out, since the tape is now an object which we merely point to and which can be exchanged. I hope the PR you refer to can go in soon. I have a follow-up which I will file after that: a new design for nested AD which makes nesting no longer a global operation but a more local one, which is easier to deal with in a threaded environment.


#4

Gotcha, so most users will still want to shard… Do you think we could (someday) change the existing implementation of map_rect to do this vectorization automatically for the user? I.e., automatically chunking while assuming their code is written in a vectorized way? Or do we need a new map signature?


#5

OMG… I just realized… YES, the fully automatic thing is actually extremely easy with the TBB! When you convert a for loop into a TBB parallel_for loop, you have to code up a functor which no longer processes a single element of the loop, but rather a blocked_range. Thus, we can easily write a map_vec or so, which calls a user function that is expected to operate on a sub-slice of the input container. Given that slicing containers is easy, such a function is very simple to write.

Here is the basic example from the TBB parallel_for documentation:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

using namespace tbb;

struct Average {
    const float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.f);
    }
};

// Note: Reads input[0..n] and writes output[1..n-1]. 
void ParallelAverage( float* output, const float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    parallel_for( blocked_range<int>( 1, n ), avg );
}

This means we can combine vectorized statements in the user functor (which are super fast) with automatic chunking of the work. Now I see what this means… and this is really big! We can then throw tons of data at these constructs and they will just scale up as needed.

Combining parallelization with the efficiency of vectorization will once again make Stan another animal!

EDIT: Once we have this map_vec (or should we call it map_sliced?) we should absolutely also add a map_vec_reduce, or at least a map_vec_lpdf, because such a blocked map will be enormously fast and the reduce operation should then also be parallel.