I was always intrigued by the idea of having a general purpose `parallel_for_each` construct in the Stan language. With the Intel TBB in reach I thought: let’s try it. I now have a prototype up and running and would appreciate a critical look at it from our core AD stack engineers. Let me first explain a bit.
- The AD tape is written along with the function evaluation onto the global AD stack, while with `STAN_THREADS` defined we end up having one AD tape per thread.
- As the Intel TBB uses a thread pool, we end up with many AD stacks which are persistent in memory.
- Executing things per thread with access to the thread-local AD stack is completely safe, since we access the local AD stack only.
The above is what makes our thread-based `map_rect` work (the threaded version does per-thread execution and then injects the results into the main thread). However, no one stops us from referencing operands in the main thread from within the worker threads! As long as the threads stay around, we can grow their AD stacks as we like and link them as we like. Once we have executed our parallel operation, the AD stack will end up scattered across multiple AD tapes (one per thread). Thus, when we set adjoints to zero or loop over the stack to propagate the chain rule, we only have to make sure that we loop over all the scattered AD stacks (and not just the one of the main thread).
All of the above can be easily expressed with the TBB. The TBB offers containers which hand out thread-local instances, and you can iterate over the instances of all threads.
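To make the container idea concrete, here is a minimal sketch that uses only the standard library rather than the TBB itself. The `per_thread` class and the `count_entries` function are hypothetical names made up for this illustration. They only mimic the two properties that matter here (each thread gets its own instance, and all instances can be visited afterwards) and are neither the TBB API nor the code from the prototype.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a TBB-style per-thread container:
// local() hands out the calling thread's own instance,
// for_each_instance() visits the instances of all threads.
template <typename T>
class per_thread {
 public:
  T& local() {
    std::lock_guard<std::mutex> lock(mutex_);
    // std::map never moves its elements, so the reference stays valid.
    return instances_[std::this_thread::get_id()];
  }

  template <typename F>
  void for_each_instance(F f) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto& entry : instances_) f(entry.second);
  }

 private:
  std::mutex mutex_;
  std::map<std::thread::id, T> instances_;
};

// Each worker appends only to its own vector (its "stack");
// the caller then walks over all of them.
inline std::size_t count_entries(std::size_t n_threads,
                                 std::size_t per_worker) {
  assert(n_threads > 0);
  per_thread<std::vector<int>> stacks;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    workers.emplace_back([&stacks, per_worker] {
      std::vector<int>& mine = stacks.local();
      for (std::size_t i = 0; i < per_worker; ++i) mine.push_back(1);
    });
  }
  for (auto& w : workers) w.join();

  std::size_t total = 0;
  stacks.for_each_instance(
      [&total](std::vector<int>& s) { total += s.size(); });
  return total;
}
```

With this, `count_entries(4, 100)` returns 400: no worker ever touches another worker's vector, yet the caller still sees everything that was recorded.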
The prototype I have done has these key bits:
- `for_each` evaluation of a function which builds the AD tape in multiple threads: parallel_for_each_test.cpp
- an AD stack based on Intel TBB’s container `enumerable_thread_specific`. This one gives me a thread-local AD tape and the possibility to iterate over all of them.
- global versions of the `grad_global` functions. These perform their respective operation not only on the instance of the calling thread, but on the instances of all threads. See [here](https://github.com/stan-dev/math/blob/feature/parallel_for_each/stan/math/rev/core/grad.hpp#L62), for example.
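The idea of a scattered tape with global operations can be illustrated with a toy reverse-mode sketch. Everything in it (`node`, `tape`, `push`, `multiply`, `zero_adjoints_global`, `sweep`, `gradient_of_parallel_sum`) is a hypothetical name invented for this example and not Stan Math code. It uses plain `std::thread` with one explicit tape per worker instead of the TBB. The worker nodes reference an operand `x` that lives on the main tape, and the reverse pass runs over all tapes. Note the simplifying assumption: the main tape only holds operands created before the parallel region, so sweeping the worker tapes first and the main tape last is a valid reverse order. A general scheme would need more care here.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <thread>
#include <vector>

// Toy tape node for y = a * b. Leaves have null operands.
// Operands may live on another thread's tape.
struct node {
  double val;
  double adj;
  node* a;
  node* b;
  node(double v, node* lhs, node* rhs)
      : val(v), adj(0.0), a(lhs), b(rhs) {}
  void chain() {
    if (a != nullptr && b != nullptr) {
      a->adj += adj * b->val;
      b->adj += adj * a->val;
    }
  }
};

// std::deque keeps element addresses stable when growing at the back.
using tape = std::deque<node>;

inline node* push(tape& t, double val, node* a = nullptr,
                  node* b = nullptr) {
  t.emplace_back(val, a, b);
  return &t.back();
}

inline node* multiply(tape& t, node* a, node* b) {
  return push(t, a->val * b->val, a, b);
}

// "Global" version: zero the adjoints on every tape, not just one.
inline void zero_adjoints_global(tape& main_tape,
                                 std::vector<tape>& worker_tapes) {
  for (node& n : main_tape) n.adj = 0.0;
  for (tape& t : worker_tapes)
    for (node& n : t) n.adj = 0.0;
}

// Propagate the chain rule over one tape in reverse recording order.
inline void sweep(tape& t) {
  for (auto it = t.rbegin(); it != t.rend(); ++it) it->chain();
}

// d/dx of sum_i c_i * x^2 with c_i = i + 1; each term is recorded
// on its own worker tape while x sits on the main tape.
inline double gradient_of_parallel_sum(double x_val,
                                       std::size_t n_workers) {
  tape main_tape;
  std::vector<tape> worker_tapes(n_workers);
  node* x = push(main_tape, x_val);

  std::vector<node*> results(n_workers, nullptr);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < n_workers; ++i) {
    workers.emplace_back([&, i] {
      tape& mine = worker_tapes[i];  // written by this worker only
      node* c = push(mine, static_cast<double>(i + 1));
      node* y = multiply(mine, x, c);     // x is only read here
      results[i] = multiply(mine, y, x);  // c_i * x^2
    });
  }
  for (auto& w : workers) w.join();

  zero_adjoints_global(main_tape, worker_tapes);
  for (node* r : results) {
    assert(r != nullptr);
    r->adj = 1.0;  // seed each term of the sum
  }
  for (tape& t : worker_tapes) sweep(t);  // workers first ...
  sweep(main_tape);                       // ... then the main tape
  return x->adj;
}
```

For example, `gradient_of_parallel_sum(3.0, 4)` returns 60, which matches 2 · 3 · (1 + 2 + 3 + 4). During the forward pass the workers only read `x`, and all adjoint updates happen serially after the join, so this toy has no concurrent writes to shared nodes.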
I hope I haven’t messed this up! How to actually roll this out to the language in a safe way still needs a bit of thinking. I am not sure if nested parallelism would work (probably not with this scheme), but what I can imagine is to run things in parallel and then combine the different thread-specific AD tapes (like `map_rect`). Still, I think this is some progress in the direction of making a `parallel_for_each` loop possible eventually.