This might only interest a few people (like @wds15), so fair warning.
I was asking during the meeting on Thursday whether we'd looked into just using bare async tasks instead of batching, and how the performance looked for that. I tried it out with a microbenchmark, and it seems like we get significant wins, at least for relatively cheap matrix operations (transpose and multiply for a 50x50 matrix). You can see the code here (and I hope it serves as inspiration for answering other performance questions we all might have).
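To give a flavor of the setup, here is a minimal sketch in the spirit of that benchmark (not the linked code; the job function, matrix type, and batching scheme are stand-ins): one bare std::async task per job versus one task per chunk of jobs.

// Sketch only: one bare std::async task per job vs. one task per batch.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

#include <Eigen/Dense>

using Matrix = Eigen::MatrixXd;

// The cheap per-job work: transpose and multiply a 50x50 matrix.
Matrix job(const Matrix& m) { return m.transpose() * m; }

// Variant 1: spawn one bare async task per job.
std::vector<Matrix> run_bare_async(const std::vector<Matrix>& jobs) {
  std::vector<std::future<Matrix>> futures;
  futures.reserve(jobs.size());
  for (const auto& m : jobs)
    futures.push_back(std::async(std::launch::async, job, std::cref(m)));
  std::vector<Matrix> results;
  results.reserve(jobs.size());
  for (auto& f : futures) results.push_back(f.get());
  return results;
}

// Variant 2: chunk the jobs into a fixed number of batches,
// one async task per batch.
std::vector<Matrix> run_batched(const std::vector<Matrix>& jobs,
                                std::size_t num_batches) {
  const std::size_t chunk = (jobs.size() + num_batches - 1) / num_batches;
  std::vector<std::future<std::vector<Matrix>>> futures;
  for (std::size_t start = 0; start < jobs.size(); start += chunk) {
    const std::size_t end = std::min(start + chunk, jobs.size());
    futures.push_back(std::async(std::launch::async, [&jobs, start, end] {
      std::vector<Matrix> out;
      out.reserve(end - start);
      for (std::size_t i = start; i < end; ++i) out.push_back(job(jobs[i]));
      return out;
    }));
  }
  std::vector<Matrix> results;
  results.reserve(jobs.size());
  for (auto& f : futures)
    for (auto& m : f.get()) results.push_back(std::move(m));
  return results;
}

The bare-task variant hands scheduling entirely to the runtime, one task per job; the batched variant amortizes the per-task overhead across each chunk.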
I tried to run this, but the cloning aborts with some permission denied messages… anyway, are you saying that chunking is the better thing to do as opposed to using async “all the way”?
I hope so, as that was my conclusion from looking at things. This is also why I am keen on pushing this parallelization stuff onto other libraries like the Intel TBB. The TBB has been designed to handle efficient parallelization… including nested parallelism. So I would rather not reinvent the wheel here, given that decent solutions are available.
And yeah, I'm saying that the chunking approach you built, which we have in there now, is more performant. And I agree with your assessment: I think this is a great place to pull in a carefully chosen dependency.
Ah, I think I am having trouble with the git submodules, which are also defined with the git URL, while the ssh setup works.
Ok, good that you came to the same conclusion. The only trouble with the Intel TBB is that it is from Intel. Nothing bad about that, but this is why the Parallel STL could be a nice compromise. However, the Intel TBB offers a lot more control in comparison to the C++17 standard. Thus, using the Parallel STL from Intel would be a compromise between taking advantage of the Intel TBB and not going all in on a vendor-specific lib… but the TBB has been around for quite some time now, has a good license, and runs on all our platforms, so we could even opt for going all in with the TBB. We should consider those options. The Parallel STL PR I put up is still C++17-only, so it doesn't make any choice there.
Changed the submodules too if you want to try again.
Is it bad that the TBB is from Intel, as long as the license allows us to include it easily and it is performant on all our platforms? (We would need to check that they don't screw over AMD chips, I guess.)
We should probably do a bit of research on the question of the Intel TBB's performance on AMD. Obviously they only test and optimize it for Intel, but since the TBB is used a lot in research projects, this must be a question people have looked at.
So staying with the C++17 standard will give you more options as to what will actually implement your parallelism (Intel Parallel STL/MSVC/HPX/Intel Compiler/GCC 9.0/…). This is the upside of the standard, but the downside is less control over details (and a few things are still missing).
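To make the C++17 route concrete, here is a minimal sketch (which backend actually executes it depends on your standard library; with GCC 9 the parallel algorithms are themselves implemented on top of the TBB):

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<double> x(1000000, 1.0);
  // std::execution::par requests parallel execution; the standard offers
  // no control over thread counts, chunking, or scheduling details.
  std::transform(std::execution::par, x.begin(), x.end(), x.begin(),
                 [](double v) { return 2.0 * v; });
}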
From a quick read I haven't found anything in particular that speaks against the use of the Intel TBB with respect to AMD.
What is really tempting to try is to use the Intel TBB concurrent_vector as the basis for our AD stack. If we get that to work, then we could (potentially) offer a parallel_for Stan construct… and that would open the door for a much larger user base to take advantage of parallelization. While we have map_rect, you only apply it if you really need it (or just want to try it), but a parallel_for loop would be far more useful.
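For the curious, the two TBB pieces in question look roughly like this (a sketch only, not a proposal for the actual AD-stack integration):

#include <tbb/blocked_range.h>
#include <tbb/concurrent_vector.h>
#include <tbb/parallel_for.h>

#include <cstddef>
#include <vector>

int main() {
  std::vector<double> input(10000, 1.0);
  // concurrent_vector supports thread-safe growth, which is what makes it
  // interesting as a candidate backing store for the AD stack.
  tbb::concurrent_vector<double> store;

  // parallel_for splits the index range into chunks and schedules them,
  // handling nested parallelism correctly.
  tbb::parallel_for(
      tbb::blocked_range<std::size_t>(0, input.size()),
      [&](const tbb::blocked_range<std::size_t>& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
          input[i] *= 2.0;            // disjoint indices: no data race
          store.push_back(input[i]);  // safe concurrent growth
        }
      });
}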
If the vanilla TBB is really an option, I am happy to think about this a bit - I think that a parallel_for would be a major step forward.
I don’t want to derail this thread too much, but can you say more about how you're thinking about parallel for versus a parallel map? (For the sake of argument, let's assume we use the magic new compiler to create a much nicer parallel map API than we currently have with map_rect, but that it still takes a function and a collection.)
The problem with map_rect is that it is very user-unfriendly… you end up having to pack your parameters and data into the signature as we provide it. The upside of this packing is that we know what to do for each job and what the operands are, so we can run a nested AD sweep per job. What I am proposing now, in addition, is a parallel_for facility: you give it any function object and a range to execute over, and we execute that in parallel. This version would not know what the operands are, so we cannot run a nested AD sweep per job. What happens instead is that we write one AD tape per thread and link those together. This way we can already give our users a very easy-to-use parallel execution facility, with no need for a super-clever parser. A parallel map API with automatic analysis of the program is possibly even better, yes. Still, at least for C++ users of the stan-math library, this parallel_for thing would be very convenient.
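To sketch the C++-side shape of this (the names here are hypothetical, not an actual stan-math API; the per-thread AD tape linking is omitted and only the plain-double case is shown):

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

#include <cstddef>
#include <vector>

// Hypothetical facility: apply any function object f over 0..n-1 in
// parallel and collect the results. The autodiff version would in
// addition give each worker thread its own AD tape and link the tapes.
template <typename F>
std::vector<double> parallel_map_sketch(F f, std::size_t n) {
  std::vector<double> out(n);
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
                    [&](const tbb::blocked_range<std::size_t>& r) {
                      for (std::size_t i = r.begin(); i != r.end(); ++i)
                        out[i] = f(i);  // each slot written exactly once
                    });
  return out;
}

A user would then just pass a lambda capturing whatever it needs read-only, e.g. parallel_map_sketch([&](std::size_t i) { return data[i] * scale; }, data.size()); (data and scale being whatever is in scope).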
I think that makes sense. And yeah, I don't want to compare with the existing map_rect, as I agree it's not very user-friendly. I'm imagining we had a more typical map function like you might find in other languages, where it's easy enough to pack your data into tuples and unpack them in the function definition; I'm worried because it seems like the parallel for concept would have potentially confusing rules that would be difficult to enforce and get right, in either the Stan language or the Math library. Could you enumerate the rules for the efficient use of parallel for?
Hmm… maybe it actually makes sense to rename what I have done to be a parallel_map. The idea is that you apply a UnaryFunction over an iterator range, and each function evaluation result is stored in a vector (an array in Stan speak). The UnaryFunction can be a lambda with read-only references to anything in the global scope. Being able to pass in anything from the global scope is what makes this user-friendly, as it avoids the packing and unpacking. The function itself must be safe to execute in parallel, which means:
No modification of any global variable (one could relax this a bit: if you loop over i=1…N, then in many cases you can safely access and modify structures in the global scope that are indexed by i).
No access to the target log-prob (which is a global variable, of course).
No access to random streams -> avoid calling random number generators
I think that should be it. So it could look like this:
Stan pseudo code
real foo(real container_elem, real arg1, vector arg2, int arg3) {
  return result_of_fancy_calc;  // placeholder for the actual computation
}
...
real arg1;
vector[10] arg2;
int arg3;
real container[500] = ...;
real map_result[500] = parallel_map(foo, container, arg1, arg2, arg3);
Or something like this. This way we can avoid the packing business, and data stays data (so no over-promotion of data to autodiff types). Any additional operation (like a combine) would happen as a separate function.