You can create as many std::thread objects as you want; you just need to join them at some point. I did this (more recently than the 2011 tutorial) when writing to Postgres, and it “works good”.
That guy declares static const int num_threads as a global, like this:
...
static const int num_threads = 10;
...
int main() {
    std::thread t[num_threads];

    // Launch a group of threads
    for (int i = 0; i < num_threads; ++i) {
        t[i] = std::thread(call_from_thread);
    }

    std::cout << "Launched from the main\n";

    // Join the threads with the main thread
    for (int i = 0; i < num_threads; ++i) {
        t[i].join();
    }

    return 0;
}
Has it since become possible to make the signature of the main function be
I haven’t followed all the details here—are we figuring out a generic way of multi-threading specific function calls and making it play nice with auto-diff here? That would be awesome… even though it’s still single-machine, right?
I assumed it would be easier to start with double and int operations. I think Bob said that in order to do stuff like this for autodiff, a change has to be made that carries roughly a 20% performance hit when run serially.
That all sounds great. From my intuition I would lean towards a thread pool type of implementation. This should avoid the overhead of creating and destroying these threads again and again.
Do I understand this right in that we are opting for parallelism which interacts with the AD stack in a serial way? To me that would make a lot of sense.
Can you explain this in more detail? I understand that the tape AD works with can have independent chunks, such as when there’s a matrix operation and we use the nested memory allocation to implement that. So it seems like any nested piece could be shipped off to a thread while the rest of the calculation carries on. Or do you mean something like what Ben said, that internally many functions do double-only calculations for gradients, and those could be parallelized. It seems like there are many possibilities with varying levels of complexity.
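For the “ship an independent piece off to a thread while the rest carries on” case, plain std::async is one way to sketch it (independent_chunk here is a made-up stand-in for a self-contained, double-only computation, not anything in the AD library):

```cpp
#include <future>

// Stand-in for a self-contained, double-only piece of the calculation.
double independent_chunk(double a, double b) {
    return a * b + a / b;
}

// The independent chunk runs on another thread; the rest of the
// calculation carries on, and fut.get() blocks only if the shipped-off
// result is not ready yet.
double overlapped(double a, double b) {
    auto fut = std::async(std::launch::async, independent_chunk, a, b);
    double rest = a + b;      // rest of the calculation carries on here
    return rest + fut.get();  // join the shipped-off result
}
```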
I would start simple-minded and expand on that. So, in order:

1. parallelize double-only computations; for example, loops
2. parallelize tasks in a way that does not require locking the AD tape (like step 1, but there are probably more things to do than for loops)
3. once we have that working, expand to asynchronous AD calculations which occasionally require locking the AD stack

Going this way would give us immediate speedups, and step 1 above should be darn simple to do (modulo learning to manage threads, etc.).
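Step 1 really can be simple. A sketch of a double-only loop split across threads, where each thread writes only its own partial result, so no shared state needs a lock and the AD tape is never touched (parallel_sum_of_squares is a hypothetical example, assuming num_threads >= 1, not Stan code):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split a double-only reduction into chunks, one per thread.
// Each thread writes only partial[t], so no locking is required.
double parallel_sum_of_squares(const std::vector<double>& x,
                               std::size_t num_threads) {
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    std::size_t chunk = (x.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&x, &partial, t, chunk] {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(begin + chunk, x.size());
            for (std::size_t i = begin; i < end; ++i)
                partial[t] += x[i] * x[i];
        });
    }
    for (auto& w : workers) w.join();
    // Serial combine of the per-thread results.
    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```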
The problem in that code is the array declaration: the size needs to be a compile-time constant so that the memory can be allocated on the function stack. No reason you couldn’t do something like this:
vector<thread> t(num_threads);
for (int i = 0; i < num_threads; ++i)
t[i] = thread(...);
I’d have thought you’d want to store a reference, but std::thread isn’t copyable: it’s actually move-assigned into the array in the code above, and would be move-assigned into the container here.
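A runnable sketch of that vector-based version (launch_and_join is just an illustrative wrapper, not from the original code): since std::thread is move-only, t[i] = std::thread(...) move-assigns into the slot, and num_threads can now be a runtime value rather than a static constant.

```cpp
#include <thread>
#include <vector>

// The vector holds default-constructed (empty) std::thread objects;
// each slot is then move-assigned a running thread. Returns how many
// threads were launched and joined.
int launch_and_join(int num_threads) {
    std::vector<std::thread> t(num_threads);
    for (int i = 0; i < num_threads; ++i)
        t[i] = std::thread([] { /* per-thread work goes here */ });
    for (auto& th : t)
        th.join();
    return num_threads;
}
```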