RStan (PyStan) & MPI / GPU

I’ve played with these before so I’m pretty sure you’re not quite right about this. Here’s a tutorial that confirms what I think:

https://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/

You can create as many std::thread objects as you want, you just need to join them at some point. I did this (more recently than the 2011 tutorial) with writing to postgres and it “works good”.

That guy is declaring static const int num_threads as a global, like:

    #include <iostream>
    #include <thread>

    static const int num_threads = 10;

    // This function will be called from each thread
    void call_from_thread() {
        std::cout << "Hello, World" << std::endl;
    }

    int main() {
        std::thread t[num_threads];

        // Launch a group of threads
        for (int i = 0; i < num_threads; ++i) {
            t[i] = std::thread(call_from_thread);
        }

        std::cout << "Launched from the main\n";

        // Join the threads with the main thread
        for (int i = 0; i < num_threads; ++i) {
            t[i].join();
        }

        return 0;
    }

Has it since become possible to make the signature of the main function be

 int main(const int num_threads)

?
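As far as I know that still isn't allowed; the standard only permits the argc/argv signatures, so the thread count would have to come in as a regular argument (or an environment variable, as discussed below). A minimal sketch:

    #include <cstdlib>
    #include <iostream>

    int main(int argc, char* argv[]) {
      // Hypothetical: take the thread count as the first argument.
      const int num_threads = (argc > 1) ? std::atoi(argv[1]) : 1;
      std::cout << "running with " << num_threads << " threads\n";
      return 0;
    }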

Check out how threads are used here (number set in the constructor):

Relevant snippet:

    for (unsigned int i = 0; i < n_threads__; ++i) {
      write_threads__.emplace_back(std::thread(&psql_writer::consume_samples, this));
    }

So, we would put this

const int nthreads = atoi(std::getenv("STAN_THREADS"));

into functions like stan/math/prim/mat/vectorize/apply_scalar_unary.hpp and then do something like
http://www.alecjacobson.com/weblog/?p=4544
?
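One thing to watch there: std::getenv returns a null pointer when STAN_THREADS isn't set, so you'd want a guard around the atoi. A minimal sketch (get_num_threads is a hypothetical helper name, and the fallback to hardware_concurrency() is just one possible choice):

    #include <cstdlib>
    #include <thread>

    // Sketch: read STAN_THREADS, guarding against an unset variable.
    inline int get_num_threads() {
      const char* env = std::getenv("STAN_THREADS");
      int n = env ? std::atoi(env) : 0;
      if (n <= 0)
        n = std::thread::hardware_concurrency();  // may itself be 0
      return n > 0 ? n : 1;
    }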

Yeah, that should work. We could also take it as an argument (e.g., for CmdStan).

I think I could help figure out how to apply this but I’m finishing that nearly-done rstan branch first! :P

I haven’t followed all the details here—are we figuring out a generic way of multi-threading specific function calls and making it play nice with auto-diff here? That would be awesome… even though it’s still single-machine, right?

I think that’s the gist of it. And it would definitely be awesome.

I don’t think this gives you multiple machines but would that not be possible?

Sure it’s possible but we’d have to set up the messaging ourselves which is “just” book-keeping… It should be “straightforward”… :)

Yes. Ideally, it would just work in Stan Math functions regardless of what they were being used for.

Cool, I’m on board with C++11 threads being the way to go rather than trying to ship another dependency.

Maybe std::async instead of std::thread, but I agree it is worth trying to do without OpenMP.
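For instance, splitting a reduction over std::async tasks looks roughly like this (a toy sketch, not Stan code; parallel_sum is a made-up name):

    #include <future>
    #include <numeric>
    #include <vector>

    // Toy sketch: split a sum across std::async tasks. Each task
    // reduces its own chunk; get() blocks until that task finishes.
    double parallel_sum(const std::vector<double>& x, int n_tasks) {
      std::vector<std::future<double>> futures;
      const std::size_t chunk = x.size() / n_tasks;
      for (int i = 0; i < n_tasks; ++i) {
        auto begin = x.begin() + i * chunk;
        auto end = (i + 1 == n_tasks) ? x.end() : begin + chunk;
        futures.push_back(std::async(std::launch::async, [begin, end] {
          return std::accumulate(begin, end, 0.0);
        }));
      }
      double total = 0.0;
      for (auto& f : futures)
        total += f.get();
      return total;
    }

One nicety of std::async is that get() also rethrows any exception thrown inside the task, so error handling stays sane across threads.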

I wonder if an easy way to start would be to just push all autodiff calculations to a separate thread.

I assumed it would be easier to start with double and int operations. I think Bob said that in order to do stuff like this for autodiff, a change has to be made that has something like a 20% performance hit when done serially.

Oh I see, so what’s an example where you want to do it?

The example where I have been trying OpenMP is an _lpdf

This all sounds great. My intuition would be to lean towards a thread-pool type of implementation, which should avoid the overhead of creating and destroying threads again and again.
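Something along the lines of the classic condition-variable pool, say (an untested sketch; thread_pool and its members are made-up names, not anything in Stan Math):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Rough sketch of a fixed-size pool: workers are created once and
    // pull tasks off a shared queue until shutdown.
    class thread_pool {
     public:
      explicit thread_pool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
          workers_.emplace_back([this] {
            for (;;) {
              std::function<void()> task;
              {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
              }
              task();  // run outside the lock
            }
          });
      }
      void enqueue(std::function<void()> task) {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          tasks_.push(std::move(task));
        }
        cv_.notify_one();
      }
      ~thread_pool() {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
      }

     private:
      std::vector<std::thread> workers_;
      std::queue<std::function<void()>> tasks_;
      std::mutex mutex_;
      std::condition_variable cv_;
      bool stop_ = false;
    };

The threads are created once in the constructor (the same pattern as the psql_writer snippet above) and reused for every enqueued task.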

Do I understand this right in that we are opting for parallelism which interacts with the AD stack in a serial way? To me that would make a lot of sense.

Can you explain this in more detail? I understand that the tape AD works with can have independent chunks, such as when there's a matrix operation and we use the nested memory allocation to implement that. So it seems like any nested piece could be shipped off to a thread while the rest of the calculation carries on. Or do you mean something like what Ben said, that internally many functions do double-only calculations for gradients, and those could be parallelized? It seems like there are many possibilities with varying levels of complexity.

I would start simple-minded and expand on that. So, in order:

  1. parallelize double-only computations, for example loops (a rough sketch of this follows below)
  2. parallelize tasks in a way such that we do not need to lock the AD tape (like step 1, but there are probably more things to do than just for loops)

Once we get that working, we could expand to asynchronous AD calculations, which require locking the AD stack occasionally. Going this way would give us immediate speedups, and step 1 above should be darn simple to do (modulo learning to manage threads, etc.).
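To make step 1 concrete, here is roughly the kind of thing I mean: a double-only loop where each thread owns a disjoint slice, so nothing touches the AD tape and no locking is needed (transform_in_parallel and the exp() body are made up for illustration):

    #include <algorithm>
    #include <cmath>
    #include <thread>
    #include <vector>

    // Sketch of step 1: double-only work split across threads. Each
    // thread writes only to its own slice, so no synchronization is
    // needed and the AD tape is never involved.
    void transform_in_parallel(std::vector<double>& x, int n_threads) {
      std::vector<std::thread> threads;
      const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
      for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&x, t, chunk] {
          const std::size_t begin = t * chunk;
          const std::size_t end = std::min(begin + chunk, x.size());
          for (std::size_t i = begin; i < end; ++i)
            x[i] = std::exp(x[i]);  // stand-in for the real double-only work
        });
      }
      for (auto& th : threads)
        th.join();
    }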

The problem in that code is the array declaration: that needs to be a fixed static constant so that the memory can be allocated on the function stack. No reason you couldn't do something like this:

vector<thread> t(num_threads);
for (int i = 0; i < num_threads; ++i)
  t[i] = thread(...);

I'd have thought you'd want to store a reference, but std::thread is move-only, so it's actually being moved into that array in the code above and would be moved into the container here.
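And emplace_back can construct the thread directly in the container, skipping even the move; a sketch reusing call_from_thread from the tutorial code above:

    std::vector<std::thread> t;
    t.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i)
      t.emplace_back(call_from_thread);  // constructs the thread in place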