RStan (PyStan) & MPI / GPU

I’ve played with these before so I’m pretty sure you’re not quite right about this. Here’s a tutorial that confirms what I think:

https://solarianprogrammer.com/2011/12/16/cpp-11-thread-tutorial/

You can create as many std::thread objects as you want, you just need to join them at some point. I did this (more recently than the 2011 tutorial) with writing to postgres and it “works good”.

That guy is declaring static const int num_threads as a global, like:

    #include <iostream>
    #include <thread>

    static const int num_threads = 10;

    // This function will be called from each thread
    void call_from_thread() {
        std::cout << "Hello, World" << std::endl;
    }

    int main() {
        std::thread t[num_threads];

        // Launch a group of threads
        for (int i = 0; i < num_threads; ++i) {
            t[i] = std::thread(call_from_thread);
        }

        std::cout << "Launched from the main\n";

        // Join the threads with the main thread
        for (int i = 0; i < num_threads; ++i) {
            t[i].join();
        }

        return 0;
    }

Has it since become possible to make the signature of the main function be

 int main(const int num_threads)

?
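As far as I know that still isn't allowed; the standard only permits the argc/argv signatures, so the thread count would have to come in as a regular argument (or an environment variable, as discussed below). A minimal sketch:

    #include <cstdlib>
    #include <iostream>

    int main(int argc, char* argv[]) {
      // Hypothetical: take the thread count as the first argument.
      const int num_threads = (argc > 1) ? std::atoi(argv[1]) : 1;
      std::cout << "running with " << num_threads << " threads\n";
      return 0;
    }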

Check out how threads are used here (number set in the constructor):

Relevant snippet:

    for (unsigned int i = 0; i < n_threads__; ++i) {
      write_threads__.emplace_back(std::thread(&psql_writer::consume_samples, this));
    }

So, we would put this

const int nthreads = atoi(std::getenv("STAN_THREADS"));

into functions like stan/math/prim/mat/vectorize/apply_scalar_unary.hpp and then do something like
http://www.alecjacobson.com/weblog/?p=4544
?
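One thing to watch there: std::getenv returns a null pointer when STAN_THREADS isn't set, so you'd want a guard around the atoi. A minimal sketch (get_num_threads is a hypothetical helper name, and the fallback to hardware_concurrency() is just one possible choice):

    #include <cstdlib>
    #include <thread>

    // Sketch: read STAN_THREADS, guarding against an unset variable.
    inline int get_num_threads() {
      const char* env = std::getenv("STAN_THREADS");
      int n = env ? std::atoi(env) : 0;
      if (n <= 0)
        n = std::thread::hardware_concurrency();  // may itself be 0
      return n > 0 ? n : 1;
    }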

Yeah, that should work. We could also take it as an argument (e.g., for CmdStan).

I think I could help figure out how to apply this but I’m finishing that nearly-done rstan branch first! :P

I haven’t followed all the details here—are we figuring out a generic way of multi-threading specific function calls and making it play nice with auto-diff here? That would be awesome… even though it’s still single-machine, right?

I think that’s the gist of it. And it would definitely be awesome.

I don’t think this gives you multiple machines but would that not be possible?

Sure it’s possible but we’d have to set up the messaging ourselves which is “just” book-keeping… It should be “straightforward”… :)

Yes. Ideally, it would just work in Stan Math functions regardless of what they were being used for.

Cool, I’m on board with C++11 threads being the way to go rather than trying to ship another dependency.

Maybe std::async instead of std::thread, but I agree it is worth trying to do without OpenMP.
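For instance, splitting a reduction over std::async tasks looks roughly like this (a toy sketch, not Stan code; parallel_sum is a made-up name):

    #include <future>
    #include <numeric>
    #include <vector>

    // Toy sketch: split a sum across std::async tasks. Each task
    // reduces its own chunk; get() blocks until that task finishes.
    double parallel_sum(const std::vector<double>& x, int n_tasks) {
      std::vector<std::future<double>> futures;
      const std::size_t chunk = x.size() / n_tasks;
      for (int i = 0; i < n_tasks; ++i) {
        auto begin = x.begin() + i * chunk;
        auto end = (i + 1 == n_tasks) ? x.end() : begin + chunk;
        futures.push_back(std::async(std::launch::async, [begin, end] {
          return std::accumulate(begin, end, 0.0);
        }));
      }
      double total = 0.0;
      for (auto& f : futures)
        total += f.get();
      return total;
    }

One nicety of std::async is that get() also rethrows any exception thrown inside the task, so error handling stays sane across threads.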

I wonder if an easy way to start would be to just push all autodiff calculations to a separate thread.

I assumed it would be easier to start with double and int operations. I think Bob said that in order to do stuff like this for autodiff, a change has to be made that has something like a 20% performance hit when done serially.

Oh I see, so what’s an example where you want to do it?

The example where I have been trying OpenMP is an _lpdf

This all sounds great. My intuition would be to lean towards a thread-pool type of implementation, which should avoid the overhead of creating and destroying threads again and again.
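Something along the lines of the classic condition-variable pool, say (an untested sketch; thread_pool and its members are made-up names, not anything in Stan Math):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    // Rough sketch of a fixed-size pool: workers are created once and
    // pull tasks off a shared queue until shutdown.
    class thread_pool {
     public:
      explicit thread_pool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
          workers_.emplace_back([this] {
            for (;;) {
              std::function<void()> task;
              {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
              }
              task();  // run outside the lock
            }
          });
      }
      void enqueue(std::function<void()> task) {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          tasks_.push(std::move(task));
        }
        cv_.notify_one();
      }
      ~thread_pool() {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
      }

     private:
      std::vector<std::thread> workers_;
      std::queue<std::function<void()>> tasks_;
      std::mutex mutex_;
      std::condition_variable cv_;
      bool stop_ = false;
    };

The threads are created once in the constructor (the same pattern as the psql_writer snippet above) and reused for every enqueued task.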

Do I understand this right in that we are opting for parallelism which interacts with the AD stack in a serial way? To me that would make a lot of sense.

Can you explain this in more detail? I understand that the tape AD works with can have independent chunks, such as when there's a matrix operation and we use the nested memory allocation to implement that. So it seems like any nested piece could be shipped off to a thread while the rest of the calculation carries on. Or do you mean something like what Ben said, that internally many functions do double-only calculations for gradients, and those could be parallelized? It seems like there are many possibilities with varying levels of complexity.

I would start simple-minded and expand on that. So, in order:

  1. parallelize double-only computations, for example loops (a rough sketch of this follows below)
  2. parallelize tasks in a way such that we do not need to lock the AD tape (like step 1, but there are probably more things to do than just for loops)

Once we get that working, we could expand to asynchronous AD calculations, which require locking the AD stack occasionally. Going this way would give us immediate speedups, and step 1 above should be darn simple to do (modulo learning to manage threads, etc.).
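To make step 1 concrete, here is roughly the kind of thing I mean: a double-only loop where each thread owns a disjoint slice, so nothing touches the AD tape and no locking is needed (transform_in_parallel and the exp() body are made up for illustration):

    #include <algorithm>
    #include <cmath>
    #include <thread>
    #include <vector>

    // Sketch of step 1: double-only work split across threads. Each
    // thread writes only to its own slice, so no synchronization is
    // needed and the AD tape is never involved.
    void transform_in_parallel(std::vector<double>& x, int n_threads) {
      std::vector<std::thread> threads;
      const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
      for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([&x, t, chunk] {
          const std::size_t begin = t * chunk;
          const std::size_t end = std::min(begin + chunk, x.size());
          for (std::size_t i = begin; i < end; ++i)
            x[i] = std::exp(x[i]);  // stand-in for the real double-only work
        });
      }
      for (auto& th : threads)
        th.join();
    }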

The problem in that code is the array declaration: that needs to be a fixed static constant so that the memory can be allocated on the function stack. No reason you couldn't do something like this:

vector<thread> t(num_threads);
for (int i = 0; i < num_threads; ++i)
  t[i] = thread(...);

I'd have thought you'd want to store a reference, but std::thread is move-only, so it's actually being moved into that array in the code above and would be moved into the container here.
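And emplace_back can construct the thread directly in the container, skipping even the move; a sketch reusing call_from_thread from the tutorial code above:

    std::vector<std::thread> t;
    t.reserve(num_threads);
    for (int i = 0; i < num_threads; ++i)
      t.emplace_back(call_from_thread);  // constructs the thread in place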