RStan (PyStan) & MPI / GPU

features

#1

Hi!

With MPI we have a huge feature coming to cmdstan. I don’t quite see how this will be easy to integrate with our other interfaces which have a tight integration with stan such as rstan and pystan. However, a simple and straightforward approach could be to add to rstan/pystan functions which essentially call the cmdstan version and then import the results.

Does that sound sensible? Those wrappers do not need to support all options, but the most important ones. This would allow rstan/pystan to easily take advantage of MPI.

BTW, running from R the cmdstan version is what I do all the time for large runs which are crunched on our cluster. Later post-processing is then done in R, of course.

Sebastian


#2

There are tons of R packages that use MPI / OpenMP / GPUs, etc.

https://cran.r-project.org/web/views/HighPerformanceComputing.html

I think we will be able to figure something out, particularly if there are viable functions in Stan Math.


#3

I’m totally with Ben on this. rstan need fewer hacks not more.


#4

@bgoodri: I looked a bit into it and found it difficult to come up with a way to integrate it, but I haven’t explored it in too much detail. The problem is how to control the MPI environment when using boost::mpi. It would be good to integrate things with the Rmpi package, but that does not seem to go well with boost::mpi since both libraries rely on full control over mpi as I understood it.

@sakrejda: Sure, fewer hacks are better than more… but to me this is actually a missing feature of RStan. To run Stan on larger scale clusters a hybrid of RStan and CmdStan is ideal to do that. I have written my own scripts to do exactly that. So firing off CmdStan from RStan would be a very useful feature anyway.


#5

I guess I don’t know what you’re asking for then. A couple of us have packages for running cmdstan from R that are super simple. I just generate batch files because then you basically curry your runs off cluster and you can run them wherever and rely on a real scheduler rather than R. You get fault tolerance for free. Rstan has huge dependencies so it’s great not to have to get that on a cluster. What do you get from integrating with rstan?


#6

Related to this, @wds15 and @Bob_Carpenter: What is necessary on the Stan side to allow us to use OpenMP pragmas inside all the _lp{d,m}f functions that accumulate their derivatives analytically? I can get it to compile and it runs fine if I don’t utilize OpenMP, but if I put #pragma omp parallel for before a loop in the C++ code, it terminates when an exception is thrown due to underflow, as in

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
  what():  Error in function boost::math::lgamma<long double>(long double): Evaluation of lgamma at 0.
Aborted

#7

What you describe sounds OK to me. I was able to get OpenMP working once I managed to not touch the AD stack during execution of parallel regions. I have never seen this error, but maybe some bits of the AD stack are touched as it looks like. If you have code to checkout, let me know.


#8

I know literally nothing about OpenMP pragmas. Maybe @seantalts or @syclik could help.

I would think if you’re trying to parallelize that loop, the problem would be synchronization. For example, considerl normal_lpdf(y | mu, sigma) where mu is a vector and sigma is a scalar. Each iteration over the elements of mu increments the same memory location for derivatives w.r.t. sigma.


#9

Pragmas are just compiler directives; I’m guessing the custom MPI compiler does something special with these. I think Sebastian is saying we will need the code generation backend to emit these pragmas when using MPI with certain components.


#10

I discovered that if you try to do something like

#pragma omp parallel for
for (int i = 0; i < N; i++) {
  // stuff
}

it will segfault, so what you need to be doing is like

#pragma omp parallel for
for (int i = 0; i < N; i++) {
  try {
    // stuff
  }
  catch {

  }
}

This is something we can do with distribution functions in Stan Math that work in doubles internally; it does not require changes to stanc.


#11

OpenMP and MPI… similar acronyms; completley different bag.


#12

Hi Ben!

Yes, I also learned the hard way that exceptions may never escape out of an openmp block. The way I solved that is here:

In short: You need to put the loop into a try block, catch everything and later rethrow any exception whiyh may have occurred.

I wonder how much speedup we can get from this.


#13

A lot, in cases with large amounts of data and likelihood functions with analytical derivatives. I got it to work once I checked for underflow outside the #pragma omp for and it is much faster now.

There is a case to be made for using C++11 threads / tasks instead of OpenMP, but as far as I can tell, the former does not allow you to control the number of cores used with an environmental variable.


#14

Multi-core on the same machine shoul be best with multi-threading as you can share memory and avoid serialization and transmission costs (though you still need to load memory you use). And you’re not hitting the OS process manager, either. The drawback for us is that we have global memory for the autodiff stack that needs to be made thread local and causes a perfromance hit.

Multi-core on multiple machines is probably the sweet spot for MPI. But the advantage of using MPI here is that we can also use it for multi-core on a single machine quite easily. And it keeps globals separate.

Same arguments go for running multiple Stan programs.


#15

OpenMP is multi-threading but it isn’t supported by XCode and isn’t very good on Windows for g+±4.9. So, while it easy to implement, it is really only helpful for people using Linux or Mac with a non-Xcode compiler and many more local cores than chains.


#16

Exactly. People like Sebastian.

I am drooling over the new iMac Pro. I’m OK with using g++ or a non-XCode clang++.


#17

I think we should try to get some native C++ thing working as we have a huge MacOS user base who will struggle with getting this to work. From reading a bit through the net I think we can go for a thread pool like detailed here:

This would allow us to fix the number of threads to a fixed number. These threads would constantly run and we assign them work to do once in a while.

The other reason to go with vanilla C++ is that we do not have to do weird hacks for exceptions… OpenMP is really great to get things quickly working, but after all I think it is a very low-level technique which introduces a heavy C like feeling to the code.

If you code on an experimental branch and want me to take a look; I am happy to do so.

This would be just awesome to have machine local threading and MPI combined (oh, I forgot the GPU)!

Just to add: OpenMP is fine by me; I just think a C++ only solution is preferable and we should try to get that running if we can with reasonable effort.


#18

Anything with the number of threads fixed at compile time is perhaps viable for the case where the users are compiling their own models locally. It isn’t viable for things like rstanarm. That is why I was thinking to put OpenMP stuff into Stan Math whenever working with doubles and ints and operations that do not throw exceptions.


#19

But the thread pool will allow you to nail the number of concurrent threads to a fixed number. So we either configure that with some configuration item (like options for the samplers) or we create an environment variable STAN_CORES which we can access in the constructor of the thread pool. That should work, no?


#20

My understanding is that the number of threads has to be known at compile time. So, we could have such an argument for stanc but what would we set it to when the program is compiled on CRAN?