Sampling in parallel using threads



I would like to draw samples in parallel from a Stan model on a multi-core computer using system threads. This is important for supporting parallel sampling on Windows in PyStan. (macOS and Linux can use fork to create independent processes.)

In pseudocode, I’m doing the following in four separate system threads:

stan_model * model = new stan_model(var_context)
return_code = hmc_nuts_diag_e(*model, init_var_context, random_seed, ...)

With STAN_THREADS defined this works (instead of crashing, as it did before stan-math PR #509) but it produces samples at the same rate as if I had drawn the samples serially. The independent cores are not being used to produce draws in parallel.

Is this expected behavior? Should I be able to use system threads to draw samples in parallel faster than I would be able to draw the same number of samples serially?


If you define STAN_THREADS then the AD stack is treated thread local. This then allows you

  • to run stan models inside threads (so different chains can run in parallel using threads)
  • you can take advantage of parallelization within a chain if your stan model uses the new map_rect

What you describe sounds like there is no performance degradation when switching on STAN_THREADS which is great as I would expect this to happen. The thread local thing adds overhead to everything, but that overhead will be model/platform/compiler specific.

I hope this helps and makes sense.


What you’ve described is what I anticipated happening. In practice, I’m not able to get different chains to run in parallel using threads. Is there a test which verifies this functionality?

I’ve tried making some quick modifications to cmdstan to do parallel sampling in threads but I’m ending up with segfaults (it works). For example, I’ve rewritten cmdstan/main.cpp to be this:

#include <cmdstan/command.hpp>
#include <stan/services/error_codes.hpp>
#include <boost/exception/diagnostic_information.hpp> 
#include <boost/exception_ptr.hpp> 

#include <thread>         // std::thread, remember to compile with -pthread

int main(int argc, const char* argv[]) {
  try {
    std::thread second (cmdstan::command<stan_model>, argc, argv);
    cmdstan::command<stan_model>(argc, argv);
    second.join();                // pauses until thread finishes
    std::cout << "thread finished.\n";
    return 0;
  } catch (const std::exception& e) {
    std::cout << e.what() << std::endl;
    return stan::services::error_codes::SOFTWARE;

edit: it all works. false alarm


I’m sorry. I forgot to define STAN_THREADS. Let me keep trying for a moment.


I spoke too soon. It works fine when the threads are created in C++. The source of the problem must be that I’m creating the threads in Python – or somewhere else.


What did you mean by system threads?



On Linux and macOS threads are POSIX threads, I think. On Windows it’s something different.


Python threads aren’t threads in that sense, hence my question. From your description it sounds like the GIL isn’t being released, while wrapping a C++11 thread and releasing the GIL would work.


Python threads are threads in this sense.

I got it working. With the GIL released, things work as expected.