I would like to draw samples in parallel from a Stan model on a multi-core computer using system threads. This is important for supporting parallel sampling on Windows in PyStan. (macOS and Linux can use fork to create independent processes.)
In pseudocode, I’m doing the following in four separate system threads:
stan_model * model = new stan_model(var_context)
return_code = hmc_nuts_diag_e(*model, init_var_context, random_seed, ...)
With STAN_THREADS defined this works (instead of crashing, as it did before stan-math PR #509) but it produces samples at the same rate as if I had drawn the samples serially. The independent cores are not being used to produce draws in parallel.
Is this expected behavior? Should I be able to use system threads to draw samples in parallel faster than I would be able to draw the same number of samples serially?
If you define STAN_THREADS then the AD stack is treated thread local. This then allows you
to run stan models inside threads (so different chains can run in parallel using threads)
you can take advantage of parallelization within a chain if your stan model uses the new map_rect
What you describe sounds like there is no performance degradation when switching on STAN_THREADS which is great as I would expect this to happen. The thread local thing adds overhead to everything, but that overhead will be model/platform/compiler specific.
What you’ve described is what I anticipated happening. In practice, I’m not able to get different chains to run in parallel using threads. Is there a test which verifies this functionality?
I’ve tried making some quick modifications to cmdstan to do parallel sampling in threads but I’m ending up with segfaults (it works). For example, I’ve rewritten cmdstan/main.cpp to be this:
I spoke too soon. It works fine when the threads are created in C++. The source of the problem must be that I’m creating the threads in Python – or somewhere else.
Python threads aren’t threads in that sense, hence my question. From your description it sounds like the GIL isn’t being released, while wrapping a C++11 thread and releasing the GIL would work.