I would like to draw samples in parallel from a Stan model on a multi-core computer using system threads. This is important for supporting parallel sampling on Windows in PyStan. (macOS and Linux can use
fork to create independent processes.)
In pseudocode, I’m doing the following in four separate system threads:
    stan_model * model = new stan_model(var_context)
    return_code = hmc_nuts_diag_e(*model, init_var_context, random_seed, ...)
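To make the threading pattern concrete, here is a minimal, self-contained sketch of the structure I have in mind. Note that `run_chain` and `run_chains_in_threads` are hypothetical names invented for this illustration and are not part of the Stan API. `run_chain` only stands in for the sampler call: it does CPU-bound work with its own RNG and no shared state, so the sketch shows the thread layout rather than actual sampling.

```cpp
#include <cassert>
#include <random>
#include <thread>
#include <vector>

// Hypothetical stand-in for a sampler call such as hmc_nuts_diag_e(...).
// Does CPU-bound work with a chain-local RNG and touches no shared state.
// Returns 0 on success, mirroring a return code.
int run_chain(unsigned int random_seed, int num_draws, double& mean_out) {
  assert(num_draws > 0);
  std::mt19937 rng(random_seed);
  std::normal_distribution<double> normal(0.0, 1.0);
  double sum = 0.0;
  for (int i = 0; i < num_draws; ++i) {
    sum += normal(rng);
  }
  mean_out = sum / num_draws;
  return 0;
}

// Launch one system thread per chain, each with its own seed, and wait
// for all of them to finish. Each thread writes only to its own slot.
std::vector<int> run_chains_in_threads(int num_chains, unsigned int base_seed,
                                       int num_draws,
                                       std::vector<double>& means) {
  std::vector<int> return_codes(num_chains, -1);
  means.assign(num_chains, 0.0);
  std::vector<std::thread> threads;
  threads.reserve(num_chains);
  for (int c = 0; c < num_chains; ++c) {
    threads.emplace_back([&, c] {
      return_codes[c] = run_chain(base_seed + c, num_draws, means[c]);
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  return return_codes;
}
```

Calling `run_chains_in_threads(4, 1234u, 1000, means)` corresponds to the four-thread setup described above. With GCC or Clang on Linux this typically needs the `-pthread` flag. If a stand-in like this scales across cores but the real sampler call does not, that would suggest the serialization comes from something shared inside the sampler call rather than from the threading setup itself.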
With STAN_THREADS defined, this works (instead of crashing, as it did before stan-math PR #509), but it produces samples at the same rate as if I had drawn them serially. The independent cores are not being used to produce draws in parallel.
Is this expected behavior? Should I be able to use system threads to draw samples in parallel faster than I would be able to draw the same number of samples serially?