with the number of tasks equal to the number of chains that I want to run in parallel, and with more than one CPU per chain so as to take advantage of map_rect?
I would like to add that I am using PyStan 3. I am not sure how to make the chains run in parallel, or whether I should specify how many cores each chain should use.
Should I set a Stan environment variable such as STAN_NUM_THREADS?
I don't think the documentation makes it clear how chain parallelization should be done in PyStan 3.
One thing you can do is just fire up a separate instance of PyStan in each of several processes, one chain per process. That duplicates the data in memory compared to a multi-threaded run, but I'm not sure whether PyStan has caught up with the multi-threaded, multiple-chain support that now exists inside a single process in our C++ code.
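To make the one-process-per-chain idea concrete, here is a minimal sketch, assuming PyStan 3's `stan.build` and `sample` interface. The helper names (`chain_seeds`, `run_one_chain`, `run_chains_in_processes`) are made up for illustration, and I have not checked how well PyStan behaves when several worker processes build the same model at once, so treat this as a starting point rather than a recommended recipe.

```python
from concurrent.futures import ProcessPoolExecutor


def chain_seeds(base_seed, n_chains):
    """Return one distinct seed per chain so the chains are not identical copies."""
    return [base_seed + i for i in range(n_chains)]


def run_one_chain(args):
    """Build the model and draw a single chain inside the current worker process."""
    program_code, data, seed, num_samples = args
    import stan  # PyStan 3; imported here so each worker process loads it itself

    posterior = stan.build(program_code, data=data, random_seed=seed)
    fit = posterior.sample(num_chains=1, num_samples=num_samples)
    # Return a plain DataFrame, which pickles cleanly back to the parent process.
    return fit.to_frame()


def run_chains_in_processes(program_code, data, n_chains=4, base_seed=1,
                            num_samples=1000):
    """Run n_chains independent single-chain fits, one per OS process."""
    jobs = [(program_code, data, seed, num_samples)
            for seed in chain_seeds(base_seed, n_chains)]
    with ProcessPoolExecutor(max_workers=n_chains) as pool:
        # Each job gets its own process, so the data is copied n_chains times.
        return list(pool.map(run_one_chain, jobs))
```

You would call `run_chains_in_processes(program_code, data)` from under an `if __name__ == "__main__":` guard and then combine the returned draws yourself, for example by concatenating the data frames with a chain label. It may also be worth building the model once in the parent process first so that the workers can reuse the cached compilation instead of compiling concurrently.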
You might also want to look into CmdStanPy, which runs Stan out of process and communicates with it via file I/O.
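For the CmdStanPy route, a rough sketch of "N chains in parallel, K threads each" could look like the following. The `chains`, `parallel_chains`, and `threads_per_chain` arguments and the `STAN_THREADS` compile option are CmdStanPy's; the two helper functions and their names are just illustrative. As far as I know, CmdStanPy sets STAN_NUM_THREADS for you from `threads_per_chain`, so you should not need to export it by hand.

```python
def sampling_kwargs(n_chains, cpus_per_chain):
    """Translate 'N chains, K CPUs each' into arguments for CmdStanModel.sample."""
    return {
        "chains": n_chains,
        "parallel_chains": n_chains,          # run all chains at the same time
        "threads_per_chain": cpus_per_chain,  # threads map_rect can use per chain
    }


def fit_with_cmdstanpy(stan_file, data, n_chains=4, cpus_per_chain=2):
    """Compile with threading enabled and sample all chains in parallel."""
    from cmdstanpy import CmdStanModel  # imported lazily; requires CmdStan

    # map_rect only runs multi-threaded if the model is compiled with STAN_THREADS.
    model = CmdStanModel(stan_file=stan_file, cpp_options={"STAN_THREADS": True})
    return model.sample(data=data, **sampling_kwargs(n_chains, cpus_per_chain))
```

With these settings the job needs roughly `n_chains * cpus_per_chain` cores in total, all visible to the single Python process that launches the chains.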