Multithreading with pystan3

Hello dear forum,

I am trying to run my Stan model with multithreading, but I saw that it’s a bit complicated with pystan2.18+. From this and other forums, it seems that official multithreading support is introduced with pystan3 (still in beta), so I’ve decided to try it. Since my model is very large, my goal is to parallelize within each chain by giving each chain multiple CPUs. However, I couldn’t find anywhere how to implement this in pystan3.
Can someone here point me in the right direction?

Thank you!

We currently are using multiple processes in pystan3.

Do you want to use Stan multithreading functions?

That’s my goal and I can allocate a generous number of cpus for each chain. In pystan2.19, even if I follow this example, I still have only 4 cpus running (1 per chain). So something doesn’t work right…
Update: I use pystan on a CentOS 7 Linux server.

So basically you need to define the STAN_NUM_THREADS environment variable and use the map_rect function in your model to use multiple CPUs within a chain.

And in the compile step, add:

extra_compile_args = ['-pthread', '-DSTAN_THREADS']
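
Putting the two pieces above together for PyStan 2.x, a minimal sketch might look like the following. The thread count of 8 and the helper name `compile_threaded` are assumptions for illustration. The environment variable has to be set before sampling starts, and the model itself still has to call map_rect for the extra threads to do any work.

```python
import os

# Number of threads Stan may use inside each chain (8 is an arbitrary example).
os.environ["STAN_NUM_THREADS"] = "8"

# Flags from the post above; they enable threading support when the model is compiled.
EXTRA_COMPILE_ARGS = ["-pthread", "-DSTAN_THREADS"]


def compile_threaded(model_code):
    """Hypothetical helper: compile a Stan program with threading enabled (PyStan 2.x)."""
    import pystan  # imported lazily so the settings above can be checked without PyStan

    return pystan.StanModel(
        model_code=model_code,
        extra_compile_args=EXTRA_COMPILE_ARGS,
    )
```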

So there is no way around rewriting my model to do multithreading? Even in pystan3?

No, there is no way around it.

You need to use map_rect or reduce_sum (reduce_sum is possible with CmdStanPy).
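
For a sense of what the rewrite involves, here is a minimal sketch of reduce_sum driven from CmdStanPy. The toy normal model, the names `partial_sum`, `grainsize`, `write_model`, and `fit_threaded`, and the thread counts are all assumptions for illustration, not the model discussed in this thread. The Stan code uses the array syntax of recent Stan releases.

```python
import os
import tempfile

# Toy Stan program (an assumption, not the poster's model): the likelihood is
# split into slices that reduce_sum can evaluate on separate threads.
STAN_PROGRAM = """
functions {
  real partial_sum(array[] real y_slice, int start, int end,
                   real mu, real sigma) {
    return normal_lpdf(y_slice | mu, sigma);
  }
}
data {
  int<lower=0> N;
  array[N] real y;
  int<lower=1> grainsize;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  target += reduce_sum(partial_sum, y, grainsize, mu, sigma);
}
"""


def write_model(directory=None):
    """Write the toy program to a .stan file and return its path."""
    directory = directory or tempfile.mkdtemp()
    path = os.path.join(directory, "toy_reduce_sum.stan")
    with open(path, "w") as f:
        f.write(STAN_PROGRAM)
    return path


def fit_threaded(stan_file, data, chains=4, threads_per_chain=4):
    """Compile with threading enabled and sample with several threads per chain."""
    from cmdstanpy import CmdStanModel  # requires CmdStanPy and a CmdStan install

    model = CmdStanModel(stan_file=stan_file, cpp_options={"STAN_THREADS": True})
    return model.sample(data=data, chains=chains, threads_per_chain=threads_per_chain)
```

A call such as `fit_threaded(write_model(), {"N": 3, "y": [0.1, 0.4, -0.2], "grainsize": 1})` would then use up to 4 x 4 = 16 cores, if that many are available.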

OK, thanks! One last question: from reading more, I saw that some folks suggest increasing the number of chains and reducing the number of iterations in each chain to speed things up. Is that really recommended?

Between-chain parallelism is always more efficient than within-chain parallelism, but you have to go through the warmup for every chain.
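
The trade-off can be made concrete with a little arithmetic. The sketch below assumes a fixed target of 4000 post-warmup draws and 1000 warmup iterations per chain (both numbers are illustrative), and that each chain runs on its own core. Adding chains shortens the sampling phase, but every chain still pays the full warmup.

```python
import math


def iterations_per_chain(total_draws, n_chains, warmup):
    """Iterations one chain must run: its full warmup plus its share of the draws."""
    return warmup + math.ceil(total_draws / n_chains)


for n_chains in (4, 8, 16):
    print(n_chains, iterations_per_chain(4000, n_chains, warmup=1000))
```

With these numbers, going from 4 to 16 chains cuts the per-chain work from 2000 to 1250 iterations rather than to a quarter, because the 1000 warmup iterations never shrink.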

Yes, I thought so. Thanks so much! Rewriting the model it is then!

I hope someone will pick this up: my really large model has been stuck in the first set of warmup iterations for 8 hours. Any ideas what this means and how I can solve it?

How large is your model?

Is it large in parameters or data?

Thanks, Ari!
It’s not huge in parameters - I am fitting ~40 parameters. But the data is pretty large - it’s ~1200 observations (including missing data), each with ~10 different tasks with hundreds of trials each. So I guess that’s a lot.
Now the weird part is that the original model loops over the trials in each task, and to make it more efficient I vectorized it, BUT it takes more time after vectorization… hmmmm.

I have been seeing this since 8:40 am EST:

Gradient evaluation took 0.39 seconds
1000 transitions using 10 leapfrog steps per transition would take 3900 seconds.
Adjust your expectations accordingly!

Iteration: 1 / 1000 [ 0%] (Warmup)
Iteration: 1 / 1000 [ 0%] (Warmup)
Iteration: 1 / 1000 [ 0%] (Warmup)
Iteration: 1 / 1000 [ 0%] (Warmup)
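
The estimate in that message is simple arithmetic, and it is worth redoing it with a less optimistic step count. The sketch below reproduces the 3900 seconds and then, as an assumption about how the sampler might be behaving, uses 1023 leapfrog steps per transition, the most that Stan's default maximum tree depth of 10 allows (2^10 - 1).

```python
def estimated_seconds(grad_seconds, leapfrog_steps, transitions):
    """Rough run time: one gradient evaluation per leapfrog step."""
    return grad_seconds * leapfrog_steps * transitions


# Numbers taken from the log above.
stan_estimate = estimated_seconds(0.39, 10, 1000)

# Assumed worst case: every transition reaches the default maximum tree depth.
worst_case = estimated_seconds(0.39, 2**10 - 1, 1000)

print(f"Stan's estimate: {stan_estimate:.0f} s")
print(f"At maximum tree depth: {worst_case / 3600:.0f} h")
```

If the sampler really is hitting the maximum tree depth during early warmup, 1000 transitions would take on the order of 110 hours, which would be consistent with seeing no progress for many hours.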

That sounds interesting.

Maybe ask how to improve your model in a new thread. It is usually a good idea to debug a hard model with others.

Loops in Stan are compiled to C++ loops, so they are already fast and vectorization sometimes doesn’t help.

I wouldn’t even know where to begin: it’s 600 lines of code… I’ll start by reparameterizing it and implementing the within-chain parallelization. Thanks a lot!
