Hello everyone
I have a few general questions that I would appreciate answers to from anyone familiar with this topic (most likely these questions have arisen, or will arise, for many people).
My questions are about parallelization of Stan models, so I would be glad if anyone with information on this topic (even beyond the questions I raise) could share it in this post so that everyone can benefit.
I should mention that I am aware that in the model-fitting function itself (such as the stan() function in the rstan package or the mod$sample() method in the cmdstanr package) it is possible to parallelize with the help of an option such as cores = getOption("mc.cores", 20).
-
Besides that, is parallelization also possible within the model itself, with functions such as reduce_sum()?
-
Is there much difference in execution speed between parallelizing only across chains and additionally using within-model functions such as reduce_sum() at the same time?
-
And if we want to set up parallelization with the help of these functions or packages in Stan, how do we do it? (I mean a clear, step-by-step guide that explains how to perform the parallelization.)
Thank you in advance for your response.
Hey Mohammad, I’ll give it a crack. The parallelisation arguments such as cores refer to the number of cores used to fit the model, but without within-chain parallelisation options you’re essentially capped by the number of chains. For instance, if you set chains = 4, parallel_chains = 4, cmdstanr runs 4 chains in parallel. More chains obviously means more posterior draws, and you can check for convergence with different initial values.
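As a minimal sketch of that chain-level parallelisation in cmdstanr (mod and data_list here just stand in for your compiled model and data):

```r
# Four chains, each running on its own core
fit <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4
)
```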
Within-chain parallelisation can split the work of a single chain across multiple cores, so if you have lots of observations you can spread, for instance, the log-likelihood computations of all of your observations across multiple cores. For instance, if you have a reduce_sum() in your Stan program and you set chains = 4, parallel_chains = 4, threads_per_chain = 5 (assuming you have 20 cores), each chain is able to use 5 cores for the computations. There’s a trade-off as there’s overhead involved with reduce_sum(), but you should see substantial speed improvements if you have lots of observations.
As for examples, there are some in the User’s Guide.
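To give a flavour of what that looks like end to end, here’s a rough sketch (a toy logistic regression, assuming Stan >= 2.26 syntax; the y and x data objects are placeholders, not anything from this thread):

```r
library(cmdstanr)

# Toy Stan program: the log likelihood is accumulated with reduce_sum(),
# which slices the data and spreads the partial sums over threads
stan_code <- "
functions {
  // log-likelihood contribution of one slice of the observations
  real partial_sum(array[] int y_slice, int start, int end,
                   vector x, real alpha, real beta) {
    return bernoulli_logit_lpmf(y_slice | alpha + beta * x[start:end]);
  }
}
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
  vector[N] x;
}
parameters {
  real alpha;
  real beta;
}
model {
  int grainsize = 1;  // let the scheduler pick the slice size
  alpha ~ normal(0, 2);
  beta ~ normal(0, 2);
  target += reduce_sum(partial_sum, y, grainsize, x, alpha, beta);
}
"

# Compile with threading support; reduce_sum() needs stan_threads = TRUE
mod <- cmdstan_model(write_stan_file(stan_code),
                     cpp_options = list(stan_threads = TRUE))

# 4 chains x 5 threads per chain = 20 cores in total
fit <- mod$sample(
  data = list(N = length(y), y = y, x = x),
  chains = 4,
  parallel_chains = 4,
  threads_per_chain = 5
)
```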
Yes, but the speedup is going to vary. As one extreme case, imagine I have a pharmacokinetic model that has to solve a differential equation between every dose of a drug a patient gets in a clinical trial. Being able to solve those diff eqs on different cores or in different threads is a huge win, because the solver is doing an enormous amount of work with almost no communication overhead. In these cases, the speedups are proportional to the number of cores.
The problem you will find is that as you try to run more things in parallel, you start getting more compute and memory contention. For example, on my old iMac, running 4 parallel chains twice, one group of four after the other, takes roughly the same time as running 8 parallel chains. That’s because with 8 parallel chains, there’s enough memory contention to double the compute time.
We can parallelize in a lot of ways. We have some core OpenCL parallelization in our basic algorithms that’s available when multithreading is turned on. We can scale out to multiple machines over multiple cores using MPI, or we can scale up on a single machine with multiple threads using multithreading. We can parallelize by running multiple Markov chains in parallel (which is very easy, but can lead to memory contention), or by parallelizing each execution.
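For what it’s worth, here is a rough sketch of how a couple of those options are typically switched on from cmdstanr (this assumes a working OpenCL driver and device; the file name model.stan and the data_list object are placeholders):

```r
library(cmdstanr)

# Multithreading (used by reduce_sum() / map_rect()):
mod_threads <- cmdstan_model("model.stan",
                             cpp_options = list(stan_threads = TRUE))

# OpenCL (e.g. GPU) support: compile with stan_opencl,
# then pick the platform/device ids at sampling time
mod_opencl <- cmdstan_model("model.stan",
                            cpp_options = list(stan_opencl = TRUE))
fit <- mod_opencl$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4,
  opencl_ids = c(0, 0)  # c(platform id, device id)
)
```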
But what you’ll find is that Stan is not very good at parallelization. If you’re looking to set up some massively parallel code on GPUs or even on multiple cores, I’d suggest looking at PyTorch or JAX. You can find samplers in the Blackjax package.
Hi
Thank you very much for your answer. I have solved my problem now.
I am implementing a change-point model on survival data, and I thought my model was running slowly; in fact my modelling was wrong, and now that I have corrected it, it is very, very fast.
Hello
Thank you very much for your answer. I have solved my problem now.
But I agree with what you said, that “Stan is not very good at parallelization.” I tried to implement parallelization several times and ran into some challenges; it was not easy at all.
I meant not very good in performance! It’s also a pain to code.