Using Stan on a computing cluster. Any advice?

What’s the context here?

If you’re doing a simulation study you can fire off as many independent model fits as they’ll give you cores (check the memory footprint of your runs first so you can request enough memory but usually you won’t max out a single machine). With CmdStan it’s trivial to do 1 chain per core and collect results after (just encode enough in the filename so you can recombine after. rstan is fine for post-processing the results but if you can’t get it on the cluster and want to post-process there, there are a few R packages kicking around for dealing with CmdStan output (including mine:, it runs CmdStan models in parallel, reads and processes output (no diagnostics yet…), and because I have done a simulation study with this, it auto-generates qsub/bsub scripts…).

If you’re running some big model and trying to make it go faster you probably want to look into the threading functionality for within-machine parallelism (there can be quite a few cores on a cluster node, currently on develop branch only) and/or wait on the MPI functionality (that lets you spread work among machines (@wds15 has put in an incredible amount of work to get it working but it might be a month or three before it gets fully merged into develop).

So yeah, without much context that’s it.


Cool! That looks like a useful package. In terms of diagnostics, could you just read it in using read_stan_csv and then throw_sampler_warnings?

I think such a package would be useful to have in the stan repository and build it out. I have created such scripts myself and it sounds like others need this as well. Have u thought about integrating it with batchjobs?

I really hope that we get sooner than later MPI. The final PR for stan-math is up for review right now. Once that is in a small PR for cmdstan is needed to get this working. It feels very close.

@wds15 wrote re:, current branch:

tl;dr: I agree that’s why I wrote the package. I enjoy working with others and this code could move in the right direction but it has a narrow feature set and OSS cat-herding is hard.

The survivalstan package discussion is currently demonstrating the problem well. Everybody has written “something”, nobody wants to fold and actively contribute to something else but nobody has the time to do it themselves. It’s OSS chicken.

My goals for Stannis were to have something that installed easily on clusters that had R but might make it hard to install rstan. The development plan is: 1) Write well documented functions; 2) just return lists of stuff*; 3) drop rather than add dependencies; 4) re-design around well-defined objects once the re-design is obvious. This is mainly because I just need it to run my models without being a time-sink.

The main features are 1) run CmdStan based on a .yaml file; 2) save all input and output together while avoiding clobbering; 3) single-machine job-scheduling (so I can say “run 100 models on 4 cores” and come back later to look at what happened);

Cluster-level job-scheduling is the one feature I’d like to add for real (the current stuff is limited if useful). I haven’t had the time to dig in. I do not want to re-write their functionality or have them as a required dependency, but an optional run-time dependency would be perfect.

If you (@wds15 or anybody else) are interested in contributing I’m happy to have a more in-depth discussion about what would or would not work with this particular package.

*Sticking to lists means I’m punting design issues down the road, keeping only a few consistent types of lists around means there’s hope for a smooth re-design around classes later.

1 Like

Missed this: this is really cool. I’ll try to pay attention if Bob doesn’t have time to review and make some time myself. Limited MPI experience (toy projects) but this would make my life much easier so it’s worth the background.

We really need to make some noise once it’s properly in about how much it’s a killer feature for MCMC in general (because HMC pushes so much computation within-iteration).

Thanks @sakrejda ,
i have a multilevel model where I try to fit ~500K measures with two crossed factors (366x82 levels, ~16 repetitions for each combinations). On my workstation the model takes more than 1 week before crashing.
Definitely I will try with the first method that @sakrejda described: I will submit 4 array jobs, one chain for each core.
I’ll look forward the MPI implementation, so keep rockin’ guys (@wds15) !

With a big model like that I’d make sure that I had run it in a smaller context and gotten a good step size, no divergences, and a good acceptance rate distribution before scaling up computation.

1 Like

@sakrejda definitely YES! I am actually working on running the model on smaller chunks of data on my workstations and understand the RAM usage so to balance my requests on the nodes on the cluster. Any hints or good practices about checking the memory usage?

Yes, MPI is amazing… if your problem amends to it. So the computation cost per unit over which we parallelize must be rather large. ODE problems are an obvious win.

Hmm… I just recently had a case where I had 35k observations and I tried the parallelization thing using threads with no success. For this problem it turned out that cleverly arranging the data to take advantage of vectorization was they key. These types of problems should benefit a lot from the openMP branch which we are looking into right now. I will work on that once MPI is in.

@wds15 what do you mean for “cleverly arranging data”?

The problem was a multiple imputation problem where each subject was assigned to 1 of 9 cases. Thus the original program involved a loop over the 35k patients and decided with an if statement on the case 1-9 and add the respective likelihood contribution for the given patient. What I did is to sort the data by case and then do 9 additions to the likelihood (roughly speaking) where each likelihood addition is vectorized. That gave the model a major speed bump (almost not computable to 10min). These huge loops are ideal for parallelization using openMP, I think/hope.

1 Like

Thank you for your response(s).

Would it be possible to present just a simple example of how you can add likelihoods in Stan and how you vectorize them?

Thanks in advance,

Likelihoods can be added to the target density like anything else. ~ does it implicitly.

The very first example in the manual discusses vectorization, e.g., allowing y ~ normal(mu, sigma); where any or all of y, mu, and sigma are vectors.

I also had some questions about running Stan on a cluster, in my case PyStan, so I thought I’d add to this thread instead of starting a separate one.
My understanding is that PyStan uses Python’s multiprocessing library to run parallel chains so running the sampling function using n_jobs=N would require os.cpu_count() == N at least.

My question is the following, for the case above are there any performance benefits in having more than N processors available?

I ask because I need to request a specific number of nodes or processors for any job submitted to the cluster, and it’s not a big deal to require a few more for a moderate performance gain, but I don’t want to overuse their resources…
Thank you.

I do not have experience with PyStan and on the cluster available to me I use CmdStan.
However, I do not see particular benefits in terms of speed: I just delegate a cluster to run my chains avoiding to hang my PC at work. However figuring out how compiling stan on the cluster took a a bit of work…
I know that there is a lot of improvement in the “parallel Stan’s universe” (map_rect, GPUs) with a lot of work in the math library, but this I still can not understand how these new opportunities work.

1 Like

Thanks. Luckily they had no problem having PyStan installed upon request, and I’m hoping to be lucky enough that all build utils are all there too for model compilation.

At this point I’m just trying to run independent chain in parallel and wondering if the shared resources would affect the total speed, but you’re saying you don’t think so. So far I was able to run most individual analyses on the my desktop, where there’s plenty of memory and cores to spare, so I wouldn’t see any slowing down if Stan needed “additional” cores besides the ones running each of the n_jobs.
I just want not have those slow down when I move them to nodes where I have to predefine the resources.

I’m also looking forward to more low level parallelization, like GPU optimized large matrix operations and other vector mapping parallelization. For the more core functionalities of Stan that’s really the job of the developers, but it would be great to be able for instance to implement CUDA code for the model computations. Either way that may free up some bottlenecks in inference. Thanks again for the help.

Yep: I am looking forward too.
In particular the GPU speed-ups because almost every workstation have a GPU card mounted on. And I use a workstation to build the prototypes of my models that will later run on the cluster. Basically, I am trying to translate some spatial models (that I run with INLA) to STAN.
At this time, I think the OpenCL functionalities are still at a development stage.
If you find something interesting that improve the speed of a model, please, share your advices and considerations with all the community!

Multi-core parallelism is much more general for Stan models as we can parallelize any likelihood with conditionally independent data points. That’s already out in 2.18.

The first GPU functionality for Stan 2.19 will be based on OpenCL, because not everyone has an NVIDIA graphics card (most cluster do).

@rok_cesnovar and @Erik_Strumbelj are looking into CUDA support, but don’t expect it soon.

GPUs aren’t magic—they’re just helpful in the case of large dense matrix ops, which is what we’re (Rok, Erik, and @stevebronder and @seantalts) are implementing. All of the MCMC, optimization and VI are very efficient and hard to parallelize—basically just a bunch of dot products that are very small in scale compared to the log density calculations. If we get more serious about dense metric estimation, then we might be able to roll in GPU functionality. Otherwise, the cost to ship a vector to the GPU for a dot product would likely not be a win.

They’re already getting merged into our develop branch. The first batch of GPU-enabled matrix ops (matrix-matrix multiply, Cholesky factorization) will be out in 2.19 (perhaps more, depending on how long we wait).

@anon75146577 is the one to consult about that—he works on INLA and on Stan (not all caps as it’s not an acronym).

1 Like

Thanks as always for the answer.
I will wait for @anon75146577’s work (and the many other) working in getting GPs faster.
What interests me most of all is not only having correct inferences, but also learning the mathematical theory that in many packages is delegated to black-boxes: in this Stan (this time is written correctly, thank you for pointing it out) excellent tool for a newbie like me.
So I look forward to a mature, stable code and some examples of good, tested practices (to which I hope to contribute!)

This is a really important point for beginners (such as myself). I spent the last few days trying to figure this out, and I didn’t realize that MPI was for parallelization within-chains, not between-chains. Thanks!