I have the opportunity to run some data analysis on the computing cluster (MPI+GPU) of my institution.
Until now, I have written and run my models with RStan on my personal workstation.
On the cluster I will instead use CmdStan, which I will compile myself.
Because I’m a first-timer: is there a guide (or example) for compiling and running a model in such an environment, and for checking that the computational resources are being used well?
What’s the context here?
If you’re doing a simulation study, you can fire off as many independent model fits as they’ll give you cores (check the memory footprint of your runs first so you can request enough memory, but usually you won’t max out a single machine). With CmdStan it’s trivial to run 1 chain per core and collect the results afterwards (just encode enough in the filename so you can recombine them). rstan is fine for post-processing the results, but if you can’t get it on the cluster and want to post-process there, there are a few R packages kicking around for dealing with CmdStan output (including mine, github.com/sakrejda/stannis: it runs CmdStan models in parallel, reads and processes output (no diagnostics yet…), and, because I have done a simulation study with it, auto-generates qsub/bsub scripts…).
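To make the "encode enough in the filename" idea concrete, here is a minimal Python sketch. The naming scheme (`model__dataset__chain-N.csv`) and all function names are made up for illustration; the only thing taken from CmdStan itself is that its output CSV mixes `#` comment lines with a header row and draws. It assumes model and dataset names do not themselves contain `__`.

```python
import csv
import re

# Hypothetical naming scheme: <model>__<dataset>__chain-<n>.csv
NAME_RE = re.compile(r"^(?P<model>.+?)__(?P<dataset>.+?)__chain-(?P<chain>\d+)\.csv$")


def output_name(model, dataset, chain):
    """Build an output filename that records which run produced it."""
    return f"{model}__{dataset}__chain-{chain}.csv"


def parse_output_name(name):
    """Recover (model, dataset, chain) from a filename built by output_name."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized output filename: {name}")
    return m["model"], m["dataset"], int(m["chain"])


def read_draws(text):
    """Parse CmdStan-style CSV text: drop '#' comment lines and blank lines,
    treat the first remaining line as the header, the rest as draws."""
    rows = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    return [{k: float(v) for k, v in row.items()} for row in csv.DictReader(rows)]


def combine(files):
    """files maps filename -> CSV text. Returns a dict keyed by (model, dataset)
    holding all draws for that fit, each tagged with its chain number."""
    out = {}
    for name, text in sorted(files.items()):
        model, dataset, chain = parse_output_name(name)
        for draw in read_draws(text):
            draw["chain"] = chain
            out.setdefault((model, dataset), []).append(draw)
    return out
```

In practice you would read the file contents from disk and pass them to `combine`; keeping the parsing separate from the file system makes it easy to try on a couple of small outputs first.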
If you’re running some big model and trying to make it go faster, you probably want to look into the threading functionality for within-machine parallelism (there can be quite a few cores on a cluster node; threading is currently on the develop branch only) and/or wait for the MPI functionality, which lets you spread work among machines. @wds15 has put in an incredible amount of work to get it working, but it might be a month or three before it gets fully merged.
So yeah, without much context that’s it.
Cool! That looks like a useful package. In terms of diagnostics, could you just read it in using read_stan_csv and then throw_sampler_warnings?
I think such a package would be useful to have in the stan repository and to build out. I have created such scripts myself, and it sounds like others need this as well. Have you thought about integrating it with batchjobs?
I really hope we get MPI sooner rather than later. The final PR for stan-math is up for review right now. Once that is in, a small PR for cmdstan is needed to get this working. It feels very close.
@wds15 wrote re:
github.com/sakrejda/stannis, current branch:
tl;dr: I agree, and that’s why I wrote the package. I enjoy working with others and this code could move in the right direction, but it has a narrow feature set and OSS cat-herding is hard.
The survivalstan package discussion is currently demonstrating the problem well. Everybody has written “something”, nobody wants to fold and actively contribute to something else but nobody has the time to do it themselves. It’s OSS chicken.
My goals for Stannis were to have something that installed easily on clusters that had R but might make it hard to install rstan. The development plan is: 1) Write well documented functions; 2) just return lists of stuff*; 3) drop rather than add dependencies; 4) re-design around well-defined objects once the re-design is obvious. This is mainly because I just need it to run my models without being a time-sink.
The main features are 1) run CmdStan based on a .yaml file; 2) save all input and output together while avoiding clobbering; 3) single-machine job-scheduling (so I can say “run 100 models on 4 cores” and come back later to look at what happened).
Cluster-level job-scheduling is the one feature I’d like to add for real (the current stuff is useful but limited). I haven’t had the time to dig in. I do not want to re-write the functionality of packages like batchjobs or have them as a required dependency, but an optional run-time dependency would be perfect.
If you (@wds15 or anybody else) are interested in contributing I’m happy to have a more in-depth discussion about what would or would not work with this particular package.
*Sticking to lists means I’m punting design issues down the road, keeping only a few consistent types of lists around means there’s hope for a smooth re-design around classes later.
Missed this: this is really cool. I’ll try to pay attention if Bob doesn’t have time to review and make some time myself. Limited MPI experience (toy projects) but this would make my life much easier so it’s worth the background.
We really need to make some noise once it’s properly in about how much it’s a killer feature for MCMC in general (because HMC pushes so much computation within-iteration).
Thanks @sakrejda,
I have a multilevel model where I try to fit ~500K measurements with two crossed factors (366x82 levels, ~16 repetitions for each combination). On my workstation the model runs for more than 1 week before crashing.
I will definitely try the first method that @sakrejda described: I will submit 4 array jobs, one chain per core.
I look forward to the MPI implementation, so keep rockin’ guys (@wds15)!
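For the "4 array jobs, one chain each" plan, a small wrapper can turn the scheduler's array index into a CmdStan command line. This is only a sketch: the environment variable name depends on the scheduler (`SLURM_ARRAY_TASK_ID` is the Slurm one and is just an assumed default here), and the executable name, data file name and output file pattern are placeholders. The argument layout follows the usual CmdStan pattern of keeping the same seed and giving each chain a different `id`.

```python
import os


def chain_from_env(var="SLURM_ARRAY_TASK_ID", default=1, environ=os.environ):
    """Read the array-job index from the environment.
    The variable name is scheduler-specific; adjust it for your cluster."""
    return int(environ.get(var, default))


def chain_command(exe, data_file, chain, seed=12345):
    """Build the argument list for one CmdStan chain.
    Same seed + distinct id per chain; the output file name encodes the chain."""
    return [
        exe, "sample",
        "random", f"seed={seed}",
        f"id={chain}",
        "data", f"file={data_file}",
        "output", f"file=samples_{chain}.csv",
    ]


# Placeholder names; print the command this array task would run.
print(" ".join(chain_command("./my_model", "my_data.json", chain_from_env())))
```

The resulting list can be handed to `subprocess.run(...)` from inside the job script, so every array task launches exactly one chain.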
With a big model like that I’d make sure that I had run it in a smaller context and gotten a good step size, no divergences, and a good acceptance rate distribution before scaling up computation.
@sakrejda definitely YES! I am actually running the model on smaller chunks of data on my workstation to understand the RAM usage, so that I can balance my requests for the nodes on the cluster. Any hints or good practices for checking the memory usage?
Yes, MPI is amazing… if your problem is amenable to it. The computational cost per unit over which we parallelize must be rather large. ODE problems are an obvious win.
Hmm… I just recently had a case where I had 35k observations and I tried the parallelization thing using threads with no success. For this problem it turned out that cleverly arranging the data to take advantage of vectorization was the key. These types of problems should benefit a lot from the OpenMP branch which we are looking into right now. I will work on that once MPI is in.
The problem was a multiple imputation problem where each subject was assigned to 1 of 9 cases. The original program looped over the 35k patients, used an if statement to decide which of cases 1-9 applied, and added the respective likelihood contribution for that patient. What I did was sort the data by case and then do 9 additions to the likelihood (roughly speaking), where each likelihood addition is vectorized. That gave the model a major speed bump (from almost not computable to 10 min). These huge loops are ideal for parallelization using OpenMP, I think/hope.
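One rough way to check memory on a workstation before sizing the cluster request is to run a single chain as a child process and ask the operating system for its peak resident memory afterwards. The sketch below uses Python's standard `resource` module (Unix only); the helper name is made up. Note that `RUSAGE_CHILDREN` reports the maximum over all children the calling process has waited for, so it is best to measure one command per script run.

```python
import resource
import subprocess
import sys


def peak_child_rss_mib(cmd):
    """Run cmd to completion and return the peak resident set size (MiB)
    over the child processes this Python process has waited for so far."""
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Stand-in command; replace with the CmdStan call for one chain on a data subset.
demo = [sys.executable, "-c", "x = bytearray(50_000_000)"]
print(f"peak memory: {peak_child_rss_mib(demo):.1f} MiB")
```

Running this on a few data subsets of increasing size gives a feel for how memory grows, which is what the per-job memory request should be based on.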
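The restructuring described above can be sketched outside Stan. The Python below is an illustration only, not the original model: it uses 3 made-up cases with fixed normal parameters instead of the 9 cases in the real problem. It compares a per-subject loop with a version that sorts by case and adds one block contribution per case. In Stan, each block contribution would correspond to one vectorized sampling statement over the slice of data belonging to that case.

```python
import itertools
import math
import random

# Made-up (mu, sigma) per case, purely for illustration.
PARAMS = {1: (0.0, 1.0), 2: (2.0, 0.5), 3: (-1.0, 2.0)}
LOG_SQRT_2PI = 0.5 * math.log(2 * math.pi)


def normal_lpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at a single point."""
    return -LOG_SQRT_2PI - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def normal_lpdf_sum(ys, mu, sigma):
    """Summed log density over a block; shared terms are computed once."""
    ss = sum((y - mu) ** 2 for y in ys)
    return len(ys) * (-LOG_SQRT_2PI - math.log(sigma)) - 0.5 * ss / sigma**2


def loglik_loop(y, case):
    """One contribution per subject; the lookup stands in for the if statement."""
    total = 0.0
    for y_i, c_i in zip(y, case):
        mu, sigma = PARAMS[c_i]
        total += normal_lpdf(y_i, mu, sigma)
    return total


def loglik_grouped(y, case):
    """Sort by case, then add one block contribution per case."""
    order = sorted(range(len(y)), key=lambda i: case[i])
    y_sorted = [y[i] for i in order]
    case_sorted = [case[i] for i in order]
    total, start = 0.0, 0
    for c, block in itertools.groupby(case_sorted):
        n = len(list(block))
        mu, sigma = PARAMS[c]
        total += normal_lpdf_sum(y_sorted[start:start + n], mu, sigma)
        start += n
    return total


rng = random.Random(1)
case = [rng.choice([1, 2, 3]) for _ in range(1000)]
y = [rng.gauss(*PARAMS[c]) for c in case]
print(math.isclose(loglik_loop(y, case), loglik_grouped(y, case), rel_tol=1e-9))
```

Both functions give the same log-likelihood; the grouped version just does the sorting once (in Stan this would be done on the data before it is passed in) and replaces thousands of small additions with one per case.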
Thank you for your response(s).
Would it be possible to present just a simple example of how you can add likelihoods in Stan and how you vectorize them?
Thanks in advance,