More of a discussion question than something to troubleshoot…
I’m running some Stan models from R (chains = 4, cores = 4). I wrapped the model implementation in a function that writes out each fitted Stan model to a unique directory.
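Roughly, the wrapper looks like this (a simplified sketch; the names are made up):

```r
library(rstan)

# Fit one model and save the result to its own directory
fit_one <- function(stan_file, data, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  model <- stan_model(stan_file)
  fit <- sampling(model, data = data, chains = 4, cores = 4)
  saveRDS(fit, file.path(out_dir, "fit.rds"))
  invisible(fit)
}
```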
I have about 50 models that I want to run. I usually code on a 72-core server (with RStudio installed) that I access via remote desktop. I also have institutional access to a large cluster that I can send jobs to, but I’ve never tried sending a Stan model to it.
What’s the optimal way to run a bunch of Stan models that each use multiple cores for their chains? Is there a most efficient or recommended way to parallelize Stan runs? And how do I do the accounting for (number of models running at once) * (number of cores each Stan model uses)?
I’d be very grateful for general advice as well as specific R implementations that have worked for other people. Thank you!
I’m not exactly an expert on this, but given nobody else has chimed in, let me give it a shot.
When you say run, what do you want out of each model? The whole sample or just a posterior summary or something else?
If you’re going to run from R, I’d recommend running cmdstanr rather than rstan as it’ll be easier to install on a cluster. Even better, just use CmdStan directly as then you don’t need to worry about the R toolchain. But that’ll only get you the sample, not any downstream analysis.
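For example, a cmdstanr version of a wrapper like the one described above might look like this (just a sketch; `fit_one` and its arguments are illustrative names):

```r
library(cmdstanr)

fit_one <- function(stan_file, data, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  mod <- cmdstan_model(stan_file)
  # output_dir has CmdStan write the draws (CSV files) straight
  # into the per-model directory
  mod$sample(data = data, chains = 4, parallel_chains = 4,
             output_dir = out_dir)
}
```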
I’m afraid we don’t have that kind of scheduler implemented in Stan itself. What it would do is use a process or thread pool to load balance everything. Barring that, it’s going to be an empirical question of how good the memory bandwidth is on those servers. If the models are all roughly the same in terms of how long they take to fit, just run a batch of 18 at a time, each with 4 parallel chains, as that’ll occupy all 72 cores. But filling all the cores may not be optimal if there’s memory contention, so it may work best to run only 12 at a time or some other number less than 18.
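Concretely, one way to run batches like that from R is `parallel::mclapply()` over the models (a sketch, assuming the `fit_one()` wrapper above and a hypothetical `model_specs` list; `mclapply()` forks, so this only works on Linux/macOS):

```r
library(parallel)

# model_specs: hypothetical list, one element per model, each holding
# a Stan file path, a data list, and an output directory.
# 18 models x 4 chains = 72 cores; drop mc.cores to 12 or so if
# memory contention bites.
results <- mclapply(
  model_specs,
  function(spec) fit_one(spec$stan_file, spec$data, spec$out_dir),
  mc.cores = 18
)
```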
If you just have a big cluster that’ll manage this all for you at a higher level, I’d recommend just firing up 50 jobs, one per model, and letting the cluster handle the load balancing.
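For instance, with a SLURM-style scheduler (an assumption on my part; adapt for whatever your cluster runs), you could submit an array job like `sbatch --array=1-50 job.sh`, where each task runs an R script along these lines:

```r
# run_model.R: one scheduler array task per model (sketch)
library(cmdstanr)

source("fit_one.R")  # the fit_one() wrapper sketched above (hypothetical file)

# SLURM_ARRAY_TASK_ID tells each task which model it owns
idx <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))

# model_specs.rds: hypothetical file holding the same list of model
# specs as above, saved beforehand with saveRDS()
model_specs <- readRDS("model_specs.rds")
spec <- model_specs[[idx]]

fit_one(spec$stan_file, spec$data, spec$out_dir)
```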