Slurmstepd: error: exceeded memory limit


#1

Dear Stan Group,

I have written a clustering algorithm that takes a dataset as input, runs over a range of cluster sizes, and picks the best-fitting cluster size. For parameter estimation, the algorithm uses the rstan package in RStudio (v. 1.0.153, with R v. 3.4.1) to run a Monte Carlo Expectation Maximization algorithm via MCMC.

I’m using the sampling() function in rstan to run 3 chains in parallel with iterations set to 1000 or more. The algorithm was tested on a Mac Pro and gives the expected results. The run time per dataset is about 20 hours.

I need to run many simulation studies with this method, so I have reached out to Canada’s high performance computer cluster for students, called Sharcnet. Some info on this cluster: 33448 CPU cores, 1043 nodes, 149 TB RAM. I tried running my code on Sharcnet (using --nodes=1, --mem=100000, i.e. 100000 MB or roughly 100 GB, and --cpus-per-task=32). Unfortunately, since March 2017 I have been getting the following error:

slurmstepd: error: Job 445463 exceeded memory limit (104499848 > 102400000), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 445463 ON gra600 CANCELLED AT 2017-08-24T18:21:58 ***
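
As a rough sanity check on the units in that message, here is a small R sketch. It assumes both figures in the kill message are reported in kilobytes; that is an assumption, though it is consistent with a request of 100000 MB.

```r
# Assumption: slurmstepd reports both figures in KB.
used_kb  <- 104499848   # memory in use when the job was killed
limit_kb <- 102400000   # job memory limit

limit_mb    <- limit_kb / 1024                         # limit expressed in MB
overage_mb  <- (used_kb - limit_kb) / 1024             # amount over the limit, in MB
overage_pct <- 100 * (used_kb - limit_kb) / limit_kb   # same, as a percentage

cat(sprintf("limit: %.0f MB, over by %.0f MB (%.2f%%)\n",
            limit_mb, overage_mb, overage_pct))
```

Under that assumption the job was only a couple of percent over its limit when it was killed, which fits memory use that creeps up over time rather than a single oversized allocation.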

I have been trying to solve this issue with Sharcnet and associated scientists. At first we thought it was my own code that was giving the issue. But two days ago, I got this reply from one of the scientists:

“I did a couple of test jobs on different nodes in our cluster and unfortunately I ran into the same issues with the memory. I know this is not encouraging, but I was also able to discover a few more things… as a matter of fact, I have a theory of what I think might be going on… rstan or the mcmc package you are using is running its own parallelism by launching mclusters. If you monitor the evolution of the job in real time, you will notice that at some points there are hundreds of processes running (I counted up to 322), each of them using between 0.2 and 1% of the total memory, so you can see how this will add up and basically drain all the memory.
This behavior of R is apparently known in the community, and it appears to be related to the OS; in this case the graham and SciNet clusters both use the CentOS Linux OS. It may be the case that your laptop (using macOS) and my workstation, which uses an Ubuntu Linux distribution, handle this in a better way, or they simply allow swapping (a technique that lets the computer use disk as if it were memory), which is not available on this type of cluster system.
As for possible solutions, one thing is to start looking for alternative ways of performing the mclust parallelism, or to find a way to take control of all these “left over” processes that R leaves…”

Do you have any suggestions for this issue with rstan?

Thank you very much!


#2

What happens if you call Stan with cores = 1?


#3

Hello @bgoodri

Thank you for the prompt reply. I have tried commenting out the following options when running code in the past.

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

According to the R documentation, I believe this means the cores option in the sampling() function is set to 1 by default. This still produced the same error I described earlier.
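
For reference, a minimal sketch of how that default resolves, assuming (as the rstan documentation describes) that sampling() takes its default from getOption("mc.cores", 1L). The helper name effective_cores is hypothetical, and the sampling() call is shown only as a comment with placeholder objects.

```r
# Hypothetical helper: the core count sampling() would use when `cores`
# is not passed, assuming its default is getOption("mc.cores", 1L).
effective_cores <- function() getOption("mc.cores", 1L)

options(mc.cores = NULL)          # same state as never setting the option
cores_unset <- effective_cores()  # falls back to 1L

# Passing cores explicitly removes any dependence on global options.
# `model` and `stan_data` are placeholders, so this line is not run:
# fit <- rstan::sampling(model, data = stan_data, chains = 3, iter = 1000, cores = 1)
```

Setting cores = 1 in the call itself makes the test unambiguous, since it no longer matters whether some other script or startup file has set mc.cores.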

Do you have any other suggestions?

Thank you.


#4

Does example(mclapply, package = "parallel") work on the Canada cluster?


#5

For a cluster, I’d recommend running CmdStan—it only needs the dynamic memory overhead to compute the log density and gradients—all draws are streamed out. RStan holds onto the draws in memory and requires additional overhead for some copying.

You can use RStan to write the data out (with stan_rdump) in the form needed for CmdStan. Then you can read the chain output back into R for analysis.
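
A rough sketch of that round trip, assuming rstan is installed. The variables N and y, the file names, and the compiled executable name ./model are all placeholders, and the CmdStan and read-back steps are shown as comments because they depend on a compiled model.

```r
# Placeholder data standing in for the real dataset.
N <- 5L
y <- c(1.2, 0.7, 3.1, 2.4, 1.9)
data_file <- file.path(tempdir(), "model.data.R")

if (requireNamespace("rstan", quietly = TRUE)) {
  # stan_rdump takes the *names* of the objects to write, as a character vector.
  rstan::stan_rdump(c("N", "y"), file = data_file)
}

# Outside R, run one CmdStan process per chain (executable name is a placeholder):
#   ./model sample data file=model.data.R output file=chain_1.csv
#
# Back in R, combine the per-chain CSV files into a stanfit object:
#   fit <- rstan::read_stan_csv(c("chain_1.csv", "chain_2.csv", "chain_3.csv"))
```

Each chain can then be submitted as its own cluster job with a modest memory request, since no chain needs to hold the others' draws.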

If you run multiple chains in R, it will fire off parallel jobs if you’ve set it that way (the magic incantation is printed when rstan is loaded into R).
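
One thing worth checking on a Slurm cluster: parallel::detectCores() typically reports the cores on the whole node, not the number allocated to the job. A minimal sketch of an alternative, assuming the job was submitted with --cpus-per-task so that Slurm sets the SLURM_CPUS_PER_TASK environment variable (the helper name slurm_cores is hypothetical):

```r
# Hypothetical helper: use the Slurm allocation if present, else a safe default.
slurm_cores <- function(default = 1L) {
  n <- suppressWarnings(
    as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = ""))
  )
  # Unset or malformed values come back as NA; fall back to the default.
  if (is.na(n) || n < 1L) default else n
}

# Use this in place of parallel::detectCores() in the usual incantation.
options(mc.cores = slurm_cores())
```

With 3 chains, rstan would still run at most 3 worker processes, so requesting far more CPUs than chains does not by itself speed up sampling.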