Problem running brms on cluster

I need to fit a model using brms on a Linux cluster (Ubuntu 18.04.5 LTS, managed via Slurm). I built a Docker image that installs brms (Docker Hub; by the way, I’m very new to this, so I would also really appreciate any feedback on the Dockerfile). I then use an sbatch file, which calls Rscript to run the R code I need. The problem is that soon after sampling starts, I get this error:

Compiling Stan program…
Start sampling

SAMPLING FOR MODEL ‘9a6b394498a04dca808116db70d50036’ NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.128275 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1282.75 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 4000 [ 0%] (Warmup)
srun: error: compute-3: task 0: Segmentation fault (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=1867745.0

It’s not a matter of there not being enough RAM: I’ve run this model successfully on machines that had far less RAM. The reason I had to switch to the cluster is that on those machines I used to get a crash at the later stage of making predictions or calculating WAIC/LOO, presumably because of a lack of RAM.

This is the call to brms:

rep1_gam_no_sr = brm(
  peak_dens ~ s(distance) + (1 | genes),
  family = "bernoulli",
  data = gam_data,
  cores = 1,
  prior = set_prior("normal(0, 100)"),
  control = list(adapt_delta = 0.96),
  iter = 4000
)

Note that I am running it with cores = 1 because when I run it with cores = 4, I get this:

Chain 2: Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 1: Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 3: Iteration: 1 / 4000 [ 0%] (Warmup)
Chain 4: Iteration: 1 / 4000 [ 0%] (Warmup)
Error in FUN(X[[i]], …) :
trying to get slot “mode” from an object of a basic class (“NULL”) with no slots
Calls: brm … eval → .fun → .fun → .local → sapply → lapply → FUN
In addition: Warning message:
In mccollect(jobs) : 4 parallel jobs did not deliver results
Execution halted

I would be very grateful for any ideas!

Hi and welcome,

Sorry about the delay in getting to your question. Does your model run with a subset of the data just on a desktop or laptop?

I don’t know much about Slurm, but I remember you can get those seg faults if the programs that manage memory sharing are not set up correctly. The Slurm community might be better able to help with the setup.

ara

Dear Ara,

It is now my turn to apologize for not replying sooner; it’s been a crazy week.

Yes, the model runs fine on computers not managed by slurm (the reason I need to take it to the big HPC is because on those other machines, I run out of memory at later steps like making predictions or getting WAIC scores).

I will take your advice of checking with the slurm community.

Thank you for your help!

No worries. Were you able to get this sorted out? I went back through and re-read your question.

Does the model run with cores = 4 on machines without slurm? How many data points and what is the structure of the data?

I know there are some tricks for getting better performance in Stan/brms, like restructuring the data to remove non-informative cells and narrowing down the priors.
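To illustrate the point about priors, here is a rough base R sketch of what a normal(0, 100) prior implies on the probability scale for a Bernoulli model with the default logit link. It treats a single prior draw as if it were the whole linear predictor, which is a simplification, and the narrower normal(0, 2.5) is just an arbitrary comparison value chosen for this example, not something suggested in this thread.

```r
# Illustrative only: compare what two normal priors on the logit scale imply
# on the probability scale. normal(0, 100) is the prior from the brm() call
# in the original post; normal(0, 2.5) is an arbitrary narrower choice picked
# for this sketch, not a recommendation from the thread.
set.seed(42)
n_draws <- 10000

wide_draws   <- rnorm(n_draws, mean = 0, sd = 100)
narrow_draws <- rnorm(n_draws, mean = 0, sd = 2.5)

# Share of draws whose implied probability is below 1% or above 99%.
share_extreme <- function(logit_draws) {
  p <- plogis(logit_draws)  # inverse logit
  mean(p < 0.01 | p > 0.99)
}

extreme_wide   <- share_extreme(wide_draws)
extreme_narrow <- share_extreme(narrow_draws)

print(c(wide = extreme_wide, narrow = extreme_narrow))
```

If most of the mass under the wide prior lands at probabilities very close to 0 or 1, that is a hint the prior is telling the sampler very little, and a tighter one may be worth trying.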

Hi Ara,

Thanks for going back to this problem. The thing is, I have a lot of deadlines coming up on other projects right now, so I had to put the modelling, which is more of a passion project, on the back burner for a bit. Yes, on other (non-Slurm) machines it runs just fine with cores = 4. I haven’t managed to solve this issue, but I am actually learning Stan at the moment, so I am hoping that once I know how to build models this complicated in Stan, I will be able to bypass brms altogether. If the problem is coming from brms and not Stan itself, then this may solve the problem…

Have you been able to run the script inside the Docker container on your own machine? It’s possible that the image isn’t configured correctly. I went through all sorts of weird, sporadic difficulties using brms/Stan in a Docker image on a Slurm HPC before I had everything figured out. A lot of it ended up being caused by host system settings accidentally bleeding through to the container, which led to some difficulties with the compiler. If you’re okay with using a slightly old Stan version (rstan v. 2.21.2), you’re welcome to try my Docker image.

If you can get the Docker image to run the model locally (or on a smaller subset), then I’d next recommend trying to run it on the HPC in interactive mode; I’ve found that makes it much easier to figure out why things aren’t working.
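For the smaller-subset test, something along these lines might work as a starting point. It is only a sketch: the toy gam_data below is made up so the snippet runs on its own (it just mirrors the three columns used in the model formula), and subset_by_gene and fit_smoke_test are hypothetical helper names, not part of brms. The idea is to keep whole gene groups so the (1 | genes) term still makes sense, then fit the same model with a single short chain.

```r
# Toy stand-in for gam_data (made-up values). The real data is assumed to
# have the three columns used in the model formula: peak_dens, distance, genes.
set.seed(1)
gam_data <- data.frame(
  peak_dens = rbinom(600, size = 1, prob = 0.3),
  distance  = runif(600, min = 0, max = 1000),
  genes     = factor(sample(paste0("gene", 1:60), 600, replace = TRUE))
)

# Keep all rows belonging to a random sample of `n_genes` grouping levels,
# so every retained gene keeps its complete set of observations.
subset_by_gene <- function(data, n_genes) {
  keep <- sample(levels(droplevels(data$genes)), n_genes)
  out <- data[data$genes %in% keep, , drop = FALSE]
  out$genes <- droplevels(out$genes)
  out
}

gam_small <- subset_by_gene(gam_data, n_genes = 10)

# Same model and prior as in the original post, but one short chain.
# Defined here and not called, since it needs brms and a working compiler.
fit_smoke_test <- function(data) {
  brms::brm(
    peak_dens ~ s(distance) + (1 | genes),
    family = "bernoulli",
    data = data,
    prior = brms::set_prior("normal(0, 100)"),
    chains = 1, cores = 1, iter = 200
  )
}
```

Calling fit_smoke_test(gam_small) inside the container, first locally and then in an interactive session on the cluster, should show fairly quickly whether the crash depends on the data size or on the environment.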

Hi Christopher,

I would love to take a look at your Docker image to compare it to what I put together but for some reason, the link doesn’t seem to be working. It’s highlighted but I can’t click on it. Could you perhaps paste the link as raw text?

Thank you for your trouble,
Rosina.

Link fixed (you can also pull it as crpeters/docker-stan:21.1.2-mkl).

Christopher, it all works like a charm with your Docker file so this is indeed where the problem must have been! Thank you so much for your help!