cmdstanr and rstan runs fail randomly on a computing cluster

I’m working on a project where I need to repeat the same Stan run 1000 times, each on a different simulated dataset, on a computing cluster. I’ve noticed that the runs fail intermittently when I launch them as Slurm array jobs or parallelize them in R with foreach and %dopar%. If I instead do everything in one R session as an interactive job, without parallelization, the runs finish. This seems to be related to the following post: “Chains finish unexpectedly in new install of CmdStanR”. Is there a solution yet?

More details below. I believe the cluster I’m using runs Linux.

I started with Slurm array jobs, where each job is assigned 2 cores (2 GB of memory each) and does a single-chain Stan run. I tried both cmdstanr and rstan. Every time I submit an array of 1000 jobs, a large number of them fail with the following error:

Chain 1 Iteration: 17100 / 21000 [ 81%] (Sampling)
Warning: Chain 1 finished unexpectedly!

Error: No chains finished successfully. Unable to retrieve the draws.
In addition: Warning message:
No chains finished successfully. Unable to retrieve the fit.
Execution halted

I then have to resubmit the failed runs, often several times, until all of them have finished. This has been rather frustrating.
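For what it’s worth, the manual rerun step could be automated with a small retry wrapper. The sketch below is only a workaround and does not address why the chains die. `run_with_retry` and `flaky_fit` are hypothetical names made up for illustration; in real use the function passed in would both run the sampler and retrieve the draws, so that a failure surfaces as an R error like the one in the log above.

```r
# Hypothetical helper: call `fit_fn` up to `max_tries` times and return the
# first successful result together with the attempt number, or NULL if every
# attempt fails.
run_with_retry <- function(fit_fn, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      fit_fn(attempt),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) {
      return(list(value = result, attempts = attempt))
    }
  }
  NULL
}

# Stand-in for a Stan run (an assumption, not real sampler code): it fails on
# the first two attempts and succeeds on the third.
flaky_fit <- function(attempt) {
  if (attempt < 3) stop("Chain 1 finished unexpectedly!")
  "fit object"
}

res <- run_with_retry(flaky_fit, max_tries = 5)
```

Passing the attempt number to `fit_fn` is a design choice: it lets each retry use a different seed or output directory if that turns out to matter.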

I then tried doing everything in a single R session, using foreach and %dopar% to parallelize the runs. I got the same behavior: some of the runs fail, and I can’t collect the results unless I explicitly add error handling inside the foreach loop.
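The error handling I mean looks roughly like the sketch below. It wraps each iteration in `tryCatch` so that one failed run does not abort the whole loop, and records which runs need to be repeated. `one_run` is a hypothetical stand-in for a Stan run, and the sequential backend is registered only so the sketch runs anywhere; on the cluster a parallel backend would be registered instead.

```r
library(foreach)

# Sequential backend as a stand-in (assumption for this sketch); a parallel
# backend, e.g. from doParallel, would be registered here on the cluster.
registerDoSEQ()

# Hypothetical stand-in for one Stan run: every third run fails.
one_run <- function(i) {
  if (i %% 3 == 0) stop("No chains finished successfully.")
  i^2
}

# Each iteration returns a small record instead of throwing, so the loop
# always completes and failures can be inspected afterwards.
results <- foreach(i = 1:6) %dopar% {
  tryCatch(
    list(id = i, ok = TRUE, value = one_run(i)),
    error = function(e) list(id = i, ok = FALSE, value = conditionMessage(e))
  )
}

# Collect the ids of the runs that failed and need to be rerun.
failed <- vapply(results, function(r) !r$ok, logical(1))
failed_ids <- vapply(results[failed], function(r) r$id, integer(1))
```

foreach also has an `.errorhandling` argument ("stop", "remove" or "pass") that can serve a similar purpose; the explicit `tryCatch` is used here because it keeps the run id next to the error message.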

Lastly, I tried doing everything in a single R session without any parallelization across runs. Instead, I used 10 chains per run and reduced the number of iterations per chain. With this setup, all the Stan runs finish (using cmdstanr), but it takes much longer than it would if the Slurm array jobs worked.

Has anyone experienced a similar issue, or does anyone know what causes it and how I should change my setup to make things work?

Many thanks!