Race conditions between independent cmdstan model runs

alison · March 31, 2023, 9:19am

Dear all,

I am seeing what looks to me like a race condition, or some other interaction, between independent CmdStan model runs (via cmdstanr) when fitting several models on an HPC cluster. The story is as follows:

When I run one model at a time, everything works fine (even with several chains in parallel).
When several models run at the same time (submitted as independent jobs to the HPC cluster), some of the chains in some models crash during sampling with the message "Chain X finished unexpectedly!".
However, this only occurs if several models run simultaneously on the same compute node. If the models run on separate nodes, everything works fine.

The riddle to me is how this apparent interaction between models could possibly occur. Some more details:

If the chains are run sequentially and one chain crashes, the following chains may still finish successfully. If the chains are run in parallel, however, all of them crash at the same time.
I make sure that the runs have independent output directories, and I request enough resources, in particular plenty of temporary storage for the output directory. In any case, resource requests are per job, so there should be no resource competition between individual model runs…
This problem also occurs when running different models with different cmdstan executables and different input data simultaneously on the cluster.
When I inspect the output / log of a failed chain, there is no further error message etc. It just stops.
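
To make the "independent output directories" point concrete, here is a minimal Python sketch of one way to derive a per-job directory. It assumes the job runs under SLURM, which sets the `SLURM_JOB_ID` environment variable inside each job; the helper name `job_output_dir` and the base directory `stan_output` are hypothetical. This only illustrates the isolation described above and is not a suggested cause of the crashes.

```python
import os
from pathlib import Path


def job_output_dir(base):
    """Return (and create) an output directory that is unique to this job."""
    # SLURM sets SLURM_JOB_ID inside a job; fall back to the process ID elsewhere.
    job_id = os.environ.get("SLURM_JOB_ID", f"pid{os.getpid()}")
    out_dir = Path(base) / f"job_{job_id}"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


if __name__ == "__main__":
    # "stan_output" is a hypothetical base directory.
    print(job_output_dir("stan_output"))
```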

Any ideas on what else I could look into would be highly appreciated!

Thank you all in advance
Adrian

alison · June 29, 2023, 3:04pm

Short update from my side: I found a workaround to this problem, which hints at a potential bug in CmdStan that is rather deeply buried. I containerized all the jobs using Singularity on our HPC. However, this didn’t solve the problem at first, because Singularity has some default bind paths. Only after I explicitly excluded /proc from the bind paths did the interactions between independent jobs disappear. Now everything is working perfectly fine and there are no crashing chains.
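
For illustration, a minimal Python sketch of how such a containerized launch might be assembled. It assumes a Singularity version that supports the `--no-mount` option for skipping default mounts such as `proc`; the image name `cmdstan.sif`, the script name `fit_model.R`, and the helper `singularity_exec_argv` are hypothetical. This is not necessarily the exact invocation used in the workaround above.

```python
import shlex


def singularity_exec_argv(image, command, no_mount=("proc",)):
    """Build an argv list for `singularity exec` with selected default mounts disabled."""
    argv = ["singularity", "exec"]
    if no_mount:
        # Assumption: --no-mount accepts a comma-separated list of mounts to skip.
        argv += ["--no-mount", ",".join(no_mount)]
    return argv + [image] + list(command)


if __name__ == "__main__":
    # "cmdstan.sif" and "fit_model.R" are hypothetical names.
    argv = singularity_exec_argv("cmdstan.sif", ["Rscript", "fit_model.R"])
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run` from a job script, or the printed command can be pasted into a SLURM batch file.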

I find it notable that separating /proc between the jobs was necessary to solve the issue. Maybe someone has an idea what could be going on behind the scenes? Given that the chains always crashed exactly when the first model on the compute node finished sampling, could there be some issue with a misdirected pointer (pointing to the wrong instance of Stan) or something similar?

Additional info: On the HPC, jobs are managed using SLURM.

mitzimorris · June 29, 2023, 5:12pm

perhaps this is an R problem - you’re running too many R jobs at once?

alison · June 30, 2023, 8:57am

@mitzimorris I might have a misconception here, but so far I thought that, because cmdstanr just calls CmdStan, crashes during sampling should not be related to R…?

mitzimorris · July 3, 2023, 1:59pm

crashes during sampling might be related to the R library used to dispatch the processes that run CmdStan. @rok_cesnovar ?
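
One way to tell the two explanations apart might be to launch the compiled model executable directly, without R in between, and look at how the process ends. Below is a minimal Python sketch of that check, assuming a POSIX system where a child killed by a signal is reported with a negative return code. The helper names `describe_exit` and `run_and_report` are hypothetical, and the demo uses a stand-in child process rather than a real model.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a human-readable cause (POSIX)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as return code -N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"


def run_and_report(argv):
    """Run a command to completion and describe how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Stand-in child that kills itself, mimicking a chain that stops without a message.
    # For a real check, replace it with the compiled model's command line, e.g.
    # ["./my_model", "sample", "data", "file=data.json", "output", "file=chain_1.csv"]
    # (the executable and both file names are hypothetical).
    demo = [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_report(demo))
```

If a chain launched this way is reported as killed by a signal while another job on the same node finishes, that would point away from the R dispatch layer and toward something outside the process, such as the scheduler's job cleanup.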
