Race conditions between independent cmdstan model runs

Dear all,

I am experiencing what looks very much like a race condition / interaction between independent CmdStan model runs (via cmdstanr) when fitting several models on an HPC cluster. The situation is as follows:

  • When I run one model at a time, everything works fine (even with several chains in parallel).
  • When several models run at the same time (submitted as independent jobs to the HPC cluster), some of the chains in some models crash during sampling (Chain X finished unexpectedly!).
  • However, this only occurs if several models run simultaneously on the same compute node. If the models run on separate nodes, everything works fine.

The puzzle to me is how this apparent interaction between independent runs could possibly occur. Some more details:

  • If the chains are run sequentially and one chain crashes, the following chains may still finish successfully. If the chains are run in parallel, however, all of them crash at the same time.
  • I ensure that the runs have independent output directories, and I request enough resources, in particular plenty of temporary storage for the output directory. In any case, resources are requested per job, so there should be no resource competition between individual model runs…
  • This problem also occurs when running different models with different cmdstan executables and different input data simultaneously on the cluster.
  • When I inspect the output / log of a failed chain, there is no error message or any other diagnostic. It just stops.
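For reference, here is a minimal sketch of how the per-job isolation is set up on my end. This is a SLURM batch fragment with hypothetical paths and a hypothetical `fit_model.R` wrapper script, not my exact setup:

```shell
#!/bin/bash
#SBATCH --job-name=stan-model
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --tmp=20G

# Each job writes its CmdStan CSV output to its own scratch
# directory, keyed by the SLURM job ID, so no two runs ever
# share output files.
OUTDIR="${TMPDIR:-/tmp}/cmdstan_${SLURM_JOB_ID}"
mkdir -p "$OUTDIR"

# fit_model.R is a hypothetical wrapper that forwards $1 to
# cmdstanr's $sample(output_dir = ...) argument.
Rscript fit_model.R "$OUTDIR"
```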

Any ideas on what else I could look into are highly appreciated!

Thank you all in advance


Short update from my side: I found a workaround for this problem, which hints at a potential bug in CmdStan that is rather deeply buried. I containerized all the jobs on our HPC using Singularity. This didn’t solve the problem at first, because Singularity has some default bind paths: only after I explicitly excluded /proc from the bind paths did the interactions between independent jobs disappear. Now everything works perfectly and no chains crash.
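For anyone hitting the same issue, a sketch of the workaround as a cluster command fragment. This assumes Singularity ≥ 3.7, whose `--no-mount` flag can skip individual default system mounts; the image name and the `fit_model.R` wrapper are hypothetical placeholders:

```shell
# By default Singularity mounts the host's /proc into the container,
# so all jobs on a node can see each other's processes. Skipping that
# mount is what isolated the runs in my case.
singularity exec --no-mount proc model_image.sif \
    Rscript fit_model.R "$OUTDIR"

# A possible alternative with a similar effect: run the container in
# its own PID namespace, so its /proc only shows the container's
# own processes instead of the host's.
singularity exec --pid model_image.sif \
    Rscript fit_model.R "$OUTDIR"
```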

I find it notable that separating /proc between the jobs was necessary to solve the issue - maybe someone has an idea what could be going on behind the scenes? Given that the chains always crashed exactly when the first model on the compute node finished sampling, could there be some issue with a misdirected reference (pointing to the wrong Stan process) or similar?

Additional info: On the HPC, jobs are managed using SLURM.

perhaps this is an R problem - you’re running too many R jobs at once?

@mitzimorris I might have a misconception here, but so far I thought that because cmdstanr just calls CmdStan, crashes during sampling should not be related to R…?

crashes during sampling might be related to the R library used to dispatch the processes that run CmdStan. @rok_cesnovar ?