Dear all,
I am experiencing what looks a lot to me like some kind of race condition / interaction between independent `cmdstan` model runs (via `cmdstanr`) when fitting several models on an HPC cluster. The story is as follows:
- When I run one model at a time, everything works fine (even with several chains in parallel).
- When several models run at the same time (submitted as independent jobs to the HPC cluster), some of the chains in some models crash during sampling (`Chain X finished unexpectedly!`).
- However, this only occurs if several models run simultaneously on the same compute node. If the models run on separate nodes, everything works fine.
The riddle to me is how this apparent interaction between models could possibly occur. Some more details:
- If the chains are run sequentially and one chain crashes, the following chains may still finish successfully. If the chains are run in parallel, however, all of them crash at the same time.
- I ensure that the runs have independent output directories, and I request enough resources, in particular plenty of temporary storage for the output directory. In any case, resources are requested per job, so there should be no resource competition between individual model runs…
- This problem also occurs when running different models with different `cmdstan` executables and different input data simultaneously on the cluster.
- When I inspect the output / log of a failed chain, there is no further error message etc. It just stops.
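To make "independent output directories" concrete, here is a minimal sketch of how each job could get its own directory before sampling. This is an illustration only, not the actual job script: the helper `make_run_dir`, the tag `"model_a"`, and the objects `mod` and `stan_data` in the commented-out call are hypothetical placeholders.

```r
# Hypothetical helper: create a directory that is unique to this R process,
# so that two jobs on the same node cannot write to the same location.
make_run_dir <- function(base = tempdir(), tag = "model") {
  # tempfile() only generates an unused path; the PID is added to the name
  # as an extra safeguard across concurrently running R processes.
  run_dir <- tempfile(pattern = paste0(tag, "_", Sys.getpid(), "_"), tmpdir = base)
  dir.create(run_dir, recursive = TRUE)
  run_dir
}

out_dir <- make_run_dir(tag = "model_a")

# Assumed usage with cmdstanr (mod and stan_data are placeholders):
# fit <- mod$sample(
#   data = stan_data,
#   output_dir = out_dir,
#   chains = 4,
#   parallel_chains = 4
# )
```

With a setup along these lines, the CSV output of each model run ends up in a separate directory, so a clash between output files of different jobs would not be expected.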
Any ideas on what else I could look into are highly appreciated!
Thank you all in advance
Adrian