CmdstanR models fail on HPC cluster when running concurrently on the same node

This is more of a discussion topic than an error I’m looking for help with. I’ve been fitting models with both cmdstanr and rstan on an HPC cluster. I often submit 5 jobs to the cluster at the same time for 5-fold cross-validation, and usually for multiple models too. If I use the SLURM defaults when submitting jobs, these models sometimes end up running on the same node.

There is no problem when I do this with rstan, but I’ve noticed that with cmdstanr, models sharing a node fail when they reach the end of the job. I think the failure happens at the write step. One of the models running on the node completes successfully, presumably whichever one reaches the write step first.

I can get around this quite easily by adding #SBATCH --exclusive to my SLURM script, which only allows one job per node. But this doesn’t scale well for me on my organisation’s small on-prem cluster, when I need to fit (and cross-validate) a few dozen different models concurrently as part of an ensemble.
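A minimal sketch of an alternative to `--exclusive`, under the unconfirmed assumption that the failures come from concurrent jobs colliding on temporary files: cmdstanr writes its CSV output to R's `tempdir()` unless `output_dir` is set, and R picks its temporary directory from the `TMPDIR` environment variable at startup, so each job can be given its own scratch directory. The function name `make_job_tmpdir`, the `stan_job_` prefix, the `#SBATCH` values and the script name `fit_model.R` are all hypothetical. With Apptainer this also assumes that the host environment and `/tmp` are passed into the container.

```shell
#!/bin/bash
#SBATCH --job-name=cv-fold
#SBATCH --cpus-per-task=5

# Create a private scratch directory for this job and export it as TMPDIR.
# Assumption: the failures are caused by jobs sharing temporary files.
make_job_tmpdir() {
    local base="${1:-/tmp}"
    # SLURM_JOB_ID is set by SLURM; fall back to the shell PID outside SLURM.
    local job_id="${SLURM_JOB_ID:-$$}"
    TMPDIR="$(mktemp -d "${base}/stan_job_${job_id}_XXXXXX")" || return 1
    export TMPDIR
}

make_job_tmpdir /tmp || exit 1

# Remove the scratch directory when the job script exits.
job_tmp="$TMPDIR"
trap 'rm -rf "$job_tmp"' EXIT

# Hypothetical model-fitting script; R started from here inherits TMPDIR.
# Rscript fit_model.R
```

Anything that has to outlive the job, such as the fitted object or the CSV files, would need to be saved elsewhere before the script exits, because the trap deletes the scratch directory.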

Does anyone know if this is expected behaviour given what they understand about how rstan and cmdstan write output? I thought it interesting that it was only happening with cmdstanr and not rstan.

The output I get from the failed models is:

```
Compiling Stan program...
Start sampling
Running MCMC with 5 parallel chains...

Warning: Chain 1 finished unexpectedly!

Warning: Chain 2 finished unexpectedly!

Warning: Chain 3 finished unexpectedly!

Warning: Chain 4 finished unexpectedly!

Warning: Chain 5 finished unexpectedly!

Warning: Use read_cmdstan_csv() to read the results of the failed chains.
Error in rstan::read_stan_csv(out$output_files()) :
  csvfiles does not contain any CSV file name
In addition: Warning messages:
1: All chains finished unexpectedly!

2: No chains finished successfully. Unable to retrieve the fit.
```


Operating System:  Linux (on Apptainer)
Interface Version:  CmdStan 2.32.2 (and rstan 2.21.8)

We have the same problem (and I have seen multiple threads with similar problems) but no solution. One guess is that it might be the Intel compiler (our cluster only offers R paired with the Intel compiler), but my ticket has now been open for half a year or so, and so far they apparently haven’t found the problem.
https://twitter.com/dan_p_simpson/status/1571705560634105857 might point to a potential solution.