Hi all,
I have a general question that doesn’t seem directly related to the quality of my Stan code. I’m reaching out to see if others have experienced similar issues and to brainstorm possible causes.
I am fitting an ODE model to biological data using cmdstanr. The run uses four chains with 1,000 iterations each. All four chains start smoothly, but after a certain point only two chains keep reporting iterations, and eventually only one chain remains active. Once the message “Chain 3 finished in xxxx.x seconds” appears in the terminal (or, more generally, “Chain n finished in xxxx.x seconds”, where n is the number of the last active chain in that particular run), nothing else happens. The remaining chains neither report further progress nor show any error or interruption. Note that this is not the common “Chains finish unexpectedly” error, because the chains are (at least according to htop) still running!
Interestingly, yesterday I ran the exact same code multiple times on the same server using the same priors and input data, and the simulation completed successfully for all chains only one of those times. However, when I tried again today, the problem reoccurred.
What I have tried already:
- Running the simulation on different servers
- Using different seeds (both fixed and randomly generated)
- Checking htop: all four parallel processes are still running, even though the usual "Chain x Iteration: " messages are no longer printed in the terminal
- Running the chains sequentially instead of in parallel: chain 1 simply “froze” at a certain iteration and nothing more was printed
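In case it helps anyone reproduce the htop observation more systematically, here is a minimal sketch of a stall check in Python. It is not something from my actual setup. It assumes each chain writes its draws to a CSV file in a known directory (for example one set through cmdstanr's `output_dir` argument) and that a chain that is still making progress keeps touching its file. The helper name `find_stalled_files` and the idle threshold are hypothetical.

```python
import os
import time
from pathlib import Path


def find_stalled_files(output_dir, max_idle_seconds, now=None):
    """Return names of CSV files in output_dir that have not been
    modified for more than max_idle_seconds.

    Assumption: one CSV file per chain, updated while the chain samples.
    """
    now = time.time() if now is None else now
    stalled = []
    for path in sorted(Path(output_dir).glob("*.csv")):
        idle = now - path.stat().st_mtime  # seconds since last write
        if idle > max_idle_seconds:
            stalled.append(path.name)
    return stalled
```

Calling `find_stalled_files("path/to/output", 600)` while the processes are still listed in htop would list the files that have not changed in the last ten minutes. A chain that has already finished will also show up as idle, and output may be buffered, so this only narrows down which chains stopped writing; it does not explain why.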
Has anyone encountered a similar issue or have insights into what might be causing this behavior?
Please, assuming that the code is correctly set up, help me brainstorm and find possible causes and solutions to this problem.
Thank you!
Start of the simulation:
“End” of the simulation: