Only one chain finished, the others are "frozen"

Hi all,

I have a general question that doesn’t seem directly related to the quality of my Stan code. I’m reaching out to see if others have experienced similar issues and to brainstorm possible causes.

I am fitting an ODE model to biological data using cmdstanr. The run uses four chains with 1,000 iterations each. All four chains start smoothly, but after a certain point only two chains keep reporting iterations, and eventually only one remains active. Once the message “Chain 3 finished in xxxx.x seconds” appears in the terminal (or, more generally, “Chain n finished in xxxx.x seconds”, where n is the number of the only chain still active in that particular run), nothing else happens. The remaining chains neither continue nor report any interruption. This is not the common “Chains finish unexpectedly” error, because the chains are (at least according to htop) still running!

Interestingly, yesterday I ran the exact same code multiple times on the same server using the same priors and input data, and the simulation completed successfully for all chains only one of those times. However, when I tried again today, the problem reoccurred.

What I have tried already:

  • Running the simulation on different servers
  • Using different seeds (both fixed and randomly generated)
  • Checking htop: all four parallel processes are still running, even though the usual "Chain x Iteration: " messages no longer appear in the terminal
  • Running the chains sequentially: chain 1 simply “froze” at a certain iteration and nothing more was displayed

Has anyone encountered a similar issue, or does anyone have insights into what might be causing this behavior?

Please, assuming that the code is correctly set up, help me brainstorm and find possible causes and solutions to this problem.

Thank you!

Start of the simulation:

“End” of the simulation:

Can you share the code you’re running in more detail?

Have you tested that the ODE integrates successfully throughout the parameter space implied by your priors? When I’ve run into problems like this in the past, this has been the culprit, especially if the ODE is stiff and you’re using a non-stiff solver.

The ODE system is stiff, but I am using a stiff solver (ode_bdf). The system is well defined in the parameter space implied by the priors, and I have already successfully solved it for different combinations of parameters. If what you suggest were the case, wouldn’t the model calibration just give warnings? That is at least my experience when choosing troublesome priors.

I wasn’t necessarily implying that the system is ill-defined in some region. A system can be well-defined but so numerically intractable that the integration becomes impractically slow. At least in my case, this only occurred for subspaces of the parameter space that were very unlikely, so I was able to improve the situation with more informative priors.
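To illustrate the point with a toy sketch (plain Python, not Stan, and not the model from this thread): the code below integrates y' = -k*y with a hand-written adaptive Euler/Heun pair and counts how many steps the solver attempts for each value of k drawn from a prior. The equation, the lognormal prior, the tolerance, and the function name `count_steps` are all assumptions made up for illustration. Stan's `ode_bdf` is an implicit solver, so its cost grows for different reasons, but the kind of check is the same: draw parameters from the prior, solve, and look for draws where the work explodes.

```python
import math
import random

def count_steps(k, t_end=10.0, tol=1e-3):
    """Integrate y' = -k*y, y(0) = 1 on [0, t_end] with an adaptive
    Euler/Heun pair. Returns (y_end, number_of_attempted_steps)."""
    t, y, h, steps = 0.0, 1.0, 1e-3, 0
    while t < t_end:
        h = min(h, t_end - t)             # do not step past the end point
        k1 = -k * y
        k2 = -k * (y + h * k1)
        y_low = y + h * k1                # Euler (order 1)
        y_high = y + 0.5 * h * (k1 + k2)  # Heun (order 2)
        err = abs(y_high - y_low)         # local error estimate
        allowed = tol * (1.0 + max(abs(y), abs(y_high)))
        steps += 1
        if err <= allowed:                # accept the step
            t += h
            y = y_high
        # Standard step-size controller, with growth and shrink limits.
        factor = 0.9 * math.sqrt(allowed / err) if err > 0.0 else 2.0
        h *= min(2.0, max(0.2, factor))
    return y, steps

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical prior on k: lognormal(0, 3), chosen only for illustration.
    draws = sorted(rng.lognormvariate(0.0, 3.0) for _ in range(20))
    for k in draws:
        _, steps = count_steps(k)
        print(f"k = {k:10.4g}   attempted steps = {steps}")
```

Every draw here gives a perfectly well-defined ODE, yet the step count for the largest values of k is orders of magnitude higher than for typical ones. If a sampler wanders into such a region, a chain can look frozen while it is in fact still integrating.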

I’m just speculating based on a similar issue I’ve had in the past. You’ll probably get more informed feedback if you provide a minimal example that reproduces this behavior.

Okay, from how you phrased it I thought you meant the system was ill-defined. Then you are probably right. I finally solved the “frozen” chains problem by increasing the maximum step size and lowering the tolerance for the ode_bdf solver. Thank you for your feedback!
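For readers landing here later: in Stan these controls are set through `ode_bdf_tol`, which takes `relative_tolerance`, `absolute_tolerance`, and `max_num_steps` after the output times, and the Stan documentation describes `max_num_steps` as a way to stop a runaway simulation. The exact values used in this thread were not posted. The sketch below is only a toy illustration in plain Python (not Stan) of how a tolerance and a step budget interact. The solver, the test equation y' = -k*y, and the names `integrate`, `steps_or_none`, and `StepBudgetExceeded` are hypothetical.

```python
import math

class StepBudgetExceeded(RuntimeError):
    """Raised when the toy solver needs more steps than allowed."""

def integrate(k, tol, max_num_steps, t_end=10.0):
    """Adaptive Euler/Heun integration of y' = -k*y, y(0) = 1, that gives up
    once max_num_steps attempts have been made. Returns (y_end, steps)."""
    t, y, h, steps = 0.0, 1.0, 1e-3, 0
    while t < t_end:
        if steps >= max_num_steps:
            raise StepBudgetExceeded(
                f"gave up at t = {t:.4g} after {steps} steps")
        h = min(h, t_end - t)
        k1 = -k * y
        k2 = -k * (y + h * k1)
        y_low = y + h * k1                # Euler (order 1)
        y_high = y + 0.5 * h * (k1 + k2)  # Heun (order 2)
        err = abs(y_high - y_low)
        allowed = tol * (1.0 + max(abs(y), abs(y_high)))
        steps += 1
        if err <= allowed:                # accept the step
            t += h
            y = y_high
        factor = 0.9 * math.sqrt(allowed / err) if err > 0.0 else 2.0
        h *= min(2.0, max(0.2, factor))
    return y, steps

def steps_or_none(k, tol, max_num_steps):
    """Return the step count, or None if the budget ran out. A sampler could
    treat None as 'reject this parameter draw' instead of waiting forever."""
    try:
        return integrate(k, tol, max_num_steps)[1]
    except StepBudgetExceeded:
        return None

if __name__ == "__main__":
    for k in (1.0, 1e4):
        for tol in (1e-2, 1e-6):
            print(f"k = {k:g}, tol = {tol:g}:",
                  steps_or_none(k, tol, max_num_steps=1000))
```

Two things are visible in this toy setting: a looser tolerance reduces the number of steps for an easy parameter value, and a finite step budget turns an apparently endless integration into a fast, explicit failure that the caller can handle.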