Only one chain finished, the others are "frozen"

ditilliof · November 21, 2024, 1:49pm

Hi all,

I have a general question that doesn’t seem directly related to the quality of my Stan code. I’m reaching out to see if others have experienced similar issues and to brainstorm possible causes.

I am fitting an ODE model to biological data using cmdstanr. The run uses four chains with 1,000 iterations each. All four chains start smoothly, but after a certain point only two chains keep iterating, and eventually only one chain remains active. Once the message “Chain 3 finished in xxxx.x seconds” appears in the terminal (or, more generally, “Chain n finished in xxxx.x seconds”, where n is the number of the only chain still active in that run), nothing else happens. The remaining chains neither continue nor report any interruption. This is not the common “Chains finish unexpectedly” error, because the chains are (at least according to htop) still running!

Interestingly, yesterday I ran the exact same code multiple times on the same server using the same priors and input data, and the simulation completed successfully for all chains only one of those times. However, when I tried again today, the problem reoccurred.

What I have tried already:

Running the simulation on different servers
Using different seeds (both fixed and randomly generated)
I checked htop and all four parallel processes are still running, even though the usual "Chain x Iteration: " messages no longer appear in the terminal
I tried to run the simulation in series, and chain 1 just “froze” at a certain iteration and nothing was displayed anymore.

Has anyone encountered a similar issue or have insights into what might be causing this behavior?

Please, assuming that the code is correctly set up, help me brainstorm and find possible causes and solutions to this problem.

Thank you!

[Screenshot: start of the simulation]

[Screenshot: “end” of the simulation]

kaskogsholm · November 21, 2024, 3:44pm

Can you share the code you’re running in more detail?

Have you tested that the ODE integrates successfully throughout the parameter space implied by your priors? When I’ve run into problems like this in the past, this has been the culprit, especially if the ODE is stiff and you’re using a non-stiff solver.

ditilliof · November 22, 2024, 4:21pm

The ODE system is stiff, but I am using a stiff solver (ode_bdf). The system is well defined in the parameter space implied by the priors, and I have already successfully solved it for different combinations of parameters. If what you suggest were the case, wouldn’t the model calibration then just give warnings? That is at least my experience when choosing troublesome priors.

kaskogsholm · November 22, 2024, 8:58pm

I wasn’t necessarily implying that the system is ill-defined in some region. A system can be well-defined but so numerically intractable that the integration becomes impractically slow. At least in my case, this only occurred for subspaces of the parameter space that were very unlikely, so I was able to improve the situation with more informative priors.

I’m just speculating based on a similar issue I’ve had in the past. You’ll probably get more informed feedback if you provide a minimal example that reproduces this behavior.
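The kind of check suggested above (does the ODE integrate in reasonable time everywhere the prior puts mass?) can be sketched outside Stan. The following is a minimal Python illustration, not the poster's model: the test equation y' = -k (y - cos t), the lognormal prior on k, and the helper names `steps_to_integrate` and `failing_draws` are all assumptions made up for this example. It uses a small explicit adaptive integrator rather than a BDF method, so only the qualitative picture carries over: some prior draws are well defined but need far more solver steps than others, and a step budget makes those draws visible instead of letting them stall.

```python
import math
import random


def steps_to_integrate(k, t_end=1.0, tol=1e-4, max_num_steps=5000):
    """Integrate y' = -k * (y - cos(t)), y(0) = 1, with an adaptive Heun-Euler pair.

    Returns the number of attempted steps, or None if the step budget runs
    out before t_end is reached. Larger k makes the toy problem stiffer.
    """
    t, y, h = 0.0, 1.0, 1e-3
    for attempt in range(1, max_num_steps + 1):
        last = h >= t_end - t          # this step would reach the end point
        if last:
            h = t_end - t
        k1 = -k * (y - math.cos(t))
        k2 = -k * (y + h * k1 - math.cos(t + h))
        y_new = y + 0.5 * h * (k1 + k2)   # second-order (Heun) update
        err = 0.5 * h * abs(k2 - k1)      # difference to the Euler update
        limit = tol * (1.0 + abs(y_new))  # mixed absolute/relative tolerance
        if err <= limit:                  # accept the step
            if last:
                return attempt
            t, y = t + h, y_new
        # Standard step-size controller, with growth and shrink limits.
        scale = 0.9 * math.sqrt(limit / err) if err > 0.0 else 2.0
        h *= min(2.0, max(0.2, scale))
    return None


def failing_draws(ks, **solver_args):
    """Return the parameter values whose integration exhausted the step budget."""
    return [k for k in ks if steps_to_integrate(k, **solver_args) is None]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical wide prior: log(k) ~ Normal(0, 6).
    ks = [rng.lognormvariate(0.0, 6.0) for _ in range(50)]
    bad = failing_draws(ks, max_num_steps=2000)
    print(f"{len(bad)} of {len(ks)} prior draws exhausted the step budget")
```

If a noticeable fraction of prior draws exhausts the budget, that points toward either a more informative prior or different solver settings, which is the trade-off discussed in this thread.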

ditilliof · November 23, 2024, 8:32am

Okay, I thought you meant ill-definition from how you phrased it. Then you are probably right. I finally solved the “frozen” chains problem by increasing the maximum step size and lowering the tolerance for the ode_bdf solver. Thank you for your feedback.
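For readers wondering how tolerance and a step limit interact, here is a rough Python sketch. It is a toy explicit integrator on a made-up test equation, not CVODES or the poster's model, and `count_steps` and `StepBudgetExceeded` are hypothetical names. The parameter names loosely mirror the relative tolerance, absolute tolerance, and `max_num_steps` arguments that Stan's `ode_bdf_tol` accepts. The sketch shows two things: a tighter tolerance costs more steps, and a finite step budget turns an apparently endless integration into a prompt error.

```python
import math


class StepBudgetExceeded(RuntimeError):
    """Raised when the integrator runs out of steps before reaching t_end."""


def count_steps(k, tol, max_num_steps=100_000, t_end=1.0):
    """Count attempted steps for y' = -k * (y - cos(t)), y(0) = 1.

    Uses an adaptive Heun-Euler pair with a single tolerance applied as
    tol * (1 + |y|). Raises StepBudgetExceeded if max_num_steps is used up.
    """
    t, y, h = 0.0, 1.0, 1e-3
    for attempt in range(1, max_num_steps + 1):
        last = h >= t_end - t          # this step would reach the end point
        if last:
            h = t_end - t
        k1 = -k * (y - math.cos(t))
        k2 = -k * (y + h * k1 - math.cos(t + h))
        y_new = y + 0.5 * h * (k1 + k2)
        err = 0.5 * h * abs(k2 - k1)   # local error estimate
        limit = tol * (1.0 + abs(y_new))
        if err <= limit:               # accept the step
            if last:
                return attempt
            t, y = t + h, y_new
        scale = 0.9 * math.sqrt(limit / err) if err > 0.0 else 2.0
        h *= min(2.0, max(0.2, scale))
    raise StepBudgetExceeded(f"gave up after {max_num_steps} steps at t={t:.3g}")


if __name__ == "__main__":
    # Same problem, progressively tighter tolerance.
    for tol in (1e-3, 1e-5, 1e-7):
        print(f"tol={tol:g}: {count_steps(10.0, tol)} steps")
    # A very stiff parameter value with a small budget fails fast.
    try:
        count_steps(1e8, 1e-6, max_num_steps=1000)
    except StepBudgetExceeded as exc:
        print("budget exceeded:", exc)
```

Which direction to move the settings depends on the model. The sketch only illustrates the mechanics, not a recommendation for specific values.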
