Last chain hangs during fitting

Hi,

I’m using 4 chains in sampling() to infer the internal parameters of a model. The first three chains finish in about the same time, while the last chain takes about 10 times longer. The final fit result has a very large R_hat.
If I use 5 chains, then the first 4 finish in about the same time, while the 5th one takes 10 times longer. Again, very large R_hat.
In general, if I use n chains, n-1 chains finish in about the same time, while the last takes much longer.

One way to fix this is to generate a new set of data; there is some randomness in the data, so a new set can result in a cleaner fit. But my question is not about how to fix this. I would like to better understand the underlying processes in Stan that lead to such behavior. I'm basically wondering what could cause all chains but the last one to finish properly in time.

How do you know which chain takes the longest time? Stan usually runs the chains in parallel, so the last chain to finish is, tautologically, the one that takes the longest. If you set seed=12345, each run should give exactly the same results, and chain_id=[1,2,3] instead of chains=3 lets you select which chains to run. After each run Stan does some postprocessing that can take a while; you can disable it with check_hmc_diagnostics=False.
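For concreteness, a minimal sketch of those options in PyStan 2 (sm, stan_data, and model_code are placeholders for your compiled model, data dict, and Stan program):

```python
import pystan

# Compile once; model_code is a placeholder for your Stan program.
sm = pystan.StanModel(model_code=model_code)

# A fixed seed makes repeated runs reproducible, and chain_id lets you
# re-run specific chains to isolate the one that misbehaves.
fit = sm.sampling(
    data=stan_data,
    seed=12345,                   # identical results on every run
    chain_id=[1, 2, 3],           # select these chains instead of chains=3
    check_hmc_diagnostics=False,  # skip the post-run diagnostics pass
)
```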

A large R-hat suggests your model is multimodal or pathological in some other way. The one bad chain may be getting stuck somewhere in parameter space where the model fits very poorly and the posterior has complicated geometry.

Thank you for clarifying that.

How do you know which chain takes the longest time?

Well, the output shows the progress of each chain (i.e. Iteration n/N), so in the case of chains=4, the first three finish within a few seconds while the last one takes upwards of 1000 seconds. Also, I am running my 4 chains on 4 CPUs, so I believe each chain gets an equal share of processing power.

So by "last chain" you just mean the chain that takes the longest to finish. Each chain has a random initialization, so it's a bit odd that exactly one chain gets stuck, but I assume you didn't run 40 chains just to see how many of them get stuck. If the chance of a chain getting stuck is low, it would sometimes seem that everything is fine, and other times you'd get one misbehaving chain. That's all I can say without more information about your model.
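If you did want to probe that, something along these lines would do it (a rough sketch, again assuming PyStan 2 with the sm and stan_data placeholders from above):

```python
import numpy as np

# Run many chains with a fixed seed and compare per-chain mean
# log-posteriors; a stuck chain usually sits at a clearly different level.
fit = sm.sampling(data=stan_data, chains=40, seed=12345, n_jobs=4)

draws = fit.extract(permuted=False)  # shape: (iterations, chains, parameters)
lp = draws[:, :, -1]                 # lp__ is appended as the last "parameter"
print(np.round(lp.mean(axis=0), 1))  # one number per chain
```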

Just another guess: it might be that the fast chains are getting stuck (at the same parameter values) and that the slow chain is actually sampling. It has happened to me before with a model that was barely identified. Some chains get stuck at certain values and "sample" very fast. True sampling (i.e. exploration of the full posterior), however, takes a long time because of a small step size. Have you plotted the different chains?
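For example (a minimal sketch assuming PyStan 2 plus ArviZ, where fit is the object returned by sampling()):

```python
import arviz as az
import matplotlib.pyplot as plt

# One trace per chain: a stuck chain shows up as a flat line pinned to a
# single value, while a chain that is truly sampling wanders around.
idata = az.from_pystan(posterior=fit)
az.plot_trace(idata)  # pass var_names=[...] to restrict to specific parameters
plt.show()
```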

It has happened to me before with a model that was barely identified.

I think this is what's happening to me as well. alpha is the parameter I'm trying to infer. With an alpha ~ normal(0,10) prior alone, the inference is poor, but with the same prior plus the bounded declaration real<lower=3,upper=7> alpha, the inference is spot on. I think the fast chains are simply getting stuck at a favorable, yet wrong, alpha value in parameter space.
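Concretely, the change was just the bounded declaration (a sketch with the Stan code embedded as a PyStan model string; the likelihood is elided since it's specific to my model):

```python
# The hard bounds keep every chain inside the identified region,
# so a fast chain can't settle at a wrong alpha.
model_code = """
parameters {
  real<lower=3, upper=7> alpha;  // was: real alpha;
}
model {
  alpha ~ normal(0, 10);         // same prior in both versions
  // ... likelihood for the data goes here ...
}
"""
```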