Last chain hangs during fitting

Hi,

I’m using 4 chains in sampling() to infer the internal parameters of a model. The first three chains finish in about the same time, while the last chain takes about 10 times longer. The final fit result has a very large R_hat.
If I use 5 chains, then the first 4 finish in about the same time, while the 5th one takes 10 times longer. Again, very large R_hat.
In general, if I use n chains, n-1 chains finish in about the same time, while the last takes much longer.

One way to fix this is to generate a new set of data; there is some randomness in the data, so a new set can result in a cleaner fit. But my question is not about how to fix this. I would like to better understand the underlying processes in Stan that lead to such behavior. I'm basically wondering what could cause all chains but the last one to finish properly in time.

How do you know which chain takes the longest time? Stan usually runs the chains in parallel, so the last chain to finish is, tautologically, the one that takes the longest. If you set seed=12345, each run should give exactly the same results, and chain_id=[1,2,3] instead of chains=3 lets you select which chains to run. After each run Stan does some postprocessing that can take a while; you can disable it with check_hmc_diagnostics=False.
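For concreteness, a minimal sketch of those options in PyStan 2 (sm, stan_data, and model_code are placeholders for your compiled model, data dict, and Stan program):

```python
import pystan

# Compile once; model_code is a placeholder for your Stan program.
sm = pystan.StanModel(model_code=model_code)

# A fixed seed makes repeated runs reproducible, and chain_id lets you
# re-run specific chains to isolate the one that misbehaves.
fit = sm.sampling(
    data=stan_data,
    seed=12345,                   # identical results on every run
    chain_id=[1, 2, 3],           # select these chains instead of chains=3
    check_hmc_diagnostics=False,  # skip the post-run diagnostics pass
)
```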

A large R-hat suggests your model is multimodal or pathological in some other way. The one bad chain may be getting stuck somewhere in parameter space where the model fits very poorly and the posterior has complicated geometry.

Thank you for clarifying that.

How do you know which chain takes the longest time?

Well, the output shows the progress of each chain (i.e. Iteration n/N), so in the case of chains=4, the first three finish within a few seconds while the last one takes upwards of 1000 seconds. Also, I am running my 4 chains on 4 CPUs, so I believe each chain gets an equal share of processing power.

So by "last chain" you just mean the chain that takes the longest to finish. Each chain has a random initialization, so it's a bit odd that exactly one chain gets stuck, but I assume you didn't run 40 chains just to see how many of them get stuck. If the chance of a chain getting stuck is low, it would sometimes seem that everything is fine, and other times you'd get one misbehaving chain. That's all I can say without more information about your model.
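If you did want to probe that, something along these lines would do it (a rough sketch, again assuming PyStan 2 with the sm and stan_data placeholders from above):

```python
import numpy as np

# Run many chains with a fixed seed and compare per-chain mean
# log-posteriors; a stuck chain usually sits at a clearly different level.
fit = sm.sampling(data=stan_data, chains=40, seed=12345, n_jobs=4)

draws = fit.extract(permuted=False)  # shape: (iterations, chains, parameters)
lp = draws[:, :, -1]                 # lp__ is appended as the last "parameter"
print(np.round(lp.mean(axis=0), 1))  # one number per chain
```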

Just another guess: it might be that the fast chains are getting stuck (at the same parameter values) and that the slow chain is actually sampling. It has happened to me before with a model that was barely identified. Some chains get stuck at certain values and "sample" very fast. True sampling (i.e. exploration of the full posterior), however, takes a long time because of a small step size. Have you plotted the different chains?
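For example (a minimal sketch assuming PyStan 2 plus ArviZ, where fit is the object returned by sampling()):

```python
import arviz as az
import matplotlib.pyplot as plt

# One trace per chain: a stuck chain shows up as a flat line pinned to a
# single value, while a chain that is truly sampling wanders around.
idata = az.from_pystan(posterior=fit)
az.plot_trace(idata)  # pass var_names=[...] to restrict to specific parameters
plt.show()
```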

It has happened to me before with a model that was barely identified.

I think this is what's happening to me as well. alpha is the parameter I'm trying to infer. With an alpha ~ normal(0,10) prior alone, the inference is poor, but with the same prior plus the bounded declaration real<lower=3,upper=7> alpha, the inference is spot on. I think the fast chains are simply getting stuck at a favorable, yet wrong, alpha value in parameter space.
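Concretely, the change was just the bounded declaration (a sketch with the Stan code embedded as a PyStan model string; the likelihood is elided since it's specific to my model):

```python
# The hard bounds keep every chain inside the identified region,
# so a fast chain can't settle at a wrong alpha.
model_code = """
parameters {
  real<lower=3, upper=7> alpha;  // was: real alpha;
}
model {
  alpha ~ normal(0, 10);         // same prior in both versions
  // ... likelihood for the data goes here ...
}
"""
```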