Multiple chains and posterior exploration

Hi!

I read in another thread that one could run more chains with fewer post-warmup iterations each to speed up posterior sampling.

But pushed to the extreme, this gives me a strange feeling: all things being equal, would a joint posterior distribution be explored as well with 200 chains computing 10 post-warmup iterations each as with 2 chains computing 1000 post-warmup iterations?
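
For concreteness, here is roughly what the two setups would look like in CmdStanPy (just a sketch; the model file name and the parallelism setting are placeholders, not something from this thread):

```python
# Sketch: the two configurations compared above, using CmdStanPy.
# 'model.stan' is a placeholder for whatever model is being fit.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")

# 2 chains x 1000 post-warmup iterations: 2000 total draws.
fit_long = model.sample(chains=2, iter_warmup=1000, iter_sampling=1000)

# 200 chains x 10 post-warmup iterations: also 2000 total draws.
# Note that each chain still pays the full warmup cost, so wall time
# only drops if enough chains actually run in parallel.
fit_wide = model.sample(chains=200, iter_warmup=1000, iter_sampling=10,
                        parallel_chains=8)
```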

My intuition says no, for we need multiple chains to detect when a chain only appears stationary because it is stuck in a particular region of the posterior space.

I guess there should be mathematical demonstrations of this kind of property. I hope I am not asking a trivial question, but I do not have enough background to find the information by myself :)

Lucas

You wouldn’t want 2 post-warmup iterations for anything like Rhat that calculates within-chain variance. In general, for a fixed number of total draws, having more chains and fewer draws per chain gives you a better chance of discovering that some of the chains are problematic due to difficulties with the posterior geometry. I don’t think the overall result is going to have better mixing until we pool adaptation information across chains.
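
To make the within-chain variance point concrete, here is a minimal numpy sketch of the classic split-Rhat computation (not the newer rank-normalized version Stan now reports). Each chain is split in half before comparing variances, so with 2 post-warmup draws per chain each split half contains a single draw and the within-chain variance W is not even defined:

```python
# Minimal split-Rhat sketch (classic Gelman-Rubin form), to show how the
# within-chain variance W enters the diagnostic.
import numpy as np

def split_rhat(draws):
    """draws: array of shape (n_chains, n_iter)."""
    n_chains, n_iter = draws.shape
    half = n_iter // 2
    # Split each chain in half so drift within a chain also shows up.
    split = draws[:, :2 * half].reshape(n_chains * 2, half)
    w = split.var(axis=1, ddof=1).mean()        # within-chain variance W
    b = half * split.mean(axis=1).var(ddof=1)   # between-chain variance B
    var_hat = (half - 1) / half * w + b / half
    return np.sqrt(var_hat / w)

rng = np.random.default_rng(1)
print(split_rhat(rng.normal(size=(2, 1000))))  # 2 chains x 1000 draws
print(split_rhat(rng.normal(size=(200, 10))))  # 200 chains x 10 draws
print(split_rhat(rng.normal(size=(200, 2))))   # W undefined -> nan
```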


Depending on the computing setup, we should not necessarily think of the total number of iterations, or even the total number of leapfrog steps, as a constant. If you have access to parallel computing, the alternative to 2 chains with 1000 iterations each could be 100 chains with 1000 iterations each.

@ldeschamps was suggesting 10, which is the number @betanalpha cited previously, but even that may be problematic for the reasons @betanalpha mentions: not being able to diagnose stuck chains effectively, which will result in a biased posterior.

Thanks all for the answers :)

My question was kind of conceptual: if one wants to reduce wall time by reducing iterations and multiplying chains, what are the trade-offs to consider? Keeping enough iterations to produce viable diagnostics and being able to spot stuck chains are two great suggestions!

I guess one suggestion would be to use map_rect instead, but multiplying chains might be more interesting if one has a complex model to fit with a relatively low number of observations, or if information transfer among cores is too expensive (how could it be?).


Excellent way of putting it!

The important question is then “how long does it take to produce viable diagnostics?” Diagnostics like Rhat are iteration-hungry: it takes a good number of effective samples to be able to resolve differences between the chains, which is why Rhat often misses pathologies. Diagnostics like divergences are more sensitive, but you still need reasonably long chains to get enough divergences to identify where in the model the pathology is manifesting.
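
As a rough illustration of the divergence point, assuming a CmdStanPy fit object like the ones sketched earlier in the thread, one can pull out the per-draw divergence indicators and count them per chain (method_variables is, to my knowledge, the CmdStanPy accessor for sampler diagnostics):

```python
# Sketch: count divergent transitions per chain from a CmdStanPy fit.
# `fit` is assumed to be a CmdStanMCMC object from model.sample(...).
div = fit.method_variables()["divergent__"]  # shape: (n_draws, n_chains)
per_chain = div.sum(axis=0).astype(int)
print("divergences per chain:", per_chain)
# With only ~10 post-warmup draws per chain, a pathological region may
# produce zero divergences in most chains purely by chance, so very
# short chains weaken this diagnostic as well.
```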

Ultimately you need to run each chain long enough to get reasonable expectation estimates, so by the time the diagnostics are really robust the chain will almost be long enough for your inferential goal! There is some wiggle room, however, and there is potential opportunity for moderate speed-ups with 2-10 chains. Any more than that should be considered only for improving diagnostics (more opportunities to randomly fall onto a pathology early), not for speeding up inferences.
