Evaluating parallelization performance

The point that ESS = 1000 is often overkill means that you care more about the time to reach an ESS target than about ESS/second after convergence.

ESS/second after convergence (even if that means ESS = 10 or some other stability target) is embarrassingly parallelizable, so if we only use that measure, we optimize the wrong thing.
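To make that concrete, here's a toy back-of-the-envelope sketch (all the numbers are made up) of why the two measures diverge: ESS/second after convergence just scales with the number of chains, while time to an ESS target has a floor at the per-chain warmup cost.

```python
# Toy arithmetic sketch with hypothetical numbers; not measurements.
warmup_s = 60.0      # seconds of warmup per chain (hypothetical)
ess_rate = 5.0       # ESS per second per chain after convergence (hypothetical)
target_ess = 100     # a modest stability target

for n_chains in (1, 4, 16, 64):
    # ESS/second after convergence is embarrassingly parallel:
    # run more chains and the rate just adds up.
    ess_per_sec = n_chains * ess_rate

    # Time to hit the ESS target is dominated by the serial warmup cost,
    # which no amount of parallelism removes.
    time_to_target = warmup_s + target_ess / (n_chains * ess_rate)

    print(f"{n_chains:3d} chains: {ess_per_sec:6.1f} ESS/s, "
          f"{time_to_target:6.1f} s to ESS={target_ess}")
```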

If the geometry is tricky, it’ll just take a lot more iterations to get to an ESS target—it shouldn’t change that target unless you distrust the ESS estimator.


Exactly. With the maturity of the sampler and increasing MPI/GPU functionality, we're reaching the point where the practical limitation is most often not ESS/second but rather warmup/adaptation and the structure of the target distribution (parameterization, priors) itself.

Regarding (1), from a user standpoint, do we have facilities to restart a set of chains (e.g., you started with 4 chains and now want to run 8 in parallel, or something like that)? Not sure if that's legal.

For (2), is there useful information chains could pass to one another? If not, then I agree we can keep focusing on gradient evaluation.

Thinking about this, though:

Instead of thinking about corralling everything back to the typical set, is there useful information we can draw from the bad chain(s) while the others are doing well?

Sort of. In most of the interfaces the user can grab the inverse metric, step size, and final points and restart chains in the same state. What cannot currently be done is restarting with the same PRNG state. Technically, restarting this way is fine.
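For example, in CmdStanPy this looks roughly like the sketch below. The file names are placeholders, and it assumes a diagonal metric and scalar parameters for simplicity (container parameters would need to be reassembled into arrays for the inits); the other interfaces expose the same pieces under different names.

```python
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")          # placeholder paths
fit = model.sample(data="data.json", chains=4)

# Adaptation results: per-chain step sizes and inverse metrics (diag_e).
step_sizes = fit.step_size          # shape (chains,)
inv_metrics = fit.metric            # shape (chains, dim)

# Last draw of each chain, turned into per-chain inits.
# Assumes scalar parameters only; sampler diagnostics end in "__".
draws = fit.draws()                 # (iter_sampling, chains, columns)
cols = fit.column_names
params = [c for c in cols if not c.endswith("__")]
inits = [
    {p: float(draws[-1, c, cols.index(p)]) for p in params}
    for c in range(draws.shape[1])
]

# Restart with more chains by recycling the adaptation round-robin;
# adaptation is switched off so the chains continue in the same state.
refit = model.sample(
    data="data.json",
    chains=8,
    iter_warmup=0,
    adapt_engaged=False,
    step_size=[float(step_sizes[c % 4]) for c in range(8)],
    metric=[{"inv_metric": inv_metrics[c % 4].tolist()} for c in range(8)],
    inits=[inits[c % 4] for c in range(8)],
)
```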

Yes. Under some criteria the ideal inverse metric equals the covariance of the target distribution, which is estimated during warmup. Chains could pool these individual estimates at certain intervals, producing more precise estimates earlier. A pooled estimator might also be able to pull in a stray chain whose estimate is poor because it got stuck early on in a bad corner of the typical set (which results in a wickedly low step size, hence poor exploration, and even worse covariance estimates; it's a vicious cycle). How relevant this is, however, depends on how often this poor adaptation happens when the chains should otherwise be well-behaved; see below.
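As a rough illustration (pure numpy, hypothetical function names, not how Stan's adaptation windows are actually wired up), pooling the per-chain covariance estimates and shrinking toward the identity could look something like:

```python
import numpy as np

def pool_inv_metrics(chain_covs, chain_ns, reg=1e-3):
    """Pool per-chain covariance estimates, weighting by window size,
    then shrink toward the identity in the spirit of Stan's windowed
    adaptation (hypothetical helper, not part of any interface)."""
    chain_covs = [np.asarray(c) for c in chain_covs]
    n = float(sum(chain_ns))
    dim = chain_covs[0].shape[0]
    pooled = sum(ni * ci for ni, ci in zip(chain_ns, chain_covs)) / n
    w = n / (n + 5.0)
    return w * pooled + (1.0 - w) * reg * np.eye(dim)

# Example: three well-adapted chains and one stuck chain whose own
# estimate has collapsed; the pooled estimate stays close to the truth.
good = [np.eye(3) * s for s in (0.9, 1.0, 1.1)]
stuck = np.eye(3) * 1e-4
print(np.diag(pool_inv_metrics(good + [stuck], [100, 100, 100, 100])))
```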

The thing you have to recognize is that if the Markov chains are exploring well, then they should all look equivalent. If one chain looks weird, it doesn't mean that that one chain is weird and the others are fine; it means that the others have yet to encounter the weirdness. In other words, inconsistencies between the chains indicate that the Markov chains aren't going to explore as much as they should. Incidentally, this is the logic behind Rhat.

This is complicated by the fact that early on the chains will be noisy, and there might be fluctuations that look like a difference but aren't (and might cause different adaptations between the chains). It's further complicated by the fact that if you run long enough, then all of the chains will find the pathology and start to look the same again, even though the sampling is not to be trusted (i.e., chains looking the same is a necessary but not sufficient condition). Hence there's a sweet spot where you run long enough for one or more, but not all, of the chains to find a pathology and register a difference. This is why Rhat isn't that sensitive to problems: the chains have to be run just long enough to see the problems but not so long that they all look the same again.
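For reference, the classic split-Rhat computation alluded to above fits in a few lines of numpy (this is the plain version, not the rank-normalized variant the newer diagnostics use): between-chain disagreement inflates the pooled variance estimate relative to the within-chain variance and pushes Rhat above 1.

```python
import numpy as np

def split_rhat(draws):
    """draws: array of shape (n_iterations, n_chains) for one quantity."""
    n, m = draws.shape
    half = n // 2
    # Split each chain in half so within-chain drift also registers.
    split = np.concatenate([draws[:half], draws[half:2 * half]], axis=1)
    n, m = split.shape
    chain_means = split.mean(axis=0)
    chain_vars = split.var(axis=0, ddof=1)
    W = chain_vars.mean()                      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(1)
same = rng.normal(size=(1000, 4))                   # chains agree
shifted = same + np.array([0.0, 0.0, 0.0, 3.0])     # one chain sits elsewhere
print(split_rhat(same), split_rhat(shifted))        # ~1.0 vs >> 1.0
```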

Anyways, no matter what we do we want to ensure that we don’t compromise our diagnostic capabilities.
