What's the intuition behind HMC iteration wall time and how it explores distributions?


#1

I’ve noticed that when running multiple chains, there will often be a few that are done quite fast while there’s one that takes much longer to run compared to the rest. In addition, the time interval each iteration requires seems to be approximately constant throughout the process.

I recall seeing a talk by Michael Betancourt where he explained that HMC finds the typical set of the distribution and then we get our samples as the algorithm traverses it. It’s also my understanding that the algorithm is able to adjust certain parameters on its own in order to be as efficient as possible.

Intuitively, I would assume that the difference in chain times could be attributed to different starting points, but I’m not sure why once a typical set is found it would still be showing such heterogeneity between chains and why iteration time wouldn’t be reduced as the algorithm “maps out” the space.


#2
  1. If you’re running multiple chains in parallel then one or more chains can be slowed down if your operating system is also running other processes. This is not uncommon if you run four chains and have only four cores on your machine.

  2. If there are neighborhoods of the typical set that are difficult to explore then many of the chains might miss those neighborhoods and explore the rest of the typical set quickly. The chains that do wander into the pathological neighborhoods, however, would spend much more time trying to escape them. This would indicate that none of the chains is fully exploring the target distribution.

  3. Some of the chains might have been initialized in particularly extreme parts of the parameter space, very far away from the typical set. There the adaptation routine might be stressed to the point where it breaks, causing the sampling performance to suffer.

(2) and (3) are indicated by differences in adaptation (step size and inverse metric/mass matrix elements), sampler behavior (number of leapfrog steps, divergences), or exploration (trace plots). If all of those are relatively uniform across the chains then it might be (1).
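
As a rough illustration of that check, here is a minimal Python sketch that compares a few per-chain summaries and flags the chains that stand out. Everything in it is an assumption made for the example: the helper names `flag_outlier_chains` and `compare_chains`, the 50% relative tolerance, and the numbers for the four chains are all made up, and you would substitute the adapted step size, mean number of leapfrog steps, and divergence count that your own interface reports for each chain.

```python
from statistics import median


def flag_outlier_chains(per_chain, rel_tol=0.5):
    """Return indices of chains whose value differs from the cross-chain
    median by more than rel_tol (as a fraction of the median).

    rel_tol=0.5 is an arbitrary threshold chosen for illustration.
    """
    med = median(per_chain)
    if med == 0:
        return [i for i, v in enumerate(per_chain) if v != 0]
    return [i for i, v in enumerate(per_chain)
            if abs(v - med) / abs(med) > rel_tol]


def compare_chains(step_size, mean_leapfrog, divergences, rel_tol=0.5):
    """Summarize which chains look different in adaptation or sampler behavior."""
    return {
        "step_size": flag_outlier_chains(step_size, rel_tol),
        "mean_leapfrog": flag_outlier_chains(mean_leapfrog, rel_tol),
        # Any divergence at all is worth a look.
        "divergences": [i for i, d in enumerate(divergences) if d > 0],
    }


# Hypothetical per-chain summaries for four chains (not real output).
step_size = [0.21, 0.19, 0.20, 0.02]      # adapted step size
mean_leapfrog = [15.0, 15.0, 17.0, 255.0]  # mean leapfrog steps per iteration
divergences = [0, 0, 0, 12]                # post-warmup divergences

report = compare_chains(step_size, mean_leapfrog, divergences)
looks_uniform = not any(report.values())

print(report)
print("uniform across chains:", looks_uniform)
```

In this made-up case the last chain has a much smaller step size, many more leapfrog steps per iteration, and divergences, which would point toward (2) or (3) rather than (1). Since each leapfrog step costs one gradient evaluation, a chain like that would also be the slow one in wall time. This does not replace looking at the trace plots.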