Profiling HMC

I’m curious which part of sampling accounts for the majority of the computation time. From my understanding of autodiff and the internals of HMC, I’ve reasoned as follows:

  • Log density gradient calculation - Doesn’t seem costly, since the cost of reverse-mode autodiff scales linearly with the size of the expression graph and isn’t affected by the number of input variables.
  • Integrating the Hamiltonian equations - Leapfrog integration should mean linear time complexity with respect to the number of timesteps.
  • Metropolis-Hastings correction - I think this would be the culprit, since a rejection means the work of the whole trajectory is wasted and we have to propose again until acceptance. For high-curvature posteriors that would mean more rejections on average, multiplying the cost of the entire process. (See the sketch just after this list.)
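
To make the three pieces concrete, here is a minimal sketch of one iteration of plain static HMC on a standard-normal target (not Stan’s dynamic sampler; all names and tuning values are illustrative):

```python
import numpy as np

def log_density(q):
    # Standard-normal target: log p(q) = -0.5 * q^T q (up to a constant)
    return -0.5 * np.sum(q ** 2)

def grad_log_density(q):
    # Gradient of the log density above: -q
    return -q

def hmc_iteration(q, step_size, n_leapfrog, rng):
    """One static-HMC transition: momentum refresh, leapfrog integration,
    Metropolis correction."""
    p = rng.normal(size=q.shape)
    current_h = -log_density(q) + 0.5 * np.sum(p ** 2)

    # Leapfrog integration: one gradient evaluation per step, so the cost
    # here is linear in n_leapfrog.
    q_new, p_new = q.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_density(q_new)
    for _ in range(n_leapfrog - 1):
        q_new += step_size * p_new
        p_new += step_size * grad_log_density(q_new)
    q_new += step_size * p_new
    p_new += 0.5 * step_size * grad_log_density(q_new)

    # Metropolis correction: larger integration error -> larger energy change
    # -> lower acceptance probability.  On rejection the chain simply keeps
    # its current state; the trajectory is not recomputed until acceptance.
    proposed_h = -log_density(q_new) + 0.5 * np.sum(p_new ** 2)
    accept = np.log(rng.uniform()) < current_h - proposed_h
    return (q_new, True) if accept else (q, False)

rng = np.random.default_rng(0)
q = np.zeros(2)
for _ in range(5):
    q, accepted = hmc_iteration(q, step_size=0.2, n_leapfrog=16, rng=rng)
    print(accepted, q)
```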

Are my insights correct? More specifically, does slow sampling occur because posteriors with complex geometry generally cause the leapfrog trajectory to diverge more from the true trajectory, leading to lower acceptance rates and therefore, on average, more proposals per accepted draw?

Simplifying greatly and ignoring all model-specific challenges, more complex posteriors require smaller HMC step sizes to achieve the target acceptance rate. Smaller step sizes mean more leapfrog steps per iteration, and with Stan’s dynamic HMC, which doubles the trajectory length with each unit of tree depth L, the computational cost per iteration is something like \mathcal{O}(2^{L - 1}) gradient evaluations. Most of the computational cost thus occurs within your second point, “Integrating the Hamiltonian equations”: whilst the computational cost is linear w.r.t. the number of leapfrog steps, the number of leapfrog steps is a very nonlinear function of “posterior complexity/curvature”.
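
A rough back-of-the-envelope illustration of that nonlinearity, assuming Stan’s tree-doubling dynamic HMC, where a trajectory of tree depth d contains 2^d - 1 leapfrog steps and halving the step size typically adds about one to the depth:

```python
# Illustration only (not a measurement): each unit of tree depth roughly
# doubles the number of leapfrog steps, i.e. gradient evaluations, per iteration.
for depth in range(1, 11):
    leapfrog_steps = 2 ** depth - 1
    print(f"tree depth {depth:2d} -> ~{leapfrog_steps:4d} gradient evaluations per iteration")
```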

Though, this being statistics, there are many exceptions to this intuition, making it hard to make any general statement. For “simple” models with huge quantities of data (large N), the posterior is usually not very geometrically interesting/curved/difficult. Nonetheless, HMC and other MCMC methods are expensive in this setting because the cost of evaluating both the log-posterior and its gradient is often at least \mathcal{O}(N).
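
For instance (a minimal sketch, using a trivially simple normal-mean model purely for illustration), every log-posterior and gradient evaluation has to touch all N observations, so the per-evaluation cost grows with N even though the posterior geometry stays benign:

```python
import numpy as np

def log_posterior_and_grad(mu, y, sigma=1.0, prior_sd=10.0):
    """Log posterior and gradient for y_i ~ Normal(mu, sigma), mu ~ Normal(0, prior_sd).
    Both the sum and its derivative run over every observation, so each
    evaluation is O(N) no matter how simple the geometry is."""
    resid = y - mu
    log_post = -0.5 * np.sum(resid ** 2) / sigma ** 2 - 0.5 * mu ** 2 / prior_sd ** 2
    grad = np.sum(resid) / sigma ** 2 - mu / prior_sd ** 2
    return log_post, grad

y = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=1_000_000)  # large N
print(log_posterior_and_grad(0.5, y))
```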

You’re correct that using just the final state can cause inefficiencies when the numerical trajectory happens to end with larger error, but Stan’s Hamiltonian Monte Carlo sampler does not consider just the final state in a numerical trajectory. Rather, it considers all of the intermediate states, which makes it much more robust to the choice of integration time.
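
A hedged sketch of that idea (Stan’s actual implementation builds the trajectory as a binary tree and samples progressively, so this only conveys the flavour): the next state is drawn from all states visited, weighted by exp(-H), so a single badly integrated end point carries little weight:

```python
import numpy as np

def sample_from_trajectory(states, hamiltonians, rng):
    """Draw the next chain state from every state in a numerical trajectory,
    weighted by exp(-H), instead of accepting/rejecting only the end point."""
    h = np.asarray(hamiltonians, dtype=float)
    weights = np.exp(-(h - h.min()))      # shift by the minimum for numerical stability
    probs = weights / weights.sum()
    return states[rng.choice(len(states), p=probs)]

rng = np.random.default_rng(0)
states = [np.array([0.1]), np.array([0.4]), np.array([0.9]), np.array([1.7])]
hamiltonians = [1.02, 1.01, 1.05, 3.50]   # the final state has a large integration error
print(sample_from_trajectory(states, hamiltonians, rng))
```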

In general the computational cost will be determined by the size of the model and how efficiently it is implemented, which determine the cost of each gradient evaluation, and by the geometry of the posterior distribution, which determines how many states, and hence how many gradient evaluations, are needed in each trajectory.
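
That split can be made explicit with a back-of-the-envelope estimate (the numbers below are invented purely for illustration):

```python
# Invented numbers, purely to show how the two factors multiply.
grad_eval_seconds = 0.002        # cost of one log-posterior + gradient evaluation (model size / implementation)
leapfrog_per_iter = 2 ** 5 - 1   # states per trajectory (posterior geometry)
iterations = 2000                # warmup + sampling iterations for one chain

total_seconds = grad_eval_seconds * leapfrog_per_iter * iterations
print(f"roughly {total_seconds:.0f} seconds per chain")   # ~124 s with these numbers
```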

If the numerical trajectories are fewer than 10 states then there’s very little optimization possible in the geometry, and almost all optimization opportunities will be in the implementation of the model (more efficient linear algebra and autodiff, parallelization, etc.). If the trajectories are much longer then changing the model itself will likely be more useful, if possible (more informative priors, adding auxiliary data, reparameterizations). That said, speeding up the model evaluations will always speed up the fit no matter the geometry.
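
As a rough rule-of-thumb check along those lines (a sketch; the `n_leapfrog` array below is assumed to hold the per-iteration leapfrog-step counts that Stan reports as `n_leapfrog__`, and the values here are made up):

```python
import numpy as np

# Assumed input: per-iteration leapfrog-step counts from the sampler output
# (Stan reports these as n_leapfrog__); the values here are invented.
n_leapfrog = np.array([7, 7, 15, 7, 15, 15, 7, 15, 31, 15])

if np.median(n_leapfrog) < 10:
    print("Short trajectories: focus on the model implementation "
          "(more efficient linear algebra/autodiff, parallelization).")
else:
    print("Long trajectories: the geometry dominates; consider reparameterization, "
          "more informative priors, or auxiliary data.")
```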
