Profiling HMC

Dashadower · December 22, 2022, 7:29am

I’m curious as to what portion of sampling constitutes the majority of the computation time. From my understanding of autodiff and the internals of HMC, I’ve reasoned about the following:

Log density gradient calculation - Doesn’t seem costly as time complexity scales linearly with expression tree height and backward autodiff doesn’t get affected by number of input variables.
Integrating the Hamiltonian equations - Leapfrog integration should mean linear time complexity with respect to # of timesteps
Metropolis Hastings correction - I think this would be the culprit since rejection means we redo all the steps again until acceptance. For high-curvature posteriors that would mean on average more rejections until acceptance, multiplicating the cost of the entire process.

Are my insights correct? More specifically, does slow sampling occur because posteriors with complex geometry generally cause the leapfrog result to diverge more from the true trajectory, leading to lower acceptance rates, which on average requires multiple proposals?

hhau · December 22, 2022, 8:18am

Simplifying greatly and ignoring all model-specific challenges, more complex posteriors require smaller HMC step sizes to achieve the target acceptance rate. Smaller step sizes mean more leapfrog steps (L) per-iteration and the computational cost per-iteration is something like \mathcal{O}(2^{L - 1}). Most of the computational cost thus occurs within your second point: “Integrating the Hamiltonian equations” – whilst the computational cost is linear w.r.t the number of leapfrog steps, the number of leapfrog steps is a very nonlinear function of “posterior complexity/curvature”.

Though, this being statistics, there are many exceptions to this intuition making it hard to make any kind of general statement. For “simple” models with huge quantities of data (large N), the posterior is usually not very geometrically interesting/curved/difficult. Nonetheless, HMC and other MCMC methods are expensive in this setting as the computational cost of computing both the log-posterior and the gradient thereof are often at least \mathcal{O}(N).

betanalpha · January 17, 2023, 5:15pm

You’re correct that just using the final state can cause inefficiencies when the numerical trajectory happens to end with larger error, but Stan’s Hamiltonian Monte Carlo sampler does not consider just the final state in a numerical trajectory. Rather it considers all of the intermediate states which makes is much more robust to the integration time.

In general the computation cost will be determined by the size of the model and how efficiently it is implemented, which determines the cost of the gradient evaluations, and the geometry of the posterior distribution, which determines how many states, and hence how many gradient evaluations, are needed in each trajectory.

If the numerical trajectories are less than 10 states then there’s very little optimization possible in the geometry and the almost all optimization opportunities will be in the implementation of the model (more efficient linear algebra and autodiff, parallelization, etc). If the trajectories are much longer then changing the model itself will likely be more useful if possible (more informative priors, adding auxiliary data, reparameterizations). That said speeding up the model evaluations will always speed up the fit not matter the geometry.

Topic		Replies	Views
HMC: Does precision of derivatives affect validity of results? Algorithms stan-math	4	939	April 26, 2022
What is the relationship between transition and iteration General	3	1780	April 12, 2021
Numerical errors...do we have some intuition on its severity? Developers	7	980	March 15, 2017
Tried to Speed Up Stan (HMC) by Rewriting Dynamics, failed Algorithms	2	675	June 17, 2020
Why is HMC faster than ADVI? General advi	4	504	January 8, 2023

Profiling HMC

Related topics