After reading some fun papers, I attempted to rewrite the Hamiltonian dynamics behind Stan’s version of HMC. The thought was that this rewriting requires less computational effort. The hope was that the reduced computational effort would decrease the overall sampling time. No dice.
Despite failing to reduce the sampling time of Stan’s version of HMC, I wrote up a report about my proposal. Sharing in case someone finds my attempts interesting. If nothing else, it was fun to play around with some of Stan’s internals.
Abstract
This technical report analyzes two mathematically equivalent versions of Hamiltonian dynamics within the Hamiltonian Monte Carlo (HMC) algorithm as implemented in the probabilistic programming language Stan. This report compares Stan’s HMC dynamics to an alternative writing of the same dynamics, both under the dense Euclidean metric. While the proposed reframing of Hamiltonian dynamics reduces the number of calculations within the HMC algorithm, there is no tangible reduction in the run time of the overall sampler for the tested target distributions. This is likely due to the fact that gradient evaluations of the target distributions dominate the run time.
5 Likes
Unfortunately the statistical literature on Hamiltonian Monte Carlo is largely a mess, which makes it very easy to be confused as to which ideas are new and which are old, and hence have already been considered.
The problem with papers like Ma et al and Tripuraneni et al is that they strive for generality without any understanding of whether or not that generality offers the potential for improved performance and, if so, how it could be identified in practice to fuel appropriate adaptation algorithms. It turns out that the actual potential for improvement is pretty narrow (it’s basically limited to the magnetic potentials considered in Tripuraneni et al) and nowhere near understood well enough to be useful in practice (magnetic potentials are really a way of incorporating additional symmetry into the Hamiltonian evolution, and whether or not the method is useful depends on the symmetries present in the target distribution). Fortunately there is a much richer theoretical foundation for these generalization that makes these issues more clear – look out for a new paper within the next few months…
Anyways, I say all of this is confusing because the_particular_ approach considered by @roualdes isn’t actually a generalization at all. It’s just a reparameterization of the momenta that’s been well-known and well-studied for a long time. For example Radford talks about it at the bottom of page 22 in https://arxiv.org/abs/1206.1901 and in fact it was used in some of the earliest versions of Stan.
Because it’s a reparameterization the discrete Hamiltonian dynamics will be exactly the same, at least up to floating point arithmetic. There’s often a misunderstanding that one form or another will be more or less numerically stable but ultimately they’re basically the same. @roualdes is correct that there are slightly different memory considerations but those differences are negligible relative to the overhead of the rest of the sampler. All things that were carefully considered in the early days of Stan.
3 Likes