Momentum Diffusion HMC?


#41

Metropolis by itself won’t, because it won’t be able to find the whole typical set. And if HMC never worked, then I would hardly propose to first try HMC and then metropolize it all. But the fact of the matter is in my real problems, HMC works some of the time, and gets stuck some of the time, and the getting stuck seems to be a function of the initial conditions and interaction of the chain with the adaptation phase…

So my proposal is don’t rely on a small number of HMC chains, start HMC from randomly distributed initial conditions in the prior set, and let HMC run approximately and hopefully quickly (because SPSA gradients, and we refuse to get stuck, and we’re not trying too hard to converge to the appropriate distro, just something in some sense close). Then sort the ensemble using a Maxwell Demon (namely, a high temperature and a low temperature ensemble, so we can rapidly reject the tail samples to the high temperature ensemble) then metropolize the resulting ensemble until it gives stationary inference on important quantities such as certain means, stddevs and soforth.

Note also that if the HMC approximation that refuses to get stuck and operates on many initial points gives us an ok approximation, then we can sort it using the Maxwell-Demon and then estimate from it the appropriate Metropolis transition sizes, and/or a mass matrix for further exact HMC. If this works as an alternative adaptation scheme for Stan, then great!

Now, the convergence properties guaranteed for Markov Chains are only guarantees of the long run properties of single chains. Geyer goes on about this “One Long Run” philosophy

http://users.stat.umn.edu/~geyer/mcmc/one.html

And Geyer is in some sense right. But with HMC many medium runs is not “i. i. d. sampling from a slightly fuzzed version of the starting distribution.” And it’s not true that as Geyer says “Many short runs is only ``better’’ than one long run when both give rotten answers.” This is true because stuckness only happens some of the time, but it does happen. And, because the motion of HMC is not “fuzz” like, it’s dramatic non-diffusive motion. If you tried to sample my scorpion distribution by randomly selecting initial positions and Metropolizing until you were blue in the face… forget it. But approximate momentum-diffusive HMC did fine, but it didn’t sample from the exact distribution. On many of my problems I can get Stan to give me 1 or 2 chains in half a day that both give similar inference while 2 other chains do something unfortunate. If that works, then perhaps I could give Stan or something else say 10 minutes 100 times, and assort the resulting ensemble into high temp and low temp replicas, and then use the resulting information to estimate an appropriate mass matrix, choose 4 starting points, and send Stan to give me a perfect sample with no stuckness every time! That’d be great. Soon we’ll have the ability to input a mass matrix. Maybe this will solve my problems.

Stuckness seems to be a property of the mapping between random initial point, the weirdness of certain models, and Stan’s sophisticated attempts to warm-up. I could say that my proposals here are all about making inference fail more ductile-ly. When Stan gives me the right answer, it’s a fantastic answer, but when it fails, it just tells me nothing after 10 hours of computing and 2 KwH of electricity. I’ve learned to monitor convergence by watching the CSV files, and just killing the thing if it crashes in step-size and spits out divergences.

In any case, I think with all MCMC ideas the proof is in the pudding. If an idea consistently gives good results and we don’t have a theory for why, then we can work on the theory and learn something. If an idea seems to the inventor like it aught to do well, and the inventor doesn’t try it out on real world problems and show that it does work we never know really. If it seems like it should work, and you try it, and it doesn’t work… then you learn something too.

My proposal is really about using HMC as a search technique without demanding that it give samples from the correct distro, in exchange for ensuring that it doesn’t get stuck because you adaptively choose what to do… But adaptive methods don’t all give samples from the right distribution. But if that’s what you do, then how do you get closer to the correct distro? Metropolis on an initial ensemble with parallel tempering doesn’t seem like a bad idea, it’s a kind of adaptive importance sampling. It has the property that it’s got a mathematical guarantee of convergence.

In papers and things I’ve read about adaptive MCMC they mention the following proof: if you have an adaptive process, and you feed its output into a non-adaptive MCMC process, then the stationary distribution is the appropriate distribution. This follows because the stationary distribution of the final process is the appropriate distribution regardless of the starting point.

What HMC is great at is:

  1. Finding the typical set
  2. Moving around in the part that isn’t too curvy for its current step size

If you start from lots of initial points, adapt the step size a bit, relax the energy conservation, resample momentum when you hit a divergence, etc etc. it all breaks the convergence guarantee… But does it break property (1) and (2)? Not really, or at least it doesn’t have to according to at least 2 models I’ve tried it on. The scorpion one above, and a model where I have 1000 time points observation of a function with noise, and 3 measurements at each time point. This 1002 dimensional model where dimension 1 and 2 (the description of the function) creates correlations in all 1000 other dimensions, finds me points in the typical set after a single trajectory of 5000 approximate HMC steps, taking something like 10 seconds.

So, to be clear I don’t think this proposal replaces RM-HMC, nor do I think it lets us sample really weird curving posterior distributions with 1000 dimensional highly twisting manifolds, but I strongly suspect it could improve inferences on problems where currently there’s a ~ 25-50% chance your chain will get stuck during adaptation compared to running 4 chains and hoping none get stuck.