@betanalpha and other HMC experts
I’ve been thinking a bunch about why my high-dimensional models have trouble getting into the typical set and sampling without divergences, and why a large treedepth is needed, etc. I see in the manual that simplexes often require small step sizes for stability…
But in thinking about all of this, it occurred to me that one possible solution to the long treedepth and so forth would be some kind of momentum diffusion / thermostat algorithm.
As I understand it, HMC lifts the parameter space (location) to a product space of location and momentum, integrates the trajectory forward and backward in time, then projects back down to location space and chooses one point on the trajectory according to a multinomial distribution based on the lp (something like: take the lp along the trajectory, subtract the max for numerical stability, exponentiate, then normalize to sum to 1). The whole trajectory lies on a single energy level.
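The selection step I’m describing can be sketched like this (a minimal toy version, not Stan’s actual implementation; the trajectory length and lp values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def multinomial_select(lp, rng):
    """Pick one index along a trajectory with probability proportional
    to exp(lp): subtract the max so exp() can't overflow, then normalize."""
    w = np.exp(lp - np.max(lp))  # max-subtraction trick for stability
    w /= w.sum()                 # weights now sum to 1
    return rng.choice(len(lp), p=w)

# toy log-probabilities along a 5-point trajectory
lp = np.array([-1000.0, -1001.0, -999.5, -1002.0, -1000.5])
idx = multinomial_select(lp, rng)
```

Even with lp values around -1000, the max-subtraction keeps the weights well-behaved.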
Now, suppose that instead of discrete jumps in momentum (lift, evolve, project, where lift and project are the discrete jumps), we had something like a diffusion in momentum?
Call q the location and p the momentum; then we could have something like this (ignoring the correct specifics of symplectic integrators; hopefully experts can fill in the details appropriately):
q[i+1] = q[i] + dt * p[i]
p[i+1] = p[i] + dt * F(q[i]) + dp[i]
where F, the generalized force, is the gradient of the log-probability (i.e., the negative gradient of the potential energy), and dp is a random perturbation of the momentum at each time step.
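A toy version of this update, on a standard-normal target so F(q) = -q (the step size dt and the noise scale are arbitrary choices for illustration, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_log_p(q):
    # standard-normal target: log p(q) = -q.q/2, so the force F(q) = -q
    return -q

def diffusive_step(q, p, dt, noise_scale, rng):
    """One step of the proposed dynamics: a Hamiltonian drift plus a
    random kick dp to the momentum (sqrt(dt) scaling, as for a diffusion)."""
    dp = noise_scale * np.sqrt(dt) * rng.standard_normal(q.shape)
    p_new = p + dt * grad_log_p(q) + dp
    q_new = q + dt * p_new  # semi-implicit (symplectic-Euler) ordering
    return q_new, p_new

q = np.zeros(10)
p = rng.standard_normal(10)
for _ in range(100):
    q, p = diffusive_step(q, p, dt=0.05, noise_scale=0.5, rng=rng)
```

(I used the new momentum in the position update, the symplectic-Euler ordering, rather than the literal p[i] from my equations above; without the noise term that variant at least conserves the symplectic structure.)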
All this reminds me of some ideas I read a while ago by Denis Evans on thermostats in molecular dynamics. In particular, it seems we want a trajectory whose potential energies are all roughly equal to the typical-set value of the potential energy. So the idea would be to seek out a “temperature” for a “thermal bath” that keeps our trajectories flying around in the typical set. Then, after flying around for a while, simply sub-sampling that one LONG trajectory with the multinomial method (probabilities determined by the lp of the points) would get us a sample that “works”.
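One concrete way to realize the “thermal bath” would be an Ornstein-Uhlenbeck momentum refresh, as used in underdamped Langevin integrators; here gamma (friction) and T (bath temperature) are hypothetical tuning knobs, exactly the quantities the thermostat would have to adapt:

```python
import numpy as np

rng = np.random.default_rng(2)

def ou_thermostat(p, gamma, T, dt, rng):
    """Ornstein-Uhlenbeck momentum refresh: the exact solution of
    dp = -gamma*p dt + sqrt(2*gamma*T) dW over a step of length dt.
    It leaves p ~ N(0, T) invariant, i.e. couples p to a bath at temperature T."""
    c = np.exp(-gamma * dt)
    return c * p + np.sqrt((1.0 - c**2) * T) * rng.standard_normal(p.shape)

# repeated refreshes drive the momentum variance toward the bath temperature
p = np.zeros(100_000)
for _ in range(50):
    p = ou_thermostat(p, gamma=1.0, T=2.0, dt=0.1, rng=rng)
print(p.var())  # ≈ 2.0
```

Interleaving this refresh with deterministic Hamiltonian steps would give the kind of momentum diffusion sketched above, with T controlling which energy shell the trajectory settles into.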
Furthermore, this would actually use all the intermediate points along the path, instead of just one point from each level set.
I recognize this is pretty hand-wavy, but it’s just an intuition I had that I thought I’d throw out there to see if anyone has already thought about this, or if there are useful directions to go in related to this intuition.