Momentum Diffusion HMC?


Depending on the model, the more data you have, the better identified some of your parameters can be, and therefore the farther into the tail you tend to start, and the smaller the step-size you’ll need, thereby reducing the mixing rate when it comes to the less well identified parameters.

It sounds to me like you and I are experiencing similar types of issues, I guess the question is whether the diagnosis is correct, and whether any of the ideas I had for improving things would actually work. It’s all very sketchy at the moment.

Once we can re-start Stan runs reusing mass matrices, you could potentially do something like run this run you show to 500 iterations using short treedepth, then re-start at the final locations, with long treedepth and the mass matrix estimated previously, to get a good quality run. That would be at least a temporary type hack.

Do you have any luck using vb to find initialization points that reduce the explosion and ringing effect?


never had much luck with vb on these models, I would sooner just do the short treedepth thing for inits…


VB rarely works for me as an inference procedure on these problems, but it works as a way to get inits that aren’t too “explosive”. I think the ideal situation is

  1. vb init
  2. short treedepth HMC for a few samples (10 to 50?) to get into typical set.
  3. Use final sample from (2) as init for HMC with long treedepth and long run with sufficient adaptation.

(1) and (2) get you quickly to the typical set, whereas (3) gets you inference that isn’t too auto-correlated. by the time you’re doing (3) you’re in territory where NUTS/virial based sampling really is maximizing your neff, hopefully.


August 17

VB rarely works for me as an inference procedure on these problems, but it
works as a way to get inits that aren’t too “explosive”.

This is what the pymc3 devs have been suggesting (at least at one point)
and I can definitely see it working well in specific cases, but I’m not
sure VB will give you that for enough of the models Stan users are fitting
that it would be worth recommending this approach for general users. (I’m
not saying you were recommending that.)

I think the ideal situation is


I agree


strange internet


Screenshot from 2017-08-23 09-35-55

Here you can see an approximate, and true sample from a 2D distribution I’m calling “Scorpio”. It has 50% of the mass in the body, 10% in each tail segment and 30% in the stinger. Each segment is a multivariate normal, so the exact sample is just sampling independently from the mixtures.

The upper sample is from a kind of pseudo-hmc in which position is updated according to momentum, but momentum is updated according to a gradient calculation based on the SPSA approximation (requires basically 1 lp eval, though I average 2) plus a random normal perturbation, plus a viscous control force that is proportional to the error in the hamiltonian (too high, and it slows you down, too low and it speeds you up).

At the end of a trajectory, the samples are weighted by exp(-(Herror/delta)^2 + lp-maximum(lp)) and a location is chosen at random. delta is a tunable parameter. Because of the control force, you revisit the neighborhood of the average energy constantly along each trajectory (it’s a mean-reverting random process)

In all the cases I’ve looked at, it works pretty well and samples from a slightly increased entropy distribution, something like the original distribution convolved with a not very wide gaussian. In addition there’s a tunable trigger which completely resamples the momentum if the total energy change in one step is bigger than a certain amount. This prevents divergences from ejecting you from the high probability region. So far, I haven’t needed this trigger. So I don’t know how well it handles divergences.

This is all implemented in about 100 lines of julia code, and can sample approximately from arbitrary distributions by providing an lp function. Since it uses SPSA it doesn’t have to do any fancy auto-diff computing for getting gradients. But, since it uses SPSA it inevitably has noise in the gradient… and so my solution to that was to add additional noise to ensure the noise is more gaussian, and a feedback cruise-control to utilize the noise process to ensure that the system maintains its energy on average. If you plot potential and kinetic energy they go up and down the way they should, plus have a bunch of noise. If you plot total energy it looks like a white noise around 0 change.

This sample is 5000 samples from trajectories of 5000 steps and timestep of 0.002, there is just unit mass matrix, and initial momentum is sampled on unit normal scale at the beginning of each trajectory. Total time to compute 25 Million lp evals in Julia was something like 5 minutes I wasn’t paying attention very closely. Timing 5000 evals of the lp function takes 5 microseconds per eval or so. Calculating a gradient should take on the order of the same time as a function eval. So 25 million evals should have taken on order of 2 minutes, so 5 minutes makes some sense.

I have no proofs of anything, but it’s clear that the system samples from something similar to the target distribution. Q-Q plots are elongated and slightly wavy and soforth, as you’d expect when the thing only approximately samples.

Now, in my opinion, as approximations go, compared to VB this thing does a really nice job on fairly complex distributions. VB in this case would find one gaussian to smush over the whole image. Even if you gave it the opportunity to use a full covariance matrix, it’d still do a terrible job here.

I’ve got various ideas on what to do after running this, my first is to calculate a weighted average lp and sd of lp then sub-sample from those samples with lp in a trimmed range. Followed by exact metropolis. Every single point should be a point that is starting within the equilibrium distribution, and so some not very many metropolis steps later, you should have a very good sample. Metropolis won’t suffer from divergence issues, but it’s pretty stupid and won’t visit as widely as this scheme will in a given time. on the other hand, this scheme won’t sample from your desired distribution, it’ll only sample from a fuzzed out version of your desired distribution.

How does any of this help? Well, I don’t know. But I thought it was shockingly effective so I figured I’d put the ideas out there. I’ll probably put some details onto my blog shortly.

Concentration of divergences

Note, I’m pretty sure Stan would do this sample exactly, and in way less time. Stan is awesome. Most likely it wouldn’t diverge even in the corner region of the tail. I’ll run it and see what happens and post the picture and timing.

Is there even a case for my algorithm given Stan exists? It’s actually quite questionable, the main issue is whether the fuzzy nature of the diffusion HMC de-constrains the system enough to get through divergence inducing regions better, at the expense of giving you a sample from a not-quite-right but not too bad distribution which you could potentially then equilibriate down to correct.

Also, it’s pretty cool that you can get something HMC like without needing exact gradients by embracing noise. Anyway, I learned some Julia in the process ;-)


Screenshot from 2017-08-23 12-44-43

Upper image is the exact sample, lower image is the metropolized approximate sample after about 30 seconds of metropolizing. It’s eliminated the long tails pretty well.

Here’s the Q-Q plot of the post-metropolis x and y coordinates:

X coords:

Screenshot from 2017-08-23 12-46-29

and Y:

Screenshot from 2017-08-23 12-46-44

Note that it’s wiggly in a region where there’s relatively little data, y in 5-14 it’s those scorpion tail segments that contribute a total of 20% of the mass.


When you think about it, diffusion Metropolis is pretty darn bad at moving around the distribution, but it’s pretty darn great at equilibriating something that’s sampled from a slightly fuzzed out distribution. If you’re in the tails of the real distribution, and you make your step size small enough so you get ~ 50% acceptance, then because Metropolis always accepts upwards steps, Metropolis will walk right up the tail no problem provided the tail isn’t too light. In the core, it’ll just jiggle stuff around until it equilibriates any local lumpiness.

The big problem is if you don’t visit the main probability masses in the right proportions in the first place. For example if the approximate transition moved up the scorpion tail into the stinger, and then spent most of its time trapped there so that instead of close to 30% of the samples it had 45% or something like that. Metropolis won’t correct those kinds of biases, because it’s not going to cause points to flow down from the stinger into the tail and back into the main 50% body mass, at least not in a reasonable amount of time (in infinite time it’s guaranteed to do so but so what?).

This whole thing makes clearer what the problems are with divergence. Suppose that the entry to the tail was narrow and we started near the body, then we’d get into the body, but because the tail entry was very narrow, we’d never climb up the tail to the stinger, as we’d get divergences as soon as we got near the tail entry. Effectively the system becomes multi-modal.

Of course, if we run multiple chains from very diffuse starting points, we can detect this multi-modality.

One strategy I can think of that the momentum diffusion technique could perform, is to sample the momentum from a much too hot initial place, and then equilibriate down using the viscosity control. This could help you penetrate through certain barriers. In the end, when we’re sampling from the trajectory weighted by exponential quadratic difference from the proper energy, the portion of the time when we were “too hot” gets ignored.

From the perspective of near-invariance. If we start at lp0 and sample a momentum from normal(0,1) and then set our target energy, then we multiply our momentum vector by 2 or 3 or 4… and let-er-rip, it will take a transient deep punch into the high potential region, followed by bouncing back or penetrating to “the other side” but in either case eventually mean-revert to the proper energy by stochastic control.

Of course, you can’t do this if you want your system to be sampling directly from the appropriate distribution the way Stan does, but you can do it if you are willing to sample from something else.

A strategy for equilibriating multi-modal problems might be the following:

Take your full sample, and split it into two sub-samples, say by taking every other sample (why not, they’re auto-correlated anyway). Now at each time step, select one sample from group A and one from group B. With probability 1/2 either do a diffusion metropolis step on each one, or attempt to move A to B and move B to A.

In replica 1, if going from A to B increases your density, accept, otherwise accept with probability p(B)/p(A). Similarly in replica 2 if going from B to A increases density, accept, otherwise accept with probability p(A)/p(B).

Eventually if you have blobs with relative masses that are incorrect, you’ll stream mass between the different parts of the distribution until it’s in equilibrium.


What’s “SPSA”?

Unimodal approximations are usually pretty bad at modeling multimodal distributions :-) [Not that it’s stopped the Bleis and Jordans and Hoffmans of the world from building VI for LDA, but then I’d argue they’re not really doing Bayesian inference.]

This is always the problem with multimodal sampling. There is a substantial literature on this problem, including Michael Betancourt’s adiabatic sampler and some work Andrew’s been doing with Yuling Yao on path sampling.

That’ll depend on how much curvature there is. It will not be great at equilibrating the neck of the funnel.

Importance sampling works well, too, if it’s only “slightly fuzzed out”.


Simultaneous Perturbation Stochastic Approximation: it calculates an approximate gradient numerically using a single function eval.

It was developed for stochastic optimization, but it seems to work well here provided you get control of the noise.


Well, if the initial sample goes too deep into the high potential energy regions anyway, then there will be some excess mass in the funnel already, and it will be Metropolis’ job to pull things OUT of the funnel, which I think it should do ok.


This needs rethinking. because the above is incorrect, and doesn’t maintain the target distribution, it actually causes everything to concentrate into the high density regions. When both ensembles are in equilibrium, randomly swapping any 1 point from each does nothing to the distribution, regardless of what the two densities are. It is an interesting thing to think about, how to speed convergence by swaps between parallel chains. For example, accepting swaps that move each ensemble towards the mean lp across both, or something. The parallel tempering method for example, with one replica updating at slightly tempered wider distribution, and then things can move around more in that replica, swaps then can be accepted according to the appropriate parallel tempering probability.


Yes, parallel tempering did a good job of correcting the kinks in the Q-Q plots, tempering half the sample at T=3.5 vs the other half at T=1. The T=1 sample winds up removing the kinked biases observed above in the y coordinate (scorp 2 plot). I also think that this could work well for finding your way into a funnel. The higher temperature replica can penetrate deeper into those forbidden regions.

The huge advantage of HMC is its ability to visit widespread regions. The advantage to metropolis is its dogged banging against the region right where it currently is. If you start with an ensemble that is wide-spread in the typical set, but mildly biased, if you bang it around with metropolis it does a decent job of correcting things. The parallel tempering with one replica at higher temperature is basically the importance sampling you mention. In the limit, the higher temperature in the second replica means p2(x) = exp(lp(x)/T for T > 1, produces a spread out version of the target distribution. Then swapping individual samples from the ensemble allows you to bang against regions like funnels where the higher temp replica will go deeper into the funnel area automatically.


I hope you realize there is a huge MCMC literature which has spent a long time considering things like parallel chains and tempering and warping spaces, and combining little steps from this and that.

It’s hard to see how any of this is going to do better than HMC, which uses the gradients to follow the flow through the typical set.

There’s no advantage to doggedly banging around where you’re at. You want the sampler to move to produce high effective sample sizes, not spend its time in one place heavily autocorrelated.



Yes, I do know that pretty much anything you think of in MCMC has been tried. I’ve read a moderate number of articles on these things. I also have solved lots of actual mechanical problems using ODEs so I have a sense of how mechanical problems behave. For the most part, I just use Stan ;-)

But, I also do things for which

Is not true. And the advantage of being able to range over long distances and have momentum can fail. Let me give you a simple example. In RNA-seq data you sequence something like 30 million short RNAs and then you look at the sequence and see which gene it came from and then produce a set of counts for each gene. Now, some genes are just really highly expressed in a certain tissue. So perhaps of the 30 million reads 1 million come from a certain gene.

On the other hand, there are say 20,000 genes. So some genes have 0, 1 or 10 reads.

Now, you’re trying to simultaneously infer an underlying fact about the genes given these count data. In some genes the posterior distribution of this number is like 0 ± 2

In other genes the posterior distribution of this number is 13 ± 0.01 and in other genes the posterior distribution is 2 ± 0.1

Now start Stan at a random location. It’s a little like an enormous board with 20,000 slots in it that 20,000 balls are rolling around in. We let it fall down into the potential wells… Suppose it’s starting with diagonal mass matrix 1. Along the 0±2 dimension, it happily oscillates around in its little final resting spot, no problem. In another slot, once it hits 2 ± 0.1 it gets a divergence because it steps over the 0.1 width well and slams into the wall at the other side… so it drops the step size, and it gets to 0.0002 step size and it finds 2 ± 0.1 just fine maybe, but it’s taking 3 minutes per sample and using treedepth 15.

Now, it continues with step size 0.002 trying to hit 13 ± 0.01 and it takes forever. If it’s estimating a mass matrix it will decide that this dimension has big variance because it’s basically a wide flat trough. It’s just been going the same direction the whole time and eventually it will hit the other side and have to come back all the way the other way… Then it finally gets to 12 and steps right over the whole well to 14 where it’s just slammed into the most enormous wall ever, and diverges.

Now, after a while because its step size is wrong and divergences etc its final distribution for this thing is something like 12, and the chain was stuck because it just kept slamming against that hard wall up at 13.01 and bouncing off.

Now, from my perspective, there are 20,000 genes with widely varying posterior variances. Sure, I can rescale things so that all the means are O(1) or so, but I can’t examine all 20,000 variances and give Stan a re-parameterization for each one etc.

On the other hand, if I try to do this with Metropolis, forget it. The random walk diffusion never finds even the vicinity of the correct posterior vector. Getting out to 12 takes a decade.

So, in this kind of context, if I can use HMC to just get me in the vicinity of the posterior vector of 20,000 measurements, and then bang around in the tiny highly identified regions… I do better than if I use HMC the whole time because the momentum of HMC causes oscillations and divergences that I can’t manually accomodate.

I’ve worked now on 3 or 4 of these problems, where I’m trying to simultaneously infer information about a large number of objects, and some of them are highly identified, and some of them are not, and inevitably I can’t avoid divergences. I fiddle around with it adjusting parameters etc, but Stan always has at least one of the 5000-20000 dimensions where it just can’t handle the dynamics.

Now, yes, Riemannian manifold could maybe do it. In fact I look forward to that a lot! But, as is, I think sometimes the momentum that is a huge help at getting out far away from the initial point and finding the general region of the high probability set, is a hinderance once you get there due to divergences caused by highly identified parameters.

Now, if I use HMC to go find the vicinity of the typical set, it is going to do it quickly compared to the decade that it takes Metropolis, it’s once it gets near there that it gets stuck and has divergence problems. On the other hand, if it’s in the vicinity of the typical set which is reasonably well identified and has widely ranging posterior variances, then in many cases, diffusive metropolis converges exponentially. A two-temperature parallel tempering scheme can help this convergence a lot by acting a lot like the importance sampling you mentioned.

From this perspective, HMC is acting mainly like a search-mechanism for the typical set. Once I’m in the typical set, it’s so narrowly defined that diffusive metropolis could be just fine, and in some sense, better because its lack of divergence. It will just bang around in these tiny wells. So my favorite thing would be to start HMC at 100 initial points, run it until it finds the typical set in each one, and stop, followed by some diffusive metropolis / parallel tempering on the resulting ensemble. Also, this ensemble technique is trivially parallelizable.

And that’s almost exactly what you see above. HMC here is approximate, but it doesn’t matter, because I’m not relying on it to give me the stationary distribution, I’m just relying on it to find me an ensemble of starting points that together are approximately in equilibrium… and so if any one diffusion can’t go far, it doesn’t matter because there are 100 or 1000 of them all sprinkled around the typical set.

I wonder if my experience is unusual, or if others have the same issue. This continual divergences and tiny step sizes problem occurred for me with a simultaneous model for 2000+ census locations, with a model for alcohol consumption for 2000 people who answered a CDC survey, and with a model for gene expression on 6000 genes. Each one has this feature that certain subsets of measurements are highly HIGHLY identified, and others not so much. Also there’s usually some higher-level parameters that tie the measurements together which are moderately identified, but get stuck.


With these problems, they’re almost, but not quite N independent random variables. For example in the census problem, i’m trying to find 2000+ functions that describe how expenses relate to family size. You could almost do this as 2000 separate regressions, but if you did that you couldn’t incorporate spatial pooling and inference about how scaling with family size is common across regions, etc. Still, once HMC finds you the typical set, everything is pretty narrowly identified, and metropolis on an ensemble of 100 state vectors would equilibriate the whole thing in the time it takes HMC to complete a single sample along its enormous treedepth = 15 trajectory with step size 4e-6. Typically I can use VB followed by HMC to get into the typical set in something like 10 Stan samples, maybe 30 seconds to 3 minutes. But once there, I suspect due to a few highly identified dimensions, the step size crashes and getting 100 samples takes a day. On 4 chains maybe 2 of them will be completely stuck, and 2 will actually give fine inference but take that full day.

On the other hand, if I want 100 samples I could do 100 random widely dispersed starting points, 30 seconds to 3 minutes each, parallel on 4 cores, each one will be in the typical set. If I can do approximate HMC and it doesn’t calculate expensive gradients, all the better. Then metropolizing the ensemble into full equilibrium might take another 3 minutes, or it might take 10 minutes… but I don’t think it will take a day, and I think it will have less chance of getting stuck, particularly with the parallel tempering as a way to mix the ensemble well.

Now that I have an approximate-HMC-ifier, I will try this on one of my sticky high dimensional models and report back… Biggest issue is writing a manual lp function in Julia for my models. The work you guys have put into unconstraining and reconstraining and just doing things really well is very apparent when you start to try to do some of it manually!


Thinking about parallel tempering in this context here’s why I think it makes sense:

Suppose you do approximate HMC with numerical SPSA gradients and momentum noise starting from a position generated from your prior, and going until you get into the vicinity of the high probability region by virtue of having a high acceptance rate along your trajectory. Then you move around in this region for a while very approximately, not attempting to converge exactly to the correct distribution. By the time you’re done, you have a kind of blanket over the high probability region but you’re spending too much time in the tails. At least though, when your posterior is way more identified than your prior, you’re in the right region of space compared to whatever you could do for inits on your own.

Now, some of these approximate HMC samples are in the actual typical set, and some are outside it at higher entropy/potential energy. If you randomly partition these samples into two ensembles, and then metropolize these ensembles using parallel tempering with one ensemble at higher temperature, each time you try a swap, and find a pair where the main ensemble has something too deep in the tail, you WILL swap these molecules around because it will be energetically favorable to put that tail molecule from the main ensemble up into the higher temperature ensemble. So, it’s like a Maxwell’s Daemon for the samples, it assorts them into “those close to the right temperature T=1” and “those that make more sense at higher temperature T=1+b” for some positive b. This process could be quite quick because it occurs with probability 1 when the energy change is favorable, and in the beginning, if you have a blanket over the region, you will have plenty of opportunity for favorable ejection of tail samples from the main ensemble, annealing the main ensemble down to the right distribution relatively rapidly. In fact, when I ask my ensemble above to do it 20n times, it rapidly eliminates the samples in the tail of the T=1 ensemble in favor of ejecting them to the higher temp T=3.5 ensemble.

Initially in this thread my idea was to blanket the typical set evenly, and then sample proportional to density to importance sample down to a close distribution. But what I’ve found is my SPSA-HMC is actually better than that. it doesn’t blanket the typical set with uniform probability, it gets kind of close to the correct probability, and this is a problem, because now if I want to importance sample, I need p(x)/q(x) where q(x) is the density I sampled from, and I don’t know it. it’s not uniform on the typical set, it’s more like p(x) fuzzed out. Well, the parallel tempering / Maxwell’s Daemon idea takes care of that, using p(x)^(1/T) as a fuzzed ensemble to eject samples into assorting the locations into those that make sense in the p(x) ensemble, and those that make sense in the higher temperature wider ensemble.

In the mean time, you’re diffusing around a bit, and as long as you’re in the right region of space, you don’t have divergences because no momentum, and so you can’t get partially ejected by a bad step size. As you say, Stan only triggers a divergence for a fairly large one. That means when you’re on the edge of stability, Stan might be taking largish pseudo-divergent steps which eventually converge back to correct, or which take a while to diverge, thereby using up a lot of computer time doing something oscillatory and ultimately unhelpful, when it could just be diffusing around, finding the highly identified region that it’s right next to. With small steps you can’t travel far, but this also means you can’t pseudo-diverge. If you do a lot of the far-traveling up-front, with large step sizes and fast convergence to the general vicinity of the high probability region… followed by a combination of Maxwell’s Daemon to assort the locations, and minor diffusion that never diverges… it seems likely this will solve the kind of problem I’ve seen where Stan just will not sample in reasonable time without divergences, and demands step sizes of 1e-6 and has wildly varying estimates of the mass matrix depending on what starting points it randomly chose.


This is getting focused on the wrong thing. When Stan diverges is exactly when something like Metropolis-Hastings will fail to sample from the correct distribution.