@Bob_Carpenter

Yes, I do know that pretty much anything you can think of in MCMC has been tried. I’ve read a moderate number of articles on these things, and I’ve also solved lots of actual mechanical problems using ODEs, so I have a sense of how such systems behave. For the most part, I just use Stan ;-)

But I also work on problems for which that is not true, where the advantage of being able to range over long distances with momentum can fail. Let me give you a simple example. In RNA-seq data you sequence something like 30 million short RNA fragments, look at each sequence to see which gene it came from, and produce a set of counts per gene. Now, some genes are just really highly expressed in a certain tissue, so perhaps 1 million of the 30 million reads come from a single gene.

On the other hand, there are, say, 20,000 genes, so some genes get only 0, 1, or 10 reads.

Now, you’re trying to simultaneously infer an underlying quantity for each gene from these count data. For some genes the posterior distribution of this quantity is like 0 ± 2.

For other genes the posterior distribution is 13 ± 0.01, and for others it is 2 ± 0.1.
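A quick back-of-envelope sketch of why the scales vary so wildly (toy numbers of mine, not from any particular dataset): for roughly Poisson counts, the posterior sd of a log-expression parameter scales like 1/sqrt(reads), so a gene with a million reads is pinned down about 300 times more tightly than a gene with ten:

```python
import math

# Back-of-envelope: for approximately Poisson counts with a weak prior,
# a Laplace approximation gives the posterior sd of log-expression as
# roughly 1 / sqrt(reads + 1).  (The +1 is a stand-in for the prior.)
for reads in [0, 10, 1000, 1_000_000]:
    sd = 1.0 / math.sqrt(reads + 1)
    print(f"{reads:>9} reads -> posterior sd of log-expression ~ {sd:.4f}")
```

So a single posterior mixes dimensions whose widths span three or four orders of magnitude, which is exactly the situation below.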

Now start Stan at a random location. It’s a little like an enormous board with 20,000 slots that 20,000 balls are rolling around in; we let them fall down into the potential wells. Suppose it starts with a unit diagonal mass matrix. Along the 0 ± 2 dimension it happily oscillates around its final resting spot, no problem. In another slot, once it hits 2 ± 0.1 it gets a divergence, because it steps over the 0.1-wide well and slams into the wall on the other side. So it drops the step size, and at a step size of 0.0002 it handles 2 ± 0.1 just fine, maybe, but it’s taking 3 minutes per sample and using treedepth 15.

Now it continues with step size 0.0002 trying to hit 13 ± 0.01, and that takes forever. If it’s estimating a mass matrix, it will decide that this dimension has a big variance, because the approach is basically a wide, flat trough: the sampler has been going in the same direction the whole time, and eventually it will hit the other side and have to come all the way back. Then it finally gets to 12, steps right over the whole well to 14, slams into the most enormous wall ever, and diverges.

Now, after a while, because its step size is wrong and it keeps diverging, its final distribution for this quantity is something like 12: the chain got stuck because it just kept slamming against that hard wall up at 13.01 and bouncing off.
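This divergence mechanism is easy to reproduce in isolation. A minimal sketch (toy numbers of mine, plain leapfrog, not Stan itself): leapfrog integration of a 1-D Gaussian with sd sigma is only stable when the step size is below about 2*sigma, so a step size fine for the 0 ± 2 dimension blows up catastrophically inside a ± 0.01 well:

```python
import numpy as np

def leapfrog(q, p, eps, n_steps, sigma):
    """Leapfrog integration for the potential of a 1-D N(0, sigma) target."""
    grad = lambda q: q / sigma**2          # gradient of -log density
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad(q)
        q = q + eps * p
        p = p - 0.5 * eps * grad(q)
    return q, p

def energy(q, p, sigma):
    """Hamiltonian: potential of N(0, sigma) plus unit-mass kinetic term."""
    return 0.5 * q**2 / sigma**2 + 0.5 * p**2

sigma = 0.01                               # the 13 +/- 0.01 well, recentered at 0
q0, p0 = 0.02, 1.0
e0 = energy(q0, p0, sigma)

# A step size suited to a sd-2 dimension is 10x the stability limit here:
q, p = leapfrog(q0, p0, eps=0.1, n_steps=10, sigma=sigma)
print("energy ratio, eps=0.1  :", energy(q, p, sigma) / e0)   # astronomically large: a divergence

# Below the ~2*sigma stability limit the energy error stays tiny:
q, p = leapfrog(q0, p0, eps=0.002, n_steps=10, sigma=sigma)
print("energy ratio, eps=0.002:", energy(q, p, sigma) / e0)   # close to 1: stable
```

The point is that no single step size serves both scales: the one that is stable in the narrow well is agonizingly slow in the wide one.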

Now, from my perspective, there are 20,000 genes with widely varying posterior variances. Sure, I can rescale things so that all the means are O(1) or so, but I can’t examine all 20,000 variances and hand Stan a reparameterization for each one.

On the other hand, if I try to do this with Metropolis, forget it. The random-walk diffusion never even finds the vicinity of the correct posterior vector; getting out to 12 takes a decade.
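The decade is hyperbole, but the scaling problem is real and easy to illustrate. A toy sketch (my numbers): a random-walk Metropolis chain whose proposal is scaled to the tight wells (sd 0.1) needs hundreds of iterations just to drift in from 12 on a 1-D standard-normal target, whereas a single HMC trajectory with momentum covers the same distance in roughly distance/step leapfrog steps:

```python
import numpy as np

rng = np.random.default_rng(1)

def logp(x):                       # 1-D standard normal target (toy stand-in)
    return -0.5 * x * x

x = 12.0                           # start far out in the tail
step = 0.1                         # proposal scale tuned to the tight wells
hit = None
for i in range(200_000):
    prop = x + rng.normal(0.0, step)
    if np.log(rng.uniform()) < logp(prop) - logp(x):
        x = prop                   # accept
    if abs(x) < 2.0:               # reached the typical set
        hit = i
        break
print("random-walk iterations to reach the typical set:", hit)

# With momentum, one HMC trajectory of ~distance/step = ~120 leapfrog
# steps would carry the sampler the same 12 units in a single sweep.
```

And that gap only widens in 20,000 dimensions, which is why I still want HMC for the approach phase.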

So, in this kind of context, if I can use HMC just to get me into the vicinity of the posterior vector of 20,000 measurements, and then bang around in the tiny, highly identified regions, I do better than if I use HMC the whole time, because the momentum of HMC causes oscillations and divergences that I can’t manually accommodate.

I’ve now worked on 3 or 4 of these problems, where I’m trying to simultaneously infer information about a large number of objects, some of them highly identified and some not, and inevitably I can’t avoid divergences. I fiddle around with the tuning parameters, etc., but Stan always has at least one of the 5,000 to 20,000 dimensions where it just can’t handle the dynamics.

Now, yes, Riemannian-manifold HMC could maybe do it. In fact, I look forward to that a lot! But as things stand, I think the momentum that is a huge help in getting far away from the initial point and finding the general region of the high-probability set is sometimes a hindrance once you get there, due to divergences caused by highly identified parameters.

Now, if I use HMC to find the vicinity of the typical set, it will do so quickly compared to the decade Metropolis takes; it’s once it gets near there that it has divergence problems and gets stuck. On the other hand, once a chain is already in the vicinity of a typical set that is reasonably well identified but has widely varying posterior variances, then in many cases diffusive Metropolis converges exponentially. A two-temperature parallel tempering scheme can help that convergence a lot, acting much like the importance sampling you mentioned.
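For concreteness, here is a minimal two-temperature tempering sketch (toy bimodal target and tuning constants are mine): a hot replica at temperature 1/beta sees a flattened landscape and crosses barriers easily, and occasional swap moves, accepted with probability min(1, exp((beta_cold - beta_hot)(logp(x_hot) - logp(x_cold)))), hand those distant states to the cold chain:

```python
import numpy as np

rng = np.random.default_rng(2)

def logp(x):
    # Toy bimodal target: equal-weight normals at -3 and +3, sd 0.5.
    return np.logaddexp(-0.5 * ((x + 3) / 0.5) ** 2,
                        -0.5 * ((x - 3) / 0.5) ** 2)

betas = np.array([1.0, 0.2])       # cold chain plus one hot chain (T = 5)
x = np.array([-3.0, -3.0])         # both replicas start in the left mode
cold_trace = []

for i in range(20_000):
    # Within-temperature random-walk Metropolis updates.
    for k in range(2):
        prop = x[k] + rng.normal(0.0, 1.0)
        if np.log(rng.uniform()) < betas[k] * (logp(prop) - logp(x[k])):
            x[k] = prop
    # Periodic replica-swap move between the two temperatures.
    if i % 10 == 0:
        log_r = (betas[0] - betas[1]) * (logp(x[1]) - logp(x[0]))
        if np.log(rng.uniform()) < log_r:
            x[0], x[1] = x[1], x[0]
    cold_trace.append(x[0])

cold_trace = np.array(cold_trace)
print("fraction of cold-chain samples in the right mode:",
      (cold_trace > 0).mean())
```

The cold chain alone would essentially never cross the roughly 17-nat barrier; with the hot replica feeding it swaps, it visits both modes freely.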

From this perspective, HMC is acting mainly as a search mechanism for the typical set. Once I’m in the typical set, it’s so narrowly defined that diffusive Metropolis could be just fine, and in some sense better because of its lack of divergences. It will just bang around in these tiny wells. So my favorite scheme would be to start HMC at 100 initial points, run each until it finds the typical set, and stop, followed by some diffusive Metropolis / parallel tempering *on the resulting ensemble*. This ensemble technique is also trivially parallelizable.

And that’s almost exactly what you see above. HMC here is approximate, but it doesn’t matter, because I’m not relying on it to give me the stationary distribution; I’m just relying on it to find me an ensemble of starting points that together are approximately in equilibrium. So if any one diffusion can’t go far, it doesn’t matter, because there are 100 or 1000 of them all sprinkled around the typical set.
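To make the two-phase idea concrete, here is a minimal sketch on a toy 3-parameter target with the scales from above (0 ± 2, 13 ± 0.01, 2 ± 0.1); the target, step sizes, and iteration counts are all mine, purely for illustration. Phase one runs short, unadjusted HMC from many random starts just to locate the typical set; phase two estimates per-dimension scales from the ensemble spread itself and runs plain random-walk Metropolis on every walker in parallel:

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.0, 13.0, 2.0])     # the 0+/-2, 13+/-0.01, 2+/-0.1 wells
sd = np.array([2.0, 0.01, 0.1])

def logp(x):                        # independent Gaussians, vectorized over walkers
    return -0.5 * np.sum(((x - mu) / sd) ** 2, axis=-1)

def grad_logp(x):
    return -(x - mu) / sd**2

# Phase 1: unadjusted HMC (no accept/reject step, since it only has to
# *find* the typical set, not sample it exactly).  The step size must
# respect the stiffest direction: 0.005 < 2 * 0.01.
walkers = rng.normal(0.0, 10.0, size=(100, 3))
eps, n_leapfrog = 0.005, 2000
for _ in range(30):                 # momentum refresh each round
    p = rng.standard_normal(walkers.shape)
    q = walkers.copy()
    p = p + 0.5 * eps * grad_logp(q)
    for _ in range(n_leapfrog - 1):
        q = q + eps * p
        p = p + eps * grad_logp(q)
    q = q + eps * p
    p = p + 0.5 * eps * grad_logp(q)
    walkers = q

# Phase 2: the ensemble spread estimates the per-dimension scales, so
# every walker gets a correctly sized diffusive proposal, in parallel.
scales = walkers.std(axis=0)
for _ in range(500):
    prop = walkers + rng.standard_normal(walkers.shape) * scales
    accept = np.log(rng.uniform(size=len(walkers))) < logp(prop) - logp(walkers)
    walkers[accept] = prop[accept]

print("estimated scales:", scales)
print("ensemble means:  ", walkers.mean(axis=0))
```

The nice part is that nobody ever hand-tunes a per-dimension step: the ensemble, once HMC has dropped it into the wells, tells you the scales directly.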

I wonder if my experience is unusual, or if others have the same issue. This problem of continual divergences and tiny step sizes occurred for me with a simultaneous model of 2,000+ census locations, with a model of alcohol consumption for 2,000 people who answered a CDC survey, and with a model of gene expression on 6,000 genes. Each one has the feature that certain subsets of measurements are highly, HIGHLY identified and others not so much. There are also usually some higher-level parameters tying the measurements together that are moderately identified, but they get stuck.