Fast / Slow dynamics and widely varying degrees of identification

I think, having done a bunch of example problems with my approximate stochastic HMC stuff, that I’ve identified a common issue to many of the high dimensional problems I’ve encountered using Stan. (Note, Stan is fantastic software, and often does a very good job even despite these types of issues, yet, as dimensions grow, there are more and more opportunities to hit this snag).

The issue is that often the posterior distribution of the various variables has widely varying scales. In this case, the difficulty is that to integrate the Hamiltonian system stably requires small time steps, and this is true even if there is no difficult correlations to follow in the high dimensional space.

Consider the following simple molecular analogy. We have an ideal gas in a box consisting of 3000 point-like molecules, and one di-atomic molecule with tight binding between the atoms. Now, if we treat only the point-like molecules, suppose that the mean time between collisions of these molecules is something O(1), so we can use time steps O(1) and see a few thousand collisions in computing time O(1000) (where think here of the O notation as meaning times or divide by a factor smaller than around 4).

Now, suppose we add this one di-atomic molecule with tight binding. Suppose that the oscillation time of the molecular bond is O(epsilon << 1) then in time 1 between collisions, we have O(1/epsilon) oscillations, a very large number. We now need to use time steps O(epsilon) and for say epsilon = 0.001 in our 1000 units of computing time, we see only 1 molecular collision between molecules. Essentially all the molecules hang in the air, while our computer spends all its cycles calculating the furious vibrations of our one di-atomic molecule. If we don’t use this time-step, the di-atomic molecule over-steps its vibration boundaries, the nuclei sit on top of each other, and we artificially create massive amounts of energy and get a massive explosion (a divergence).

Now, considering a typical Stan model, a “fast” dynamic occurs when some particular parameter’s posterior distribution can be approximated as a gaussian with very tight standard deviation. For example, suppose you are doing something like measuring simultaneously several thousand measurements related to some scientific process (say for example temperature at 2000 personal weather stations or something). You have several measurements from each weather station, and some locations have wide swings in temperature, say a desert with overnight lows of 55F and mid day highs of 110F, whereas other locations in the tropics vary from between 73 at night to 74 in the daytime.

Now, if you’re looking at mean daily temperature, in order to sample from the full posterior you’ll need to move the desert temperatures through an oscillation varying from 55 to 110, whereas the tropical temperature moves through an oscillation varying from 73 to 74. If you sample unit momentums, and do not yet know what the mass matrix should look like, your time-step will be O(0.1) to stably sample the tropics, but will take O(400) time steps for 1 effective sample from the desert area.

Worse yet, starting at a random initial condition, you will be FAR out of equilibrium, let’s say you’ve got T=8305F in the tropics, and in this tightly constrained dimension, you will experience extreme forces, causing you to gain extreme velocities as you head towards T=74. At these high velocities, in order not to overshoot your T=74 and fly deep into divergent territory, you will need timesteps something like dt * v = 0.1 and v may be thousands instead of O(1), so just to get initially into the typical set may take you say 400,000 timesteps, taking a few hours before you even can reliably start to estimate a mass matrix. Now, if you force Stan to do short treedepth it will bleed the excess kinetic energy out of the system, helping you a lot to get into the typical set as the velocity will remain O(1) or at least say O(10) or the like, but it will then have insufficiently long treedepth later in the computation. Still, it’s a decent way to get initialization points.

Unfortunately, with say 2000 different regions you’re trying to model, you really ahead of time have no idea what the posterior distribution of each one looks like, if you did you wouldn’t need Stan! And so manual rescaling of the parameters to give Stan in-code hints on the mass matrix that would help a lot here, is not practical. In essence, the low-scatter “tropics” are like a very heavy molecule, one that even with enormous forces on it, doesn’t gain crazy velocity. Whereas in a real physical simulation we are not free to set the masses and the potential binding energies, since the Stan version is entirely invented, we could in fact just imagine a high mass for this tropical parameter if we knew we should be doing so.

The initialization problem can sometimes be improved by starting from something like the maximum a-posteriori estimate, or the VB estimate or the sample means or whatever, but this still leaves us with fast-slow dynamics during the trajectory. Once Stan gets a decent estimate of the posterior variance it can estimate a mass matrix, but with widely varying scales, if we don’t let Stan run for quite a long time, these scale estimates are very poor for dimensions that are not yet in equilibrium.

In the end, working with problems involving thousands of parameters some of which are highly identified and others not, strategies for coping with this problem are very welcome.

Any thoughts on addressing this issue?

1 Like

The deeper problems are that

  1. diagonal scaling isn’t sufficient for problems with correlation (e.g., bivariate normal with high correlation)

  2. no static scaling is sufficient for problems with varying curvature (e.g., hierarchical models)

In both cases, we have to be conservative and wind up with very small step sizes.

There is a lot of work that people have done on slow-fast Hamiltonian integration for the purposes of things like Molecular Dynamics. This paper is a nice example:

https://arxiv.org/abs/0908.1241

It would be interesting to have a stiff ODE solver for some of our stiff Hamiltonian systems. I take it the “fast” and “slow” here are the usual conditioning issue w.r.t. stiffness.

If you manage to code something that works over our large suite of examples better than our current system does, we’ll certainly incorporate it into Stan itself. Michael has a repo with MCMC evaluations set up for about a dozen models which we know are well behaved so that we can figure out what the right answer is.

Yes, stiffness fast/slow etc are all about the same thing, namely needing tiny timesteps even though many of the variables are well behaved and don’t change rapidly. For example you could think of modeling ignition of a fuel/air mixture. To get the chemical reactions you have to see the molecular collisions, so solve the ODE for all the molecules at femtosecond scales, but the reaction occurs say over 1 second (lighting your stove for example), and to see the flame front travel takes 1/10 of a second, and to get to steady state takes 3 seconds. So you need 3 trillion femtosecond timesteps. Well, the details of the collisions aren’t that important, so you could discover their statistics by running a few hundred iterations of the femtosecond scale, and then project the flame front forward in time at the 1/100 second scale based on those statistics, then re-start to get new statistics on the fast dynamics, etc.

I’m going to continue to look at this because I think it’s relevant, and will come back from time to time with any results I get, but it may take quite some time, and if RM-HMC becomes available, the need for it may go away.

It looks to me like the linked paper is suggesting that you could get useful results if you can turn off and on the “fast” dynamics. Doing that is a little like block-updates of the system. If you have K fast variables, you update everything at tiny timesteps for a while. Then you freeze the K fast variables, and do the HMC dynamics on the other N-K for much longer timesteps. Then repeat the whole process.

The problem with this, is that we don’t know which are the fast and which are the slow variables (and of course, in some sense these can change with position as in hierarchical models).

You might be able to add something to the Stan language to hint to the compiler how to block the variables, and what timestep you think they might need:

vector [100] foo HINT(dt=O(epsilon), block=1)
vector [10] bar HINT(dt=O(epsilon^3), block=2)

Then you could update in blocks, Stan would infer the base size of dt=epsilon, and would try other blocks with dt of the order of magnitude given by the dt hint.

Or some such thing.

Riemannian Hamiltonian Monte Carlo (RHMC) is available in C++, though we haven’t thoroughly tested all the special function higher-order derivatives, the densities and basic math libs are all well tested.

The bottleneck for RHMC is that each leapfrog step is a cubic operation in the number of parameters.

Andrew’s been suggesting something similar that we could use to initialize scales of the different parameters. You need to know the posterior scales, not the prior scales, and as you say, it can vary dramatically around the posterior, which is where we usually run into problems. But we do also see exactly this kind of issue with stiffness. And it interacts in subtle ways I don’t quite understand with the no-U-turn condition.

Yikes! The models where I really need this stuff have well over 1000 parameters. For example 2000 people * a 7d simplex for my alcohol model, and 2500 public use microdata areas for my family expenses and income model. If I assume the autodiff stuff currently is say O(N) in parameters, this suggests that RMHMC will be several million times slower per step, so unless it gets me the same Neff in 1 millionth of the number of steps… it isn’t going to help my typical case.

Also I assume it’s probably inverting some matrix that’s O(N^3) in time? and so the matrix storage is probably N^2 and now we’re talking a quarter of a gigabyte of storage, and on 4 chains, we’re talking a gig of storage. Sure storage is cheaper these days, but …

Anyway, thanks for the heads up on that. It makes finding some possible alternatives more attractive.

Yes, see Michael Betancourt’s arXiv paper on computing metrics for HMC.

There are diagonal approximations (but that loses a lot of power), and something could be done cleverly with blocking and/or sparsity, but those are all research and/or hard engineering problems themselves.