I think, having done a bunch of example problems with my approximate stochastic HMC stuff, that I’ve identified an issue common to many of the high-dimensional problems I’ve encountered using Stan. (Note: Stan is fantastic software and often does a very good job even in spite of these issues, yet as dimensions grow there are more and more opportunities to hit this snag.)
The issue is that the posterior distributions of the various parameters often have widely varying scales. In that case, integrating the Hamiltonian system stably requires small time steps, and this is true even if there are no difficult correlations to follow in the high-dimensional space.
Consider the following simple molecular analogy. We have an ideal gas in a box consisting of 3000 point-like molecules, plus one di-atomic molecule with a tight bond between its atoms. If we treat only the point-like molecules, suppose that the mean time between collisions is O(1), so we can use time steps of O(1) and see a few thousand collisions in computing time O(1000) (think of the O notation here as meaning “times or divided by a factor smaller than around 4”).
Now, suppose we add the one di-atomic molecule with its tight bond. If the oscillation period of the bond is O(epsilon), with epsilon << 1, then in the time of order 1 between collisions we get O(1/epsilon) oscillations, a very large number. We now need time steps of O(epsilon), and for, say, epsilon = 0.001, our 1000 units of computing time cover only about one collision between molecules. Essentially all the molecules hang in the air while our computer spends all its cycles calculating the furious vibrations of our one di-atomic molecule. If we don’t shrink the time step, the di-atomic molecule oversteps its vibration boundaries, the nuclei sit on top of each other, we artificially create massive amounts of energy, and we get a massive explosion (a divergence).
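To make the analogy concrete, here's a minimal NumPy sketch (my toy, not anything Stan does internally) of leapfrog integration on two uncoupled harmonic oscillators, one slow and one stiff. Leapfrog on a harmonic oscillator is stable only when dt * omega < 2, so the stiff "bond" forces dt down to O(epsilon) even though the slow oscillator would be happy with dt = O(1):

```python
import numpy as np

# Toy version of the stiff-molecule analogy: a "slow" oscillator with angular
# frequency 1 and a "fast" bond with frequency 1/eps (eps = 0.001, as in the text).
eps = 1e-3
omega = np.array([1.0, 1.0 / eps])

def grad_U(q):
    return omega**2 * q                       # force from U(q) = 0.5 * sum((omega * q)^2)

def energy(q, p):
    return 0.5 * np.dot(p, p) + 0.5 * np.sum((omega * q)**2)

def leapfrog(q, p, dt, n_steps):
    q, p = q.copy(), p.copy()
    for _ in range(n_steps):
        p -= 0.5 * dt * grad_U(q)
        q += dt * p
        p -= 0.5 * dt * grad_U(q)
    return q, p

q0 = np.array([1.0, eps])                     # each coordinate displaced by ~its own scale
p0 = np.array([1.0, 1.0])
E0 = energy(q0, p0)

with np.errstate(over="ignore", invalid="ignore"):   # unstable runs overflow to inf
    for dt in [0.1, 0.01, 0.5 * eps]:
        n_steps = int(round(1.0 / dt))        # integrate for one "collision time" in each case
        q, p = leapfrog(q0, p0, dt, n_steps)
        print(f"dt = {dt:8.5f}  steps = {n_steps:5d}  energy error = {energy(q, p) - E0:.3e}")
```

The runs with dt much bigger than epsilon blow the energy up toward infinity, which is exactly the "massive explosion" / divergence described above; only the O(epsilon) step keeps the energy error small.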
Now consider a typical Stan model. A “fast” dynamic occurs when some particular parameter’s posterior distribution can be approximated as a Gaussian with a very tight standard deviation. For example, suppose you are simultaneously making several thousand measurements related to some scientific process (say, temperature at 2000 personal weather stations or something). You have several measurements from each weather station, and some locations have wide swings in temperature, say a desert with overnight lows of 55F and midday highs of 110F, whereas other locations in the tropics vary from 73F at night to 74F in the daytime.
Now, if you’re looking at mean daily temperature, then to sample from the full posterior you’ll need to move the desert temperatures through an oscillation from 55 to 110, whereas the tropical temperature moves through an oscillation from 73 to 74. If you sample unit-scale momenta and do not yet know what the mass matrix should look like, your time step will need to be O(0.1) to sample the tropics stably, and you will then need O(400) time steps for one effective sample from the desert area.
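Just to put rough numbers on that (these are my own back-of-the-envelope figures, assuming the tropics posterior has a standard deviation of about 0.5F and an identity mass matrix): for a 1-D Gaussian with standard deviation sigma and unit mass, leapfrog is stable only for step sizes below about 2 * sigma, and with O(1) velocities each step moves the position by about dt, so crossing a range of width L takes roughly L / dt steps.

```python
sigma_tropics = 0.5    # assumed posterior sd for a tropical station (73F-74F swing)
range_desert = 55.0    # day/night range at a desert station (55F to 110F)

dt = 0.1                                   # comfortably below the ~2*sigma stability limit
steps_per_desert_traverse = range_desert / dt
print(f"stable step size      ~ {dt}")
print(f"steps to cross desert ~ {steps_per_desert_traverse:.0f}")   # several hundred
```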
Worse yet, starting at a random initial condition you will be FAR out of equilibrium. Let’s say you’ve got T = 8305F in the tropics; in this tightly constrained dimension you will experience extreme forces, causing you to gain extreme velocities as you head towards T = 74. At these high velocities, to avoid overshooting T = 74 and flying deep into divergent territory, you need the spatial step dt * v to stay around 0.1, and v may be in the thousands instead of O(1), so just getting into the typical set may take you something like 400,000 time steps, i.e. a few hours, before you can even reliably start to estimate a mass matrix. Now, if you force Stan to use a short treedepth, the frequent momentum resampling bleeds the excess kinetic energy out of the system, which helps a lot in getting into the typical set since the velocity will remain O(1), or at least O(10) or the like; but then the treedepth will be insufficient later in the computation. Still, it’s a decent way to get initialization points.
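Roughly, the arithmetic behind that step-count figure is this (with my own made-up numbers, so take the exact counts loosely): falling down a quadratic well of width sigma from a displacement D converts potential energy into kinetic energy, giving a velocity of about D / sigma, and the dt * v ~ 0.1 constraint then sets the step size.

```python
sigma_tight = 0.5                   # assumed posterior sd of the tropics temperature
q_init, q_typical = 8305.0, 74.0    # absurd random init vs. where the posterior mass is
D = abs(q_init - q_typical)

v = D / sigma_tight                 # velocity picked up falling into the tight well
dt = 0.1 / v                        # spatial step per leapfrog step must stay ~0.1
print(f"velocity gained    ~ {v:.0f}")     # tens of thousands, not O(1)
print(f"required step size ~ {dt:.1e}")
print(f"steps just to coast back across the initial displacement ~ {D / 0.1:.0f}")
```

With these numbers it lands in the hundred-thousand-step ballpark, which is why warmup can eat hours before the mass matrix adaptation has anything sensible to work with.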
Unfortunately, with say 2000 different regions you’re trying to model, you really have no idea ahead of time what the posterior distribution of each one looks like; if you did, you wouldn’t need Stan! And so manually rescaling the parameters to give Stan in-code hints about the mass matrix, which would help a lot here, is not practical. In essence, the low-scatter “tropics” parameter is like a very heavy molecule, one that doesn’t gain crazy velocity even with enormous forces on it. Whereas in a real physical simulation we are not free to set the masses and the potential binding energies, the Stan version is entirely invented, so we could in fact just assign a high mass to this tropical parameter if we knew we should be doing so.
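Here’s a tiny sketch of why the right mass matrix would fix this, assuming (unrealistically) independent Gaussian marginals with scales I just made up: giving parameter i a mass of 1/sigma_i^2, which is essentially what a well-adapted diagonal metric or an in-code rescaling amounts to, makes every coordinate oscillate at the same frequency, so a single step size serves them all.

```python
import numpy as np

sigma = np.array([0.5, 14.0, 3.0, 120.0])   # made-up posterior scales, "tropics" to "desert"

# For H = p^2/(2m) + q^2/(2 sigma^2), the oscillation frequency is 1 / (sqrt(m) * sigma).
freq_unit_mass   = 1.0 / (np.sqrt(1.0) * sigma)             # identity mass matrix
freq_scaled_mass = 1.0 / (np.sqrt(1.0 / sigma**2) * sigma)  # m_i = 1 / sigma_i^2

print("frequencies, unit masses:      ", freq_unit_mass)    # spread over orders of magnitude
print("frequencies, m_i = 1/sigma_i^2:", freq_scaled_mass)  # all equal to 1
```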
The initialization problem can sometimes be improved by starting from something like the maximum a posteriori estimate, the VB estimate, the sample means, or whatever, but this still leaves us with fast-slow dynamics during the trajectory. Once Stan gets a decent estimate of the posterior variances it can estimate a mass matrix, but with widely varying scales, unless we let Stan run for quite a long time those scale estimates are very poor for the dimensions that are not yet in equilibrium.
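One semi-manual workaround along those lines, sketched below with stand-in data and hypothetical names: run something cheap first (VB, a short pilot chain, whatever), turn its per-parameter variances into a diagonal inverse metric, and hand that plus the pilot means to the real run. The {"inv_metric": [...]} JSON layout is what CmdStan expects for a metric file, but double-check the argument names against your interface’s docs.

```python
import json
import numpy as np

# Stand-in for draws from a cheap pilot run (VB, short chain, etc.); in real use
# these would come from that approximation, not from a random number generator.
rng = np.random.default_rng(1)
pilot_draws = rng.normal(scale=[0.5, 14.0, 3.0], size=(200, 3))

inv_metric = pilot_draws.var(axis=0)        # posterior variances ~ diagonal inverse metric
with open("pilot_inv_metric.json", "w") as f:
    json.dump({"inv_metric": inv_metric.tolist()}, f)

# e.g. with CmdStanPy (hypothetical model/data objects; verify argument names for your version):
# fit = model.sample(data=data, metric="pilot_inv_metric.json",
#                    inits={"theta": pilot_draws.mean(axis=0).tolist()})
```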
In the end, for problems involving thousands of parameters, some of which are highly identified and others not, strategies for coping with this issue are very welcome.
Any thoughts on addressing this issue?