Concentration of divergences

In programmatic terms, a divergence is flagged when the difference between the current value of the Hamiltonian and its value at the start of the trajectory exceeds some constant. I think the threshold is currently set very liberally, so only really extreme divergences trigger it.
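Concretely, that check can be sketched like this (the function name and threshold value here are illustrative, not Stan's actual implementation):

```python
# A hypothetical version of the divergence check (names and threshold are
# for illustration only, not Stan's actual code).
MAX_DELTA_H = 1000.0  # assumed threshold for illustration

def is_divergent(H_current, H_initial, max_delta=MAX_DELTA_H):
    # The Hamiltonian is the negative log density plus kinetic energy, so a
    # large increase over the trajectory's starting value signals that the
    # integrator has blown up, rather than ordinary integration error.
    return (H_current - H_initial) > max_delta

print(is_divergent(H_current=1500.0, H_initial=0.0))  # an extreme excursion
print(is_divergent(H_current=10.0, H_initial=0.0))    # ordinary integration error
```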

If the Hamiltonian diverges, random walk Metropolis is almost certainly going to fail. Just imagine the neck of the funnel: you'll never get anywhere without tuning to the curvature at every point you visit. Rather than throwing out the geometry and going to Metropolis, you are better off going to Riemannian HMC, which can adapt to the geometry locally.

Yes, the compiled C++ class representing the model evaluates the log density for double or any of our autodiff types. If you plug in double, no autodiff.

If there was anything we saw that we thought would work, we’d be investigating it.

Thanks Bob, this was actually the gist of my question.

Hamiltonian MC diverges precisely because when it takes a large step in a narrow funnel, it accidentally penetrates into a region of high potential energy, and then gets kicked by a mule out to infinity. I suspect the biggest problem will be not so much that there's a lot of probability mass inside the funnel, but that approaching the funnel from one side will keep you from getting to the probability mass on the other side, and vice versa.

My plan is to propose a set of starting points for Metropolis that are near equilibrium. Get the set of starting points from an HMC-like construct, but relax the requirements of that construct so it samples close to uniformly on the typical set.

My strategy for divergences is nonlinear viscosity. If we get hit by a sledgehammer by stepping too far into a high-potential-energy region, we'll get shot out at high velocity, but the viscosity will suck the energy out so we don't wind up at infinity. The same goes for the opposite case: if we accidentally get too much energy sucked out of the system and fall down into the well, we'll use that as a signal to resample the momentum so the total energy is back in the typical-set range. In the end, all the failures to do HMC properly will tend toward gaining entropy. If I constrain to the typical set and gain entropy with time, I'll wind up sampling effectively uniformly on the typical set.

Then, once we have a large number of initial points spread evenly around the typical set, sub-sampling proportional to the density will give a sample close to equilibrium, and chaining forward with simple Metropolis steps will let me detect convergence. At least, all this is in theory. It's a strategy specifically designed for situations where Stan takes a long time and still has divergences, so that I don't know what I'm getting from Stan anyway.

It's all theory right now. But I'm working on an example problem in Julia and learning quite a bit from it. Since my system doesn't try to preserve the Hamiltonian, but instead accepts noise in the whole process, I can use SPSA to calculate approximate gradients, so for my test case I don't have to go full autodiff.
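For reference, here's a minimal Python sketch of a one-sample SPSA gradient estimate (my actual experiments are in Julia; this just shows the idea):

```python
import numpy as np

def spsa_gradient(f, x, c=1e-3, rng=None):
    """One SPSA estimate of grad f at x: perturb every coordinate at once
    with a random +/-1 vector and take a single two-sided difference.
    Only two evaluations of f regardless of dimension; each estimate is
    noisy, but averaging many of them recovers the gradient."""
    if rng is None:
        rng = np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=len(x))
    diff = f(x + c * delta) - f(x - c * delta)
    return diff / (2.0 * c * delta)

# Averaging many noisy estimates on f(x) = sum(x^2) approaches the true
# gradient 2x:
rng = np.random.default_rng(0)
x0 = np.array([1.0, 2.0])
estimates = [spsa_gradient(lambda z: float(np.sum(z ** 2)), x0, rng=rng)
             for _ in range(2000)]
print(np.mean(estimates, axis=0))  # close to the true gradient [2, 4]
```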

Riemannian HMC uses the curvature to do the kinds of adjustments you're talking about, to second order.

What's the current estimated timeframe for RMHMC in Stan? I'm sure you're right that it would be much better. I assume the big hold-up is second-order autodiff, and the big advantages are large step sizes in regions of low curvature and essentially no mass matrix estimation needed. That would be awesome, but it seems like the infrastructure needed to do it right is pretty nontrivial.

Here is an example of the parallel coordinates plot.

Divergences are shown in green.

The top plot contains non-normalized samples and the bottom plot density-equalized samples.


That is pretty nifty!

I'm trying to understand what that plot is telling me, and it looks like it's saying that all the divergent transitions have tau ~ 0.

What did you do to "density equalize" the bottom plot? Just subtract the min and then divide by the max so that the samples all range from 0 to 1?

One nice thing here is that I could actually imagine doing this for ~2000 parameters, or at least something similar, whereas 4 million pairs plots is out of the question.

I transformed the sampling density to uniform with the inverse CDF.

import numpy as np

def equalize_arr(arr):
    """Equalize array of shape (n, m),
       n = number of samples,
       m = number of parameters."""
    n, *m = arr.shape
    ravel = False
    if not m:
        arr = arr[:, None]
        ravel = True
    ind = np.indices(arr.shape)
    sort_order = np.argsort(arr, axis=0)
    sort_back_order = np.argsort(sort_order, axis=0)

    sorted_arr = arr[sort_order, ind[1]]
    sorted_csum = (sorted_arr - sorted_arr.min(0)).cumsum(0)
    sorted_csum /= sorted_csum[-1, :]
    equalized_arr = sorted_csum[sort_back_order, ind[1]]
    if ravel:
        equalized_arr = equalized_arr.ravel()
    return equalized_arr

One can do other kinds of normalization too, but if the sample space is closer to log than linear scale, they can show all the values concentrating near 0.
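For example, here's a quick illustration with made-up log-normal draws of why a rank-based transform (which is what the inverse-CDF equalization effectively does) spreads log-scale values out evenly, while plain min-max scaling piles them up near 0:

```python
import numpy as np

# Made-up log-normal draws spanning several orders of magnitude:
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=2.0, size=1000)

minmax = (x - x.min()) / (x.max() - x.min())      # most mass piles up near 0
ranks = np.argsort(np.argsort(x)) / (len(x) - 1)  # uniform on [0, 1] by construction

print(np.median(minmax))  # tiny: the scale is set by a few huge draws
print(np.median(ranks))   # 0.5: ranks are spread evenly
```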

Plotting code

import numpy as np
import matplotlib.pyplot as plt

extract = fit.extract(permuted=False, inc_warmup=False)
ext = extract.reshape(-1, extract.shape[-1], order='F')

divs = np.concatenate([item['divergent__'][fit.sim['warmup']:] for item in fit.get_sampler_params()])
divs_bool = divs.astype(bool)
ext_eq = equalize_arr(ext)

div_samples = ext[divs_bool, :]
nondiv_samples = ext[~divs_bool, :]

div_samples_eq = ext_eq[divs_bool, :]
nondiv_samples_eq = ext_eq[~divs_bool, :]

plt.style.use('ggplot')
plt.plot(nondiv_samples.T, color='k', lw=1, alpha=0.02);
plt.plot(div_samples.T, color='lime', lw=1, alpha=0.1);

plt.xticks(np.arange(ext.shape[-1]), fit.sim['fnames_oi'], rotation=90)
plt.grid(False)  # due to ggplot style


plt.plot(nondiv_samples_eq.T, color='k', lw=1, alpha=0.02);
plt.plot(div_samples_eq.T, color='lime', lw=1, alpha=0.1);

plt.xticks(np.arange(ext.shape[-1]), fit.sim['fnames_oi'], rotation=90)
plt.grid(False)  # due to ggplot style


RHMC is in the released C++ version of Stan now, but it’s not available in any of the interfaces.

All the C++ library functions are fully tested for higher-order autodiff, but we haven’t tested all the Stan functions thoroughly enough for us to want to release it.

This is fantastic.

It would be even better on the unconstrained scale (where that tau = 0 is probably really log tau -> -infinity), but that's not yet easy to produce.

The [0, 1] is odd, but clearly easier to scan. Would standardizing be better (subtract mean and then divide by standard deviation)?

I’m still trying to understand the prominent sawtooth pattern.

@betanalpha — see any problems with this?

Assuming not,

@jonah — consider this a feature request for BayesPlot and ShinyStan.


@Bob_Carpenter @ahartikainen Yeah, I think this offers a perspective that we really don't get from any of the diagnostic plots we typically make. I don't see any reason the same thing couldn't be wrapped up into a convenient R function for bayesplot. Unless anyone sees a problem we're overlooking, I will open an issue so it officially gets on the to-do list.


My concern here is the false-positive aspect. We can talk ourselves into tau being bad because we already know it should be bad, but if you looked at the plot without any prior bias, would you think the other parameters look okay? Why should they be considered okay here? It seems like the non-divergent iterations bifurcate whereas the divergent ones don't, so it's not like they're consistent or anything.

Also let me emphasize again that in practice it's usually the correlation plots that help you identify what part of the model is causing trouble, so I'm not sure how useful this plot will be. Ultimately, the real test will be applying it not to the 8 schools model but to new problems, and seeing if it ends up being helpful.

Yes, one could do that. The only problem is that some values are closer to log scale, so it would look like almost all the points are in one place.

Me too. Is it an artifact from the transformation or a real phenomenon?

I agree with this. In my opinion, this plot should be (if used) the first step to identify problematic areas.

This plot should be tested a bit more to see if it is really useful or not.

Also, do notice that the normalization function assumes there are no duplicate points in the sample. If there are, they will be shifted relative to each other due to the cumulative sum. Not really a problem if one only does visual stuff (n > 1000), but it should not be used anywhere else.
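Here's a tiny one-dimensional demonstration of that tie-breaking behavior, using the same cumulative-sum logic as the function above:

```python
import numpy as np

# The two duplicate values 2.0 come out at different equalized positions,
# because the cumulative sum breaks the tie by sort position.
arr = np.array([1.0, 2.0, 2.0, 3.0])
sort_order = np.argsort(arr)
sort_back_order = np.argsort(sort_order)
s = arr[sort_order]
csum = (s - s.min()).cumsum()
csum /= csum[-1]
eq = csum[sort_back_order]
print(eq)  # the duplicates map to 0.25 and 0.5, not the same value
```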

You can see some of that from the lines through the graph. In fact, I think that’s the advantage of this. You can sort of see it all at once with the green lines through the black/grey.

Isn’t the boundary behavior going to be fishy?

Yo Ari, I got a model for you to test out. Tell me what’s causing my divergences! I know how to fix it, so your challenge is to figure that out.

Link here:

The file is 'westbrook.R'. The data gets loaded and the model executed in the first 16 lines. The rest is diagnostics and plots. As the model is configured, you should get ~50 divergences (4 chains, 2000 iterations each).

It’s an approximate GP technique I was testing out. The story is:

I took Russell Westbrook's made/missed shots from the 2015 season (or at least most of it; not sure if the dataset is complete). It's an r/nba thing to talk about how in 4th quarters of games Westbrook turns into Worstbrook (as opposed to Bestbrook), so the idea is that maybe we could estimate Westbrook's shooting percentage not as a single number, but as a function of game time (and see the difference between Bestbrook and Worstbrook).

(it’s from this thread, but don’t read too far, cause this has the answer in the text: Approximate GPs with Spectral Stuff)

It took me a while to work out where the divergences were coming from, because I wasn't sure if it was the approximation that was causing the divergences, or something weird with the data (such that an exact model would have diverged too), or something else. Anyway, it seems like a good level 2 crackme. (edit: and I sure would have appreciated something more than pairs plots when I was debugging this)


Re the prominent sawtooth pattern: each of the thetas is probably a mixture of Gaussians through the mu mean and tau sd. Let mu and tau be fixed. Then, as a mechanical analog, the thetas are independent harmonic oscillators with equal periods and random phases. In essence, balls on springs hanging from a stiff metal plate, bouncing up and down.

Now, un-fix mu: instead of dangling the plate from the ceiling on a rope, dangle it from a spring. Then the harmonic oscillators are all coupled together; their motion affects the motion of mu. When the oscillators all pull downward, the plate moves downward; when they all push upward, the plate moves upward. The coupling through mu causes partial synchronization of the oscillators: to keep mu roughly constant, about half of them need to be going up while the other half are going down. If mu has smaller posterior sd than the thetas, which it should (about 1/sqrt(8)), then for a diagonal mass matrix it has 8 times the mass, and its velocity, which is p/m, is 1/8 as much, so it's roughly constant relative to the oscillations of the thetas.

This is all a guess. But I think it makes reasonable sense. I now think we need a Stan T-shirt with 8 balls dangling from a plate dangling from the ceiling.

Tau here corresponds to the stiffness of the springs. When tau is small, you don't stretch the springs much compared to the equilibrium position, so your oscillation amplitude is small. When tau is large, the springs are soft and you can wiggle around a lot. So, imagine you're going along with a ton of balls wiggling on springs, and then suddenly all the springs freeze up into hard, stiff steel rods (tau goes to 0). BANG: all the balls need to snap back to the center forcefully. If you're simulating this, you need to accelerate the balls dramatically and then move them with a LOT of velocity. For any reasonable step size you will be unable to get your balls back to the right place; they will overshoot, which corresponds to stretching a steel rod far more than it would stretch in reality, which corresponds to applying a restoring force on the balls that is way, way too large, which corresponds to turning around and overshooting again, compressing the rod by far too much, which corresponds to turning around... and diverging out to infinity after a few cycles.
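That snap-back is easy to reproduce numerically. Here's a sketch (not Stan's integrator) of fixed-step leapfrog in a two-dimensional Neal's funnel, v ~ N(0, 9) and x ~ N(0, exp(v)): a step size that works fine in the wide part of the funnel produces an enormous energy error for a trajectory started in the neck, exactly the kind of excursion that gets flagged as a divergence.

```python
import numpy as np

def U(q):
    """Potential energy = negative log density of the 2-D funnel."""
    v, x = q
    return v ** 2 / 18.0 + v / 2.0 + x ** 2 / (2.0 * np.exp(v))

def grad_U(q):
    v, x = q
    return np.array([v / 9.0 + 0.5 - x ** 2 / (2.0 * np.exp(v)),
                     x / np.exp(v)])

def max_energy_error(q, p, eps, n_steps):
    """Run fixed-step leapfrog; return the worst |H - H0| on the trajectory."""
    H0 = U(q) + 0.5 * p @ p
    worst = 0.0
    for _ in range(n_steps):
        p = p - 0.5 * eps * grad_U(q)
        q = q + eps * p
        p = p - 0.5 * eps * grad_U(q)
        worst = max(worst, abs(U(q) + 0.5 * p @ p - H0))
    return worst

# Started in the neck (v = -5: stiff rods), a moderate step size blows up:
print(max_energy_error(np.array([-5.0, 0.1]), np.array([0.0, 1.0]),
                       eps=0.5, n_steps=20))   # huge energy error
# A much smaller step keeps the energy error small:
print(max_energy_error(np.array([-5.0, 0.1]), np.array([0.0, 1.0]),
                       eps=0.01, n_steps=20))  # small energy error
```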

Finally, imagine that in addition to the spring forces on the balls, there are winds blowing around, so that the ball trajectories are randomly perturbed by additional forces. These additional forces are the forces from SPSA approximate gradients in my discussion here:

Now imagine each ball internally has some kind of computer mechanism and some mass it can wiggle around inside the ball using battery energy. We let it wiggle with random normal perturbations at regular, extremely fast intervals. This is the random noise I added to the momentum step in my sampler. Finally, we have all the ball computers wired together so they can detect their total energy. When their total energy drifts a little too low, they run their internal mechanisms, adding some velocity to all the balls in the direction of current travel, proportional to the current velocity. When the total energy is too high, they all apply a viscous drag on their springs to slow themselves down. This is the control viscous force.

Next, we take snapshots of the system at regular time intervals, but throw away all the snapshots where the system wasn't within epsilon of the total energy it had at the beginning of the run… Finally, at the end of all of it, we randomly choose one of the remaining snapshots with probability proportional to exp(-TotalPotentialEnergyInSprings)…

That's a mechanical description of the momentum-diffusion approximate HMC I used in the linked thread above.

As someone who had no idea what was being plotted, tau was the one parameter that jumped out at me as being weird. Make of that information what you will.
