Concentration of divergences


#1

In case of models with many parameters, it would be nice to get some automated estimate about the amount of concentration and in which direction, ie which marginal posteriors to check for the potential hints how to change the model or priors. I guess simple mean and variance compared to full sample mean and variance would be a simple start. I was also thinking of simple classification methods, ie, if classification method can predict which draws are flagged divergent, then they are concentrated and by analysing which parameters influence the classification prediction we could see which parameters to investigate. I don’t propose this as a formal method, but just as a way to make it faster to find where the problem is, and it’s still the modellers responsibility to get rid of divergences.

Aki


Trying to do hierarchical nonlinear model in Stan; it's slow and probably wrong
#2

Univariate marginals aren’t very useful for identifying problems – it’s not until you look at bivariate marginals that you start to see interesting structure. So then you need some means of comparing the distribution of non-divergent and divergent samples across each pairs plot. Trying to automate this then runs into issues because even for pathological models the number of divergences is small enough to make tests based on summary statistics noisy. I was never able to get something that was fast enough to check against all N(N-1)/2 pairs and accurate enough to automatically detector problems without too many false positives.

But people should experiment and suggest methods that have any empirical success.


#3

Even just something that could order pairs by a) variance of divergences vs non, and b) quantile of mean of divergences in non diverging distribution, would be pretty helpful I think. I’ll do it for R at ‘some point’, if nobody else gets to it first or has better suggestions…


#4

Parallel coordinates plot could work (maybe after normalization to uniform and with some clever sorting of parameters). And from there one could choose valid pairs to investigate further. Also ternary diagram replacing pairplot works, if one is familiar with them.


#5

The biggest problem I’ve had is that in many of my models with divergences the number of parameters is in the thousands, for a model with 5000 parameters, ~ 25 Million pairs need to be considered… good luck with doing that manually.


#6

Yeah… that’s why the automated idea appealed to me. It rattled around my head so I coded up a simple version. It returns a list with bivariate interactions sorted by a) (mean of divergences - mean of normal samps) / sd of normal samps, or b) sd of dvg / sd of normal. Seems helpful and reasonable in my single test case so far, but is kind of slow with many parameters… 15min for 1000*1000 samples. for any testing use the nupars argument!

stan_checkdivergences.R (2.4 KB)


#7

Since this topic seems to interest so many, but I don’t have time for further experiments you can find my quick experiment with logistic regression here https://www.dropbox.com/sh/14p9sagxq4jlh6r/AAD1Sy7nTl1PQlhDituziaBFa?dl=0
No need to look pairs, when you can look all variables at once and then focus to which ones are relevant. Good be made faster by not using MCMC, but this was quicker to make.

Aki


#8

https://imgflip.com/i/1u9jll


#9

Aki, how long would it typically take to fit the logistic regression to say 200 samples with 5000 parameters?

Also, all, can anyone tell me the precise technical definition of “divergent transition?” Is it just that the new location after the transition doesn’t have the same total energy as the previous location? Something else?


#10

A divergence occurs when the level sets of the shadow Hamiltonian are no longer compact. See the conceptual intro paper.


#11

I would not use so few samples. I would assume at least the default 4000. With 4000 samples and 5000 covariates rstanarm with MCMC is slow, but it’s likely that faster approximations could be used. But first we would need to test whether this is even useful idea in a smaller problems (but bigger than 8 schools) and then consider speed-up.

Aki


#12

Ok, so it’s not just that the integrator fails to preserve the Hamiltonian, but that the forward propagation results in unbounded error in the Hamiltonian? I assume this also means unbounded error in the kinetic energy, as if your kinetic energy goes to zero, you stop moving, so your error should not diverge in an unbounded manner.

Thinking about an example / physical analog. Suppose you have a 1 d ball bouncing around in a 1d box… the walls of the box are at 0 and 1 where the potential energy goes rapidly to infinity. If you have a lot of kinetic energy heading towards 1, and you discretize time, you could potentially jump over the infinity at x=1 and “tunnel” through to the other side of the infinite potential pole. Now, on the other side, you might have no potential at all, or even one that decreases continuously, so you start to fly forward forever, maybe accelerating.

Similar things might occur with something like a Lennard-Jones potential, where you get a little too close to zero distance due to discretization error, and then achieve sufficient energy to have the molecule explode out of the material

If that’s what occurs in a divergence, then I can certainly see the problem.

Here’s a thought based on some of my other ranting on my Momentum Diffusion HMC thread:

HMC is a great way to explore a wide range of possibilities, traveling across the landscape nicely. But, it has challenges, and sometimes we can’t seem to eliminate divergences. Suppose you have a problem where you can’t eliminate divergences. Still, as a first pass, an ensemble of HMC samples, even with divergences, gives sensible inferences, nothing looks “too far” from correct (like your parameters don’t go to infinity inappropriately, etc)

Now suppose you have some approximate, but divergent HMC sequence. Next, you choose one location at a time, and do several random-walk metropolis transitions. Do this repeatedly N times then output the final result and move to the next location in the HMC sample. The HMC ensemble which is biased by divergences should relax towards a better sample. The random-walk metropolis doesn’t suffer from the divergence problem as it has no momentum. In regions where the HMC wound up with erroneous levels of potential energy, the random-walk metropolis will anneal down to the appropriate level. In regions where HMC was doing fine, you just add additional noise which should potentially reduce auto-correlation.

If this idea has merit, would it be hard to create a Stan postprocessor which read in a Stan CSV file and carried out these postprocessing steps? Could you do it without all the auto-diff stuff to make it faster?


#13

Maybe even better because it could be implemented in parallel: use HMC in an umbrella sampling scheme. Have one chain doing diffusive metropolis sampling, and one chain doing some kind of approximate HMC (both operating on different cores). At each step, at random, propose to switch the coordinates between the chains. The HMC as long as it was even approximately sampling from the posterior would feed the diffusive metropolis new locations very frequently. On the other hand, the difficulties of doing correct HMC without divergences and soforth would be mitigated because as long as the approximate HMC had a stationary distribution, the correctness of the diffusive metropolis samples would be proven by the methods already used in diffusive umbrella sampling.


#14

You could even intentionally take lp__ / (1+epsilon) in the HMC sampler, and have epsilon be a tuning parameter to balance reducing divergences (by flattening the lp__ with larger epsilon) and maintaining a good chain-switching acceptance ratio by keeping epsilon as small as possible.

If you can’t easily tear out the autodiff stuff, the diffusive chain could use Langevin dynamics, utilizing the gradient that gets calculated anyway.


#15

Ok, further thoughts:

Suppose we have a scheme for sampling locations from a target distribution p(x), using some sort of HMC like process, but it has divergences, and/or it involves smoothing out regions where divergences occur, or it involves adding noise to the system in the process of reducing computational cost, or the like. In the end, we expect that it produces a biased sample, but we don’t know exactly what happens.

Suppose then that it samples from a density proportional to p(x)b(x) where p(x) is our desired target density, and b(x) is a bounded non-negative function biasing the distribution.

Let’s further suppose that the bias is in some sense small. Maybe in terms of a Kullback-Liebler divergence or an expectation of b(x) under p(x) or the like.

Now, suppose we have an exact, but potentially computationally expensive Markov process T(x,i) which converges to p(x) exactly and doesn’t suffer from the biases (for example, random isotropic normal proposal diffusive Metropolis) but also doesn’t give as broadly distributed a sample in the same wall time. It moves like a random walk.

Let {x_i} be a sample from the original HMC process, then:

{T(x_i,j)} is a family of samples which converges as j increases to a sample from p(x)

Now, suppose we have a family of expectations we are interested in, for example, mean and sd of each parameter, or the means of certain group level hierarchical parameters, or the mean lp__ or what have you. Our concern is to eliminate biases in these inferences caused by the difficulties experienced in sampling using our original sampler.

So, we run {T(x_i,j)} forward several steps in j, calculate the expectations, fit a linear regression to each sequence of expectations, and test to see if the ensemble of trend-lines through the expectations can be rejected as a sample from a jointly normal distribution with mean 0 vector and sd of each entry estimated from the se of each estimate. If the trends in expectations stay statistically “flat” for several iterations of this process, we assume that we have a sufficiently converged sample.

Now, how does this help us?

Suppose HMC wants small step sizes, and has divergences, as is the case in models such as the ones @Charles_Driver and I have been running.

One way to deal with this is to set adapt_delta very high, say 0.99999 and take very small step sizes, use very long treedepth, run a long adaptation to get a decent mass matrix estimate, and use a lot of wall clock time to get an exact sample from HMC. In my problems this could be literally days or weeks of computing to get a sample without divergences. Or, perhaps I can never get a divergence free sample.

Another way to deal with the issue, is to set adapt_delta low, say 0.8, set treedepth low, say 7, take a large number of iterations in less wall time (say 1000 iterations, treedepth 7), which costs 128000 function/gradient evaluations, compared with treedepth say 15 which would do the same number of evaluations in 4 iterations. And recognize that the sample is biased. Then, using an exact metropolis transition, evolve the entire ensemble until the expectations of interest flatten out.

Suppose that gradient evaluation takes 0.03 seconds, which is typically the kind of thing Stan spits out. Then 2000 iterations at treedepth of 12, takes 3 days. 200 iterations at treedepth of 7 takes 13 minutes. If I have a budget for say 15 minutes of computing time, I now can do 2 minutes of metropolis, and I don’t have to calculate any gradients. So let’s assume not doing gradients drops computational time to 0.003 seconds per function eval. I can now do 200 iterations of metropolis on each of the 200 samples.

Furthermore, there’s no reason I can’t parallelize the metropolis transitions, so I can read in N samples from a file, and then ask different processes to run metropolis chains on each one for K steps, and then output the new samples, lather, rinse, repeat. Wall clock time is close to say 2/4 minutes for 4 cores.

Now, I’m sure there are other very good ideas about how to get fast unbiased samples, such as Michael’s Adiabatic sampling etc. But, at the moment, as a practical way to handle divergences and small step sizes and soforth, does anyone see any major impediment to correctness or implementation of this sanitization procedure?

Also, @Bob_Carpenter, can the Stan libraries etc be made to produce a function that simply evaluates lp without evaluating the gradients? If it were possible to call a function that would read a Stan model, and compile me a C++ function that evaluates lp without gradients, I might actually be able to write this sanitizer thingy.


#16

In programmatic terms, it’s when the difference between the current log Hamiltonian and the original log Hamiltonian is greater than some constant. I think it’s currently set at a very liberal threshold, so only really extreme divergences trigger it.

If the Hamiltonian diverges, random walk Metropolis is almost certainly going to fail. Just imagine the neck of the funnel. You’ll never get anywhere without tuning to the curvature at every point you’re at. Rather than throwing out the geometry and going to Metropolis, you are better off going to Riemannian HMC, which can adapt the geometry locally.


#17

Yes, the compiled C++ class representing the model evaluates the log density for double or any of our autodiff types. If you plug in double, no autodiff.

If there was anything we saw that we thought would work, we’d be investigating it.


#18

Thanks Bob, this was actually the gist of my question.

Hamiltonian MC diverges precisely because when it takes a large step in a narrow funnel, it accidentally penetrates into a region of high potential energy, and then gets kicked by a mule out to infinity. I sort of suspect that the biggest problem will be not so much that there’s a lot of probability mass inside the funnel, but that approaching the funnel from one side will keep you from getting to the probability mass at the other side, and vice versa. My theory is to propose a set of starting points for Metropolis that are near equilibrium. Get the set of starting points from an HMC like construct, but relax the requirements of the HMC like construct to sample close to uniformly on the typical set. My strategy for divergences is nonlinear viscosity. If we get hit by a sledgehammer by stepping too far into high potential energy regions, we’ll get shot out at high velocity, but the viscosity will suck the energy out so we don’t wind up at infinity… the same goes for if we accidentally get too much energy sucked out of the system and fall down into the well, we’ll use that as a signal to resample the momentum so the total energy is back in the typical set range. In the end, all the failure to HMC properly will tend towards gaining entropy. If I constrain to the typical set, and gain entropy with time… I’ll wind up sampling from effectively uniform on the typical set.

Then, once we have a large number of initial points spread around evenly in the typical set, sub-sampling proportional to the density will give a sample close to equilibrium, and then chaining forward with simple metropolis steps will let me detect convergence… at least all this is in theory. It’s a strategy specifically designed to deal with situations where Stan takes a long time and still has divergences, so that I don’t know what I’m getting from Stan anyway.

It’s all in theory right now. But I’m working in Julia on an example problem and learning quite a bit from that. Since my system doesn’t try to preserve the Hamiltonian, rather accepting noise in the whole process, I am able to use SPSA to calculate approximate gradients so for my test case I don’t have to go full auto-diff.


#19

Riemannian HMC uses the curvature you need to do the kinds of adjustments you’re talking about to second order.


#20

What’s the current estimated timeframe for RMHMC in Stan? I’m sure you’re right that it would be much better. I assume the big hold-up is 2nd order auto-diff, and the big advantage is large step sizes in regions of low curvature, and basically no mass matrix estimation needed. That would be awesome, but it seems like the infrastructure needed to do it right is pretty nontrivial.