Getting the location + gradients of divergences (not of iteration starting points)


#1

When I was writing the non-identifiability blog post, I tried hard to think about where divergences are expected to occur. And while the divergences somewhat gathered in the areas I expected them to, they were also often marked quite far from those areas. I noticed that the Stan manual says:

ShinyStan provides visualizations that highlight the starting point of divergent transitions to diagnose where the divergences arise in parameter space.

So this means that if an iteration is divergent, Stan keeps the location in parameter space where the iteration started, possibly multiple integration steps before the actual divergence occurred? If this interpretation is correct, would it be possible to export the location of the start of the integration step that diverged? In addition, the gradient at this point (or, equivalently, the attempted step) could be stored. My best guess is that this would be a more illustrative diagnostic when overlaid on the plots of the samples, clearly outlining regions with problematic geometry.
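(As an aside on the gradient part: once a model is fitted, the gradient of the log density at any point can already be evaluated from R, so I think the sampler would only need to export the locations themselves. A small sketch, where `fit` is a fitted stanfit object and the parameter names/values are placeholders:)

```r
library(rstan)

# Map a named list of constrained parameter values to the unconstrained scale.
upars <- unconstrain_pars(fit, list(a = 0.5, b = -1.2))

# Gradient of the log posterior at that point; the log density itself is
# attached as an attribute.
g <- grad_log_prob(fit, upars)
g
attr(g, "log_prob")
```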

A typical example is this pairs plot:


I strongly suspect that divergences actually occur at the boundary of the sampled region, as I expect the curvature there to be much higher than in the middle of the sampled region (this is from the sigmoid model, which has only the two parameters shown).

If you don’t see any problems with this idea, I might try to add it to Stan. I imagine having a separate, optional stream for just this data.


#2

Yeah, I also came across this one. I think it is not ideal that only the starting points are currently reported, because these can be far away from where the divergence actually happened. However, from the perspective of the sampler it makes sense, as we throw away the failed trajectory and instead record the starting point as the new accepted draw (I hope I got that right). A crude work-around could be to limit the treedepth so that HMC cannot explore too far away from the starting node.
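For example, a minimal RStan sketch of that work-around (the model file, data list, and the particular treedepth value are just placeholders):

```r
library(rstan)

# Limit how far each trajectory can expand away from its starting point.
fit <- stan(
  file = "my_model.stan",      # placeholder model file
  data = my_data,              # placeholder data list
  control = list(max_treedepth = 6)
)
```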

I am curious what others say - I think you are right that this behavior doesn’t make it easy to spot the actual parameter configuration that triggers the hiccup (at least for some problems; for funnel-like things this is probably not so much of an issue, as being close to the funnel already spells trouble).


#3

It totally makes sense to store the iteration starting point in the samples. That’s why I’d suggest adding a separate output stream for divergence (and possibly other) diagnostics.


#4

We’ve talked about this before. I think it’s a good idea, but it’s deep in the sampler and doing it right would be a challenge given potential performance issues. Additionally, a divergence is not at a point; it’s based on accumulated error over a trajectory exceeding a threshold. You could return the trajectory and enable some analysis on it to highlight where things go off the rails.


#5

So I did a quick and veeeeery dirty hack to get the locations where base_hmc::divergent_ becomes true (ignoring the trajectory). And it looks helpful!

This is the same model as above: green triangles are transitions without divergence, blue squares are the diverging iterations as currently reported, and red circles are the “ends” of diverging transitions (the points where base_hmc::divergent_ became true). Now the sharp boundary where I would suspect the divergences to happen is clearly visible. And it is IMHO easy to get out of the sampler without any performance hit: the code gets executed only for divergent transitions and is basically the same thing that writes the normal transitions, so even if everything diverged, it would just double the I/O…


#6

When a divergence is identified the dynamic trajectory expansion terminates and a new state is chosen from the trajectory that has been built up to that point – it does not just take the initial point.

But more importantly, as @sakrejda notes, a divergence is not a point but a property of the trajectory itself. There is no discrete threshold below which the trajectory is fine and above which it’s bad – it’s the entire trajectory (really the entire modified energy level set) that’s pathological. We define an arbitrary threshold tuned to catch these divergent trajectories, but the location where the trajectory passes that threshold is not necessarily indicative of the neighborhood where the problem first arose.
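(For concreteness, the check is roughly that the Hamiltonian error relative to the start of the trajectory exceeds a fixed cap,

$$ H(q, p) - H(q_0, p_0) > \Delta_{\max}, $$

with a default cap of 1000 – so the point where a given trajectory crosses that line is just where an already-pathological trajectory happened to get caught, not where the pathology lives.)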

This is why the recommendation is to focus on where the divergent trajectories concentrate, for which our best proxy is observing the samples that came from divergent trajectories.

Trying to extract more information about the divergent trajectories is straightforward, but there’s no place to put that information. In particular, you cannot replace the carefully-chosen sample with an arbitrary point from the divergent trajectory, as that would immediately break detailed balance, and we don’t have any additional output channels where dynamic information like this could be dumped.


#7

Do you have a (preferably simple) model where you would expect the “last drop” point to not be informative? I can now (with some minor pain) run it through this hack to see whether that is actually the case. Saving the whole trajectory might be difficult to do without impacting performance, but maybe saving the last point could be useful with little overhead?

I imagine this could be implemented as an additional but optional output stream (not mixed with the samples), so detailed balance would not be an issue.


#8

@martinmodrak Do you have this code on Github somewhere?

Is this really a good idea though? I’m disinclined to believe anything about the posterior when there are divergent transitions. It’s my understanding that the statistics are practically dead at that point (unless we could run the sampler to infinity or whatever), and in that case interpreting the behavior of the integrator as just an integrator seems easier.

I guess this example is a little easy in the sense that there are two parameters and Martin already had an explanation for what was happening going in. It’s hard to argue the current diagnostics (in his first post) have much helpful info in them in this case.

I’m pretty sure my success rate at helping people get rid of divergences is lower than my 3pt%, and I’m terrible at basketball.

Anyway if it’s a software engineering problem, I get it, but I’m not buying that the current state of things is good.


#9

In the HMC diagnostics section of “Visualization in Bayesian workflow” we give an example of a situation where divergences aren’t such a terror. Sometimes there are “false positive” divergences that don’t indicate pathologies. Persistent clustering of divergences is always a bad sign, though.


#10

I’ve pushed it to a new feature/extended-divergences-reporting branch (I hope it’s okay to have a branch with hacks in the main repo). It simply creates a file called “divergences.csv” in the current directory and writes the last step of the divergent transitions there (so don’t run multiple chains at once).
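To overlay it on the regular draws, something like this should work (a sketch only: it assumes divergences.csv gets a header with the same parameter names as the regular CmdStan output, and samples.csv, mu and tau are placeholder names):

```r
library(ggplot2)

# Regular draws written by CmdStan (comment lines start with '#').
draws <- read.csv("samples.csv", comment.char = "#")

# Last points of the divergent transitions written by the hacked sampler.
div_ends <- read.csv("divergences.csv", comment.char = "#")

ggplot(draws, aes(x = mu, y = tau)) +
  geom_point(alpha = 0.2) +
  geom_point(data = div_ends, colour = "red", shape = 17, size = 2)
```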


#11

Divergences indicate that the chains are not geometrically ergodic and MCMC estimators do not satisfy a central limit theorem, but the chains are still valid Markov chains. Divergences are not an excuse to do whatever we want to the chains, especially when people consistently ignore them.

The default RStan output is not particularly helpful at identifying the behavior of divergences when they concentrate near a boundary, as the individual points cluster on top of each other and it becomes difficult to see the distribution. Redoing the plot with transparency, the concentration of divergences near the boundary is already evident from the existing samples, and it doesn’t require plumbing through another diagnostic output.
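For example, something along these lines (a sketch only; `fit` is the fitted stanfit object and `a`, `b` are placeholder parameter names):

```r
library(rstan)
library(ggplot2)

# 0/1 divergence flag for each post-warmup iteration, chains stacked one
# after another.
sp <- get_sampler_params(fit, inc_warmup = FALSE)
divergent <- unlist(lapply(sp, function(x) x[, "divergent__"])) == 1

# Draws in the same chain-major order (all iterations of chain 1 first, ...).
arr <- extract(fit, pars = c("a", "b"), permuted = FALSE)
draws <- as.data.frame(do.call(rbind, lapply(seq_len(dim(arr)[2]),
                                             function(ch) arr[, ch, ])))
draws$divergent <- divergent

ggplot(draws, aes(x = a, y = b, colour = divergent)) +
  geom_point(alpha = 0.2)
```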


#12

I managed to run CmdStan with the hacked Stan to reduce the pain when experimenting.
This is what the diagnostic looks like for the centered 8-schools example (same as above: “sampled” are the divergences as currently returned by Stan, “end” are the last-drop points where divergent_ became true).

This looks a bit less nice, but IMHO it still shows more clearly that the problem is concentrated at small tau (note the log scale).

I also tried the second sigmoid variant from the post:

Here, the endpoints of the transitions nicely show that the problem is caused by w=0, whereas the current reporting doesn’t make that very visible (although it is clear that w=0 is special).

I am running out of diverging models I understand - do you have a model I should try?

Also, if desired, I think it should be possible to get the whole trajectories out of Stan without any performance compromises - the trick is to amend the diagnostic output to store the RNG state after each iteration. Having the parameters on the unconstrained scale + the RNG state should let me exactly reconstruct the iteration on demand. The sampler would just need a separate mode where it outputs the whole trajectory (or every n-th point or something).

I understand if this is something you don’t want to invest energy into now (including not wanting to invest energy in understanding it better or reading my rants :-)), or if you don’t like the idea of additional plumbing because the maintenance is already hell. But right now I feel that this additional diagnostic could be useful, and I am not convinced by the theoretical arguments when I have practical examples where it is illustrative.

If you are speaking about my first example, then I disagree. You may note that for most of the boundary, there is a concentration of both divergences and non-divergences. And this is from personal experience: when I was first trying to solve this specific non-identifiability, it was within a larger model and I noticed the boundary, but since divergences were all over the place, I ignored it (and spotted the problem mostly by coincidence when playing with the math).


#13

@martinmodrak I realize you’re responding to Michael directly, so I’m just going to comment on the general situation you’re finding yourself in. We all (?) agree that there are a variety of ways of treating trajectories that contain divergent iterations. Your examples nicely show that sometimes you can get an improved understanding of the system by producing something other than what the sampler currently outputs.

I’m going to suggest an alternative path here, because the main issue in Stan development (and in lots of other OS projects) for good ideas is just developer time and priorities. I think there are two options (I’m partial to #2):

  1. Demonstrate that there’s a clearly superior diagnostic stream that can be generated for a wide range of models. The couple of models you have are a great prototype, but lots of things that had potential for inclusion in Stan (e.g. ADVI!) looked good to begin with but were really problematic long-term (a combination of implementation quality, theoretical issues, and maintenance burden). I realize that this is basically asking for a solid simulation study, so that’s not great.

  2. Get involved in sorting out the I/O issues. Getting diagnostics out of Stan is hard to do in a hack-free way right now because all the output streams are convoluted. We’ve talked a number of times (there’s stuff on a Wiki) about getting better support in the core C++ code for split outputs (samples, moments, gradients, etc…) and typed error/diagnostic/warning messages. If that scene were cleaned up, I think your suggestion basically reduces to doing some careful plumbing to avoid performance problems. I’ll try to make a specific proposal based on the discussion so far and put it up on the Wiki this weekend.

Other than those two options (maybe there are others?) I think this is going to be a frustrating process.


#14

@martinmodrak, re: output plumbing: https://github.com/stan-dev/stan/wiki/Design:-Consolidated-Output-for-Sample,-Optimize,-ADVI,-and-Diagnose

Specifically, check out what @syclik outlined for sampling outputs. (I think we have something like three writers right now that get plumbed in, and we likely need more - or we need one output object plus typed outputs so that a single class can send them to the correct place.)


#15

Yeah, that’s constructive, I like that, thanks :-) I had a bit of steam built up when writing my last post, so I guess I might have come across as demanding action from you - sorry for that. I will look into the plumbing issues and see if I can commit to helping sort these out. I certainly think that the current output is the best use of the limited capabilities. The divergence endpoints are IMHO useful, but only as additional output.

Also, if you happen to have some interesting diverging models+data, I’d be happy to collect them to have more examples and a better understanding of whether this diagnostic is worth the hassle.

I think you can also formulate a theoretical-ish argument for why the divergence endpoints look good in the examples above: we are trying to choose one point to represent a whole trajectory. Choosing one such that the chain stays valid means we are choosing points that are, in a sense, close to other points in the chain. Choosing the ends means choosing the ones that are farthest away. The latter may help you spot what is different about the problematic regions. You could also probably think about it as censored data.


#16

Don’t worry about it. It’s basically the iron law of the internet that it sucks for communication unless people already agree. Please do post if you look at the conversation so far and want to push it forward. It might be helpful to do a call focused on resolving this stuff so we can get a more fleshed out proposal to the group.

For a set of trajectories we could probably do some clustering to identify a region where the energy grows faster (or whatever metric gets thresholded, I can’t recall right now) rather than worrying about which single point to visualize.


#17

That was a cool blog post. I haven’t read all the posts in here yet, but just for fun I made a heatmap of the log-likelihood with samples from Stan’s VI overlaid. I guess this is why we have to be careful about inferring posteriors using VI.

[image: heatmap of the log-likelihood with VI samples overlaid]


#18

Great idea! I often find that I end up with a single-digit number of divergent transitions, and it can be hard to see where they are concentrating. Increasing \delta can get rid of the divergences, but it slows down the model, so I end up with a lot of trial and error to eliminate them. It looks like this gives a much better idea of where things go off the rails.
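(Assuming \delta here means the adapt_delta target acceptance rate, a minimal RStan sketch of that knob; the model and data names are placeholders:)

```r
library(rstan)

model <- stan_model("my_model.stan")     # placeholder model file

# A higher target acceptance rate forces smaller leapfrog steps, which tends
# to remove divergences at the cost of slower sampling.
fit <- sampling(model,
                data = my_data,          # placeholder data list
                control = list(adapt_delta = 0.99))
```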

I had some in one of the model parameterizations in my StanCon submission. I reparameterized it to get rid of them. But it would be interesting to see what your analysis shows.