Getting the location + gradients of divergences (not of iteration starting points)


#1

When I was writing the non-identifiability blog post, I really tried to think hard where divergences are expected to occur. And while the divergences somewhat gathered in the areas I expected them too, they were also often marked quite far from those area. I noticed that the Stan manual says:

ShinyStan provides visualizations that highlight the starting point of divergent transitions to diagnose where the divergences arise in parameter space.

So this means that if an iteration is divergent, Stan keeps the location in parameter space where the iteration started, possibly multiple integration steps before the actual divergence occurred? If this is correct interpretation, would it be possible to export the locations of the start of the integration step that diverged? In addition, the gradient at this point (or, equivalently, the attempted step) could be stored. My best guess is that this would be a more illustrative diagnostic when overlaid over the plots of the samples, clearly outlining regions with problematic geometry.

Typical example is this pairs plot:


I strongly suspect that divergences actually occur at the boundary of the sampled region as I expect the curvature there to be much higher than in the middle of the sampled region (this is from the sigmoid model that does have only the two parameters shown).

If you don’t see any problems with this idea, I might try to add it to Stan. I imagine having a separate, optional stream for just this data.


Proposal for consolidated output
Best way to identify pathologies from divergent transitions (general step by step workflow)
#2

Yeah, I also came accross this one. I think currently this is not so ideal to only get the starting points reported, because these can be far away from where it actually happened. However, from the perspective of the sampler is makes sense as we throw away the failed trajectory and instead record the starting point as the new accepted draw (I hope I got that right). A crude work-around could be to limit the treedepth such that HMC cannot explore too far away from the starting node.

I am curious what others say - I think you are right in that this behavior doesn’t make it easy to spot the actual parameter configuration which triggers the hickup (at least for some problems; for funnel like things this is probably not so much of an issue as being close to the funnel calls for trouble).


#3

IT totally makes sense to store the iteration starting point in the samples. That’s why I’d suggest adding a separate output stream for divergence (and possibly other) diagnostics.


#4

We’ve talked about this before. I think it’s a good idea but it’s deep the sampler and doing it right would be a challenge given potential performance issues. Additionally a divergence is not at a point, it’s based on accumulated error over a trajectory exceeding a threshold. You could return the trajectory and enable some analysis on it to highlight where things go off the rails


#5

So I did a quick and veeeeery dirty hack to get the locations where base_hmc::divergent_ becomes true (ignoring the trajectory). And it looks helpful!

his is the same model as above, green triangles are transitions without divergence, blue squares are the diverging iterations as currently reported, red circles are the “ends” of diverging transitions (points where base_hmc::divergent_ became true). Now the sharp boundary where I would suspect the divergences to happen is clearly visible. And it is IMHO easy to get out of the sampler without any performance hit, the code gets executed only for divergent transitions and is basically the same thing that writes the normal transitions, so even if everything diverged, it will just double the I/O…


Gaussian mixture model with interval censored data -- where do the divergences come from?
#6

When a divergence is identified the dynamic trajectory expansion terminates and a new state is chosen from the trajectory that has been build up to that point – it does not just take the initial point.

But more importantly as @sakrejda notes, a divergence is not a point but a property of the trajectory itself. There is no discrete threshold below which the trajectory is fine and above which it’s bad – it’s the entire trajectory (really the entire modified energy level set) that’s pathological. We define an arbitrary threshold tuned to catch these divergent trajectories but the location where the trajectory passes that threshold is not necessarily indicative of the neighborhood where the problem first arose.

This is why the recommendation is to focus on where the divergent trajectories concentrate, for which our best proxy is observing the samples that came from divergent trajectories.

Trying to extract more information about the divergent trajectories is straightforward but there’s no place to put that information. In particular, you cannot replace the carefully-chosen sample with an arbitrary point from the divergent trajectory as that will immediately break detailed balance, and we don’t have any addition sources where dynamic information like this could be dumped.


#7

Do you have a (preferrably simple) model, where you would expect the “last drop” point to not be informative? I can now (with some minor pain) run this through this hack to see, if it is actually the case? Saving the whole trajectory might be difficult to do without impacting performance, but maybe saving the last point could be useful with little overhead?

I imagine this could be implemented as an additional but optional output stream (not mixed with samples), so balance would not be an issue.


#8

@martinmodrak Do you have this code on Github somewhere?

Is this really a good idea though? I’m disinclined to believe anything about the posterior when there are divergent transitions. It’s my understanding that the statistics are practically dead at that point (unless we could run the sampler to infinity or whatever), and in that case interpreting the behavior of the integrator as just an integrator seems easier.

I guess this example is a little easy in the sense that there are two parameters and Martin already had an explanation for what was happening going in. It’s hard to argue the current diagnostics (in his first post) have much helpful info in them in this case.

I’m pretty sure my success rate at helping people get rid of divergences is lower than my 3pt%, and I’m terrible at basketball.

Anyway if it’s a software engineering problem, I get it, but I’m not buying that the current state of things is good.


#9

In the HMC diagnostics section of Visualization in Bayesian workflow we give an example of a situation when divergences aren’t such a terror. Sometimes there are “false positive” divergences that don’t indicate pathologies. Persistent clustering of divergences is always a bad sign though.


#10

I’ve pushed it to a new feature/extended-divergences-reporting branch. (hope it’s okay to have a branch with hacks in the main repo). It simply creates a file called “divergences.csv” in the current dir and writes the last step of the divergent transitions there (so don’t run multiple chains at once).


#11

Divergences indicate that the chains are not geometrically ergodic and MCMC estimators do not satisfy a central limit theorem, but the chains are still valid Markov chains. Divergences are not an excuse to do whatever we want to the chains, especially when people consistently ignore them.

The default RStan output is not particularly helpful at identifying the behavior of divergences when they concentrate near a boundary as the individual points cluster on top of each other and it becomes difficult to see the distribution. Redoing the plot with transparency the concentration of divergences near the boundary using the existing samples is evident already, and it doesn’t require plumbing through another diagnostic output.


#12

I managed to run cmdstan with the hacked stan to reduce pain when experimenting.
This is what the diagnostic looks like for the centered 8-schools example (same as above, “sampled” are the divergences as returned currently by Stan, “end” are the last-drop transitions where divergent became true).

This looks a bit less nice, but still IMHO shows better that the problem is concentrated for small tau (note the log scale).

I also tried the second sigmoid variant from the post:

Here, the transition ends nicely show that the problem is caused by w=0, whereas the current reporting doesn’t make that very visible (although it is clear that w=0 is special).

I am running out of diverging models I understand - do you have a model I should try?

Also, if desired, I think that it should be possible to get the whole trajectories out of Stan without any performance compromises - the trick is to amend the diagnostic output and store RNG state after each iteration. Having the params on the unconstrained scale + the RNG state should let me exactly reconstruct the iteration on demand. The sampler would just need a separate mode where it outputs the whole trajectory (or every n-th point or something).

I understand, if this is something you don’t want to invest energy into now (including not wanting to invest energy in understanding it better or reading my rants :-)). Or if you don’t like the idea of additional plumbing because the maintainence is already hell. But right now I feel that this additional diagnostic could be useful and am not convinced by the theoretical arguments, when I have practical examples where it is illustrative.

If you speak about my first example, than I disagree. You may note that for most of the boundary, there is a concentration of both divergences and non-divergences. And this is from personal experience: when I was first trying to solve this specific non-identifiability, it was within a larger model and I noticed the boundary, but since divergences were all over the place, I ignored it (and spotted the problem mostly by coincidence when playing with the math).


#13

@martinmodrak I realize your responding to Michael directly, so I’m just going to comment on the general situation you’re finding yourself in. We all (?) agree that there are a variety of ways of treating trajectories that contain divergent iterations. Your examples nicely show that sometimes you can get an improved understanding of the system by producing something other than what the sampler currently outputs.

I’m going to suggest an alternative path here because the main issue in Stan (and lots of other OS projects) development for good ideas is just developer time and priorities. I think there are two options (I’m partial to #2)

  1. Demonstrate that there’s a clearly superior diagnostic stream that can be generated for a wide range of models. The couple of models you have are a great prototype but lots of things that have potential for inclusion in Stan (e.g.-ADVI!) looked good to begin with but were really problematic long-term (combo of implementation quality, theoretical issues, maintenance burden). I realize that this is basically asking for a solid simulation study so that’s not great.

  2. Get involved in sorting out the I/O issues. Getting diagnostics out of Stan is hard to do in a hack-free way right now because all the output streams are convoluted. We’ve talked a number of times (there’s stuff on a Wiki) about getting better support in the core C++ code for split outputs (samples, moment, gradients, etc…) and typed error/diagnostic/warning messages. If that scene were cleaned up I think your suggestion does basically reduce to doing some careful plumbing to avoid performance problems. I’ll try to make a specific proposal based on the discussion so far and put it up on the Wiki this weekend.

Other than those two options (maybe there are others?) I think this is going to be a frustrating process.


#14

@martinmodrak, re: output plumbing: https://github.com/stan-dev/stan/wiki/Design:-Consolidated-Output-for-Sample,-Optimize,-ADVI,-and-Diagnose

Specifically check out what @syclik outlined for sampling outputs (I think we have like three writers right now that get plumbed in and we likely need more, or we need one output-thing and typed outputs so that one class can send them to the correct place.


#15

Yeah, that’s constructive, I like that, thanks :-) I had a bit of a steam when writing my last post, so I guess I might have come across as demanding action from you - sorry for that. Will look into the plumbing issues and see if I can commit to help sorting these out. I certainly think that the current output is the best use of the limited capabilities. The divergences endpoints are IMHO useful but only as additional output.

Also, if you happen to have some interesting diverging models+data, I’d be happy to collect them to have more examples and better understanding whether this diagnostic is worth the hassle.

I think you can also formulate a theoretical-ish argument why the divergence endpoints look good in the examples above: we are trying to choose one point to represent a whole trajectory. Choosing one such that the chain stays valid means we are choosing points that are in a sense close to other points in the chain. Choosing the ends means choosing the ones that are farthest away. The latter may help you spot, what is different about the problematic regions. You could also probably think about it as censored data.


#16

Don’t worry about it. It’s basically the iron law of the internet that it sucks for communication unless people already agree. Please do post if you look at the conversation so far and want to push it forward. It might be helpful to do a call focused on resolving this stuff so we can get a more fleshed out proposal to the group.

For a set of trajectories we could probably do some clustering to identify a region where the energy grows faster (or whatever metric gets thresholded, can’t recall RN) rather than worrying about which point to viz.


#17

That was a cool blog post. I haven’t read all the posts in here yet, but just for fun I made a heatmap of the log-likelihood with samples from Stan’s VI overlayed. I guess this is why we have to be careful about inferring posteriors using VI.

image


#18

Great idea! I often find that I end up with single digit numbers of divergent transitions, and it can be hard to see where they are concentrating. If I increase \delta, it can get rid of the transitions, but slows down the model, so end up with of trial and error to eliminate them. It looks like this gives a much better idea of where things go of the rails.

I had some in one of the model parameterizations in my StanCon submission. I reparameterized it to get rid of them. But it would be interesting to see what your analysis shows.


#19

Let me clarify some important issues here.

Divergences are not points – they are entire trajectories. The default threshold used to identify divergences is somewhat arbitrary and hence the point that first passes the threshold need not have any interpretability. There are strong theoretical reasons why selecting any particular point is not well-defined and hence a poor diagnostic.

So why not use the entire trajectory? In particular if we had an additional diagnostic stream then why not dump all trajectory information into that stream? Well the first question is what exactly defines a trajectory? Is it just the positions or the momenta as well? Is it the positions on the unconstrained space or the constrained space?

Then there’s the issue of I/O. Trajectories for nontrivial models typically fall between 100 and 1000 points, leading to significantly increased size requirements. Even when streaming this can be a limiting factor, which then requires consideration of how to gracefully exit when we’ve run out of space to write to the new stream.

And that’s assuming that we can even pipe a new stream through the code. Because of the way the writers currently work this is a pain but not impossible (it touches a lot of code, especially tests). What really makes this a pain is that all of the trajectory building and sampling code is encapsulated away from the writers – in order to dump the entire trajectory we’d have to pass the writers through the recursive trajectory builder and interrupt the building to write out each state. As this would be a runtime decision there is the ever present danger of losing compiler optimizations with the new indirection.

There are significant coding, design, and maintenance burdens for a change like this and heuristic arguments from two models, one of which doesn’t actually show improvement over the current diagnostics, is in my opinion strongly insufficient. And I speak as someone who not only wrote all of the current sampler code (and piped in all of the current writers) but also someone who over the past few years has done all of these hacks before to experiment with algorithms and visualizations that utilize the entire trajectories.


#20

Thanks for the feedback. The point that it has been tried before with negative results is an important one. Could you provide more details/thoughts on why terminal points of divergences (or other diagnostics you’ve tried) didn’t provide useful additional information and/or for which kind of models?

However, the current state of affairs is that Stan summarizes the whole trajectory with a single point that is chosen to maintain properties of the Markov chain, not to be informative. I find it hard to believe that it would be difficult to improve on that (e.g. by summarizing the trajectory by two points). I am not searching for a universal diagnostic - just a better heuristic than what we have now. I am open to be convinced by your experience that this is not possible, but the arguments put forth so far are IMHO not very strong.

I however understand that there are costs to adding diagnostics, and I guess you are arguing more on the line that more diagnostics are not worth the cost. I see three ways new diagnostics can be costly: maintenance, performance and developer time. I think all of those costs can be kept low while the potential gain is non-trivial, but I am open to be proven wrong.

Maintenance: The diagnostic code has to be kept up to date, including the plumbing to interfaces. @sakrejda helpfully pointed out that the main bottleneck to adding diagnostics is the output plumbing from Stan which was already in need of a refactor (see this thread: Proposal for consolidated output). If this works as intended, adding new stream of data could really be just few lines of code in both Stan core and the interfaces as all the serialization and labelling would be done once. This should also keep the maintenance cost low.

Performance: This is a serious one. You correctly point out that some runtime decisions would have to be made as requiring recompilation of model when different diagnostics are desired is not very user-friendly. I don’t have enough understanding of C++ internals to be able to guess if that might affect compiler optimizations (and I guess this would be model-dependent anyway). We can however make a simple static switch between “standard output” and “advanced diagnostics” - both variants of the model could probably be compiled at the same time with little additional compilation effort.

Developer time: I currently have some time to work on refactoring the output and then testing additional diagnostics as this issue happens to be close to my heart. There is however non-negligible cost on the core team as well - some supervision will be required. I understand if you believe this is a commitment that the team should not make at this time. You may also believe that the Stan community will better benefit from me investing time in something else - if so, do you have some suggestions?

Going forward, I see at least two ways to gather more information than just the sample + end of divergent transition, if that proves to not be useful enough:

  1. Saving all steps is problematic, however, if we output RNG state and unconstrained params (and maybe a few other bits) per iteration, it should be possible to rerun an iteration on demand (e.g. only for divergent and possibly only for the ones of interest). This would however not be useful for large models where a single iteration takes a long time (but storing everything is not an option for such models anyways).

  2. We could also dump points where the difference in the Hamiltonian first exceeds a certain value (e.g. 1/4 and 1/2 of the threshold for divergence). This would be cheap and have constant overhead in output size.