Sampler Diagnostics

Thinking more about it: if I remember this correctly, the energy should be constant throughout each path, and the divergence criterion is something like the accumulated energy deviation. Is this correct? If so, all I need might be a way to calculate the energy at each leapfrog step. Since the positions and momenta are returned this should be possible. I’d also need a function that returns the log-likelihood (± some constant) given a position, or ideally the energy itself. Not sure if I’m on the right track here.

It uses an identity matrix unless you tell it otherwise.

Correct

It’s not checking if the trajectory is divergent, but that should be possible to do from the info it returns.

The Metropolis correction isn’t necessary with multinomial NUTS (it doesn’t use it).

The ham object here: https://github.com/bbbales2/RNUTS/blob/master/rnuts_example.R#L12 comes with a member variable that can be used to compute the current potential energy: https://github.com/bbbales2/hamwrapper/blob/master/R/HamiltonianSystem.R#L60 (ignore the Jacobian there – it computes gradients in that function call but throws them away and just returns the potential energy).

We’re supposed to preserve the Hamiltonian. I think the divergence criterion in Stan is just an indicator that says we’re X units of energy away from the true Hamiltonian at some point in the trajectory. When that happens we say the trajectory is divergent. I’m not 100% sure what that threshold is, but you can probably make up a suitable one :D.

Ah, I now see that the mass matrix can be defined in the function that does the compilation. I didn’t expect it there, as it’s something that changes between warmup phases, which occur after compilation.

Thanks for the pointer. Following it I saw that there is also that other member variable “H”, which directly returns the total (potential + kinetic) energy.
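Something like this should then be enough to compute the energy error along a trajectory and flag divergences. This is just a rough sketch: I’m guessing at the argument list of H (positions and momenta passed one leapfrog step at a time), and at the threshold (I believe Stan’s default is 1000).

# qs, ps: matrices of leapfrog positions and momenta, one row per step
# ham$H(q, p): total (potential + kinetic) energy -- the exact signature is my guess
energy_errors <- function(ham, qs, ps) {
  H <- sapply(seq_len(nrow(qs)), function(i) ham$H(qs[i, ], ps[i, ]))
  H - H[1]  # deviation from the energy at the starting point
}

is_divergent <- function(ham, qs, ps, max_deltaH = 1000) {  # 1000 is Stan's default, I think
  any(energy_errors(ham, qs, ps) > max_deltaH)
}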

With respect to that I found this line in the manual:

Is that a typo there? Shouldn’t it be 10^-3?

I feel that in order to test my ideas for divergence diagnostics properly, and to compare them to the existing diagnostics in a meaningful way, I probably need a sampler that behaves in the same way as Stan does in the presence of divergences. This means that it shouldn’t continue sampling from divergent parts of the integration path, for example.
If I knew what exactly Stan does when the divergence criterion is met I could just manually run the individual samples using oneSampleNuts() and compute the correct starting point for the next sample from the output of oneSampleNuts(). I tried reading through the documentation, papers and source to understand what exactly Stan does when the divergence criterion is met, but it’s pretty difficult for me.
Maybe all that is needed is the correct formula for the multinomial weights. I’m not sure.

As far as I know the only thing Stan does with a divergent transition is mark that it is divergent. It still samples from the trajectory in the same way as it always would.

Betanalpha wrote:

So the divergence does influence how the sample is chosen.

The quote is from here:

I’ve now solved this problem, at least provisionally, by changing your oneSampleNuts function to check for divergence in addition to U-turn before merging in a new subtree. I hope this is close to how Stan does it.
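In pseudo-R the change amounts to something like the following (a simplified sketch, not the actual oneSampleNuts code; the field names are made up):

# A freshly built subtree is only merged if it contains neither a U-turn nor a
# divergence; otherwise trajectory building stops and the sample is taken from
# what was built so far.
merge_subtree <- function(tree, subtree) {
  if (subtree$divergent || subtree$uturn) {
    tree$stop <- TRUE                 # discard the new subtree entirely
    return(tree)
  }
  # multinomial update: switch to the new subtree's sample with probability
  # proportional to its total weight
  if (runif(1) < subtree$weight / (tree$weight + subtree$weight)) {
    tree$sample <- subtree$sample
  }
  tree$weight <- tree$weight + subtree$weight
  tree
}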

Yeah yeah, looks like you’re right, my bad: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/base_nuts.hpp#L282

Hey,

I just wanted to let you know that I’m working on this and making some progress.

To illustrate this here are some images, without much explanation for the time being. Divergences are indicated in red.

8-school model, output of Stan:

8-school model, output of my prototype:

Next: the sigmoid model from the older thread by @martinmodrak:

Stan output:

prototype output:

Here we can see that (almost?) all divergent iterations touched the left edge, and that many iterations touched the lower edge without becoming divergent. The lower edge is problematic, too, but apparently not problematic enough to cause lots of divergences, at least with the settings I used for the sampler here.

The basic idea is that I’m indicating the position of the leapfrog step that contributed the most to the energy error of the integrator. In reality it’s a bit more complicated, though.

For now I’d just like to ask whether there are any particular types of divergence causes or models which you’d like me to use for testing. But please nothing complicated; at the moment it’s more about testing different types of problematic posteriors than about complex real-world problems. Larger and more complex models would come later.

At the moment I have the two models shown above, a model with a discrete step in the posterior, and a perfectly healthy multivariate gaussian.
Other ideas for testing are models with:

  1. multimodality
  2. phase changes (not sure if that is the right term)
  3. banana or ring shaped posterior
  4. eventually, something larger and more complex

@martinmodrak: I couldn’t find the code and data for the second sigmoid model that you tested in the old thread. Are they available?

Great work! I am intrigued.

The model is described here: Identifying non-identifiability

Not sure if you found that in the previous thread, but this is not correct – the new point is sampled only from the points built up before the doubling that included the divergence. This way of choosing the points also maintains some desirable properties of the sampler, so that it may (although it rarely actually does) sample the correct posterior even when facing divergences. Because of this, the point chosen by your heuristic cannot be returned as the single point from the divergent iteration.

Michael always reminded us that the whole trajectory is important. I guess your heuristic for choosing one point is better than the one I proposed, but it might make sense to make the whole trajectory available. Nevertheless, I am personally heavily in favor of having something good soon rather than something perfect in an uncertain future.

Right now I see great value in you just playing around with this, but should you eventually have the ambition to push this into Stan core, I should warn you that there are some potential roadblocks that may take quite some time and effort to overcome, so please incorporate this into your expectations.

When I attempted this, the biggest obstacle to including additional divergence diagnostics in Stan was that it required a refactor of the output functionality, which we discussed at Proposal for consolidated output, but the discussion didn’t really get anywhere, we all burned out on it, and AFAIK the refactor never happened.

But that’s far future: the project is definitely worthwhile on its own (it might even end up as a nice little paper by you or something), and if it proves to work well, we’ll figure this out!

Best of luck!

Yes, I’m aware!

Yes! It’s supposed to be in addition to the samples. It seems clear to me that there is no other way to get better diagnostics.

I absolutely agree. The thing is, though, that even if you have the whole trajectory you still need some tool to extract the info you’re looking for from it quickly and ergonomically. This heuristic is such a tool.

Currently I also envision a “dynamic” version of my diagnostic that can be used to interactively explore problematic regions in the posterior. That would require the full trajectories to be available so the diagnostic can be recomputed with different settings. But this is just an idea at the moment; I didn’t actually try it out yet. The diagnostic can not only be used on divergent iterations: it also works on non-divergent ones, and can thus be used to diagnose treedepth problems.

I liked your idea from the previous thread to store just the RNG state so that the trajectory can be reconstructed on demand.
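In R that could be as simple as snapshotting .Random.seed (just a toy illustration; one_iteration here stands in for whatever the sampler does per iteration):

one_iteration <- function() rnorm(5)   # stand-in for a single sampler iteration

set.seed(1)                            # make sure .Random.seed exists
saved_seed <- .Random.seed             # snapshot of the RNG state
draw1 <- one_iteration()
.Random.seed <- saved_seed             # restore the state ...
draw2 <- one_iteration()               # ... and the same "trajectory" is rebuilt
stopifnot(identical(draw1, draw2))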

Yes, I saw that in the old thread. One of the last posts there suggested that there are basically two main problems: finding a good diagnostic and how to pipe the info through. I can only help with the former, and hope it may provide motivation for the latter.

Beautiful. I’ve been wanting this feature ever since I learned divergences were plotted at their starting point.

Agreed!

This got lost to me in the shuffle of Discourse, but those are really nice plots.

@Raoul-Kima If you want examples, just search for divergences here and find posts where people have posted the model and data.

There should be plenty!

Thanks!

As far as I know this is a very(!) common misconception (I wonder why?). I don’t know it 100% either, but the way I understood it is that within each iteration Stan (or more precisely: its current (multinomial-)NUTS implementation) builds its tree of leapfrog steps until it detects either a U-turn or a divergence, or hits the maximum treedepth. It then multinomial-samples from the tree built so far. I’m not sure whether there is actually any further difference between divergent and non-divergent iterations. The reported point, both for divergent and non-divergent iterations, is simply the “sample”, which is the point that is chosen in the multinomial sampling. But as I said, I’m not 100% sure about this.

These would be real-world examples, and I will look at those at some point. At the moment, however, I’m mainly looking for theoretically motivated models that try to provoke divergences in “creative” ways. In other words, I’m just trying to cover all the different base types of problems.

That’s how base HMC handles the Metropolis correction (which multinomial-NUTS doesn’t do). Maybe base-NUTS did as well – not a bad assumption.
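For reference, in plain HMC that correction is just the usual accept step (a sketch; H here is the total energy, i.e. the negative log joint density of position and momentum):

# accept the proposed endpoint with probability min(1, exp(H_start - H_end))
metropolis_accept <- function(H_start, H_end) {
  runif(1) < exp(H_start - H_end)   # ratios > 1 always accept
}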

I dunno if we really know the base set of problems. But fair to ask. Things that are correlated and things that have heavy tails and things that are high dimensional are the standard difficult things from a statistics perspective.

But those problems are sometimes statistics problems, sometimes numeric problems, sometimes adaptation problems, and sometimes coding problems. The issue I have with divergences is it’s hard to get to the bottom of what’s causing them without just guessing. They definitely tell you something is up – but it’s hard to get much else than that from them. Search for threads with “divergences” and you’ll see how shaky our advice is :D.

Here’s a case I think there were divergences cause of a lack of constraints: Divergent transitions in hierarchical logit

Here’s a model you can get to run without divergences if you make adapt_delta really high (but you’ll get them before that): accel_splines.data.R (207.4 KB) accel_splines.stan (2.8 KB)

Here’s a model on the forums where we had to do a non-centering and then a little parameterization change to get rid of divergences: Divergent transitions

Here’s a @betanalpha model where there is funnel behavior but it’s in a weird place: Underdetermined Linear Regression

Adding to what @bbbales2 said, I would certainly cover the following (I didn’t test whether these examples produce divergences on their own, but I know that some models I’ve used that included them had divergences):

  • multimodal posteriors (e.g. a mixture model without an ordering constraint, 2D mixtures)
  • numerical issues (e.g. a model that has to compute log(1 + p) for a parameter p very close to zero, or normal_lcdf(0 | mu, 1) for a parameter mu < -15)
  • multiplicative weak non-identifiability, e.g. y ~ normal(a * b, 0.1) with no further information on a and b except for priors

Best of luck!

Thanks for the suggestions! I’m working on it.

Here’s the output for a model that is numerically challenging:

In this particular case it manifests as vertical stripes, a pattern that is distinct from many other sources of sampling problems. But I don’t think this will always be the case. This model didn’t actually throw any divergence warnings (in the test system; I didn’t test with Stan), but the diagnostic doesn’t require that to work. I just set it to show the most problematic parts of the posterior.

Here’s the code:

parameters {
  real a;
  real b;
}
model {
  b ~ lognormal(-36, 0.5);
  log(1 + a) ~ normal(0, b);
}

Here’s how it looks without the diagnostic overlaid:

Some of the steps are easy to see, but many aren’t.

Looks cool!

So right now there are samples and then there is some additional output that is printing the steps that contributed the most to the error in lp or something like that?

Do you have an explanation for the stripes, or why we’d expect them in this model?

In the current version there are two components to it, but this is still subject to change:
The first component is that for each iteration two points are returned: the posterior sample and the diagnostic point, as you wrote.
The second component is that there is an additional diagnostic variable alongside energy__, lp__, divergent__ and such, which gives the contribution of the leapfrog step originating from that point to the energy error. This can be used to filter the returned diagnostic points to make the pattern clearer, especially when there are no or few divergences. This second component alone is already sufficient to create diagnostic plots that show a lot of the properties of the plots above (in that case it is used to filter the posterior samples instead of the diagnostic points). This is very interesting because the second component is hopefully much easier to implement in Stan than the first. The plots are most informative when both components are used, though.
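As a rough illustration of the second component (toy data and made-up names, not the prototype’s code):

# one vector of leapfrog-step energies per iteration, plus the corresponding draws
step_energies <- list(c(10, 10.1, 10.05), c(10, 12, 25), c(10, 9.9, 10.2))
draws <- data.frame(a = rnorm(3), b = rnorm(3))

# extra diagnostic column: largest single-step energy error in each iteration
max_step_error <- sapply(step_energies, function(H) max(abs(diff(H))))

# highlight only the iterations whose worst leapfrog step was unusually bad
worst <- draws[max_step_error > 1, , drop = FALSE]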

These are not part of the model. The model should return an almost perfectly gaussian distribution on the x-axis, with the y-axis determining the standard deviation of this gaussian. The stripes seem to be caused by numerical imprecision in the calculations, due to the extremely small variance of the gaussian (10^-15) in combination with the log-transform. The point of this test was to see how the diagnostic reacts to numerical problems, and whether it can be used to differentiate them from other types of problems.
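The kind of quantization behind the stripes can be seen directly in R (this only illustrates the floating-point effect, it is not the prototype’s code):

# Near a = 0 the expression 1 + a rounds to the nearest representable double,
# whose spacing is about .Machine$double.eps (~2.2e-16). With a prior scale of
# roughly exp(-36) ~ 2e-16 the parameter lives right at that granularity, so
# log(1 + a) collapses onto a handful of discrete values.
a <- seq(-5e-16, 5e-16, length.out = 1001)
length(unique(log(1 + a)))   # only a few distinct values -> "stripes"
length(unique(log1p(a)))     # log1p avoids the rounding and stays smooth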

It goes forward and backward in time at random, doubling the number of steps each time. If there’s a U-turn or divergence in the final doubling, the whole final doubling gets thrown out to maintain reversibility.

The original NUTS in Stan and the JMLR paper used slice sampling. But the key is the bias toward the last doubling. It’s not just based on the joint density of position and momentum.
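In sketch form, and glossing over details: when a new doubling with total weight w_new is merged with the trajectory built so far (weight w_old), the proposal is taken from the new half with probability min(1, w_new / w_old) rather than w_new / (w_old + w_new), e.g.

# biased progressive sampling: favors the freshly built doubling
choose_from_new_doubling <- function(w_old, w_new) {
  runif(1) < min(1, w_new / w_old)
}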

How are the red draws selected as being problematic?

The example model’s just the exp-transformed funnel centered at -36. But there’s a problem here: a is constrained to be greater than -1, but that’s not in the model. The declaration of a needs to be changed for consistency. It’s very easy to take independent draws from this model because there’s no data.

Ah, interesting, so basically the currently reported divergences are biased away from the location of the “problem zone” because the part of the path in the problem zone is partly or completely removed before the sample gets chosen.
I think I’m not doing that in my test system. I keep the last doubling so I can diagnose these regions.

The red points do not represent draws from the posterior. And they are not necessarily very problematic, in fact they might be perfectly fine. They just represent “the most problematic parts of the posterior”, which in an unproblematic posterior is of course unproblematic. If your model throws divergences, that is when you know that there is probably something problematic at hand.

The procedure to generate the red points is roughly as follows:

For each sampler iteration (divergent or not) one red point is returned. This point is basically the starting location of the leapfrog step with the highest energy difference between its start point and end point. In practice it’s not quite that simple, because the integrator’s errors also introduce errors into the point selection. For example, if it diverges it often goes to some crazy location and then produces leapfrog steps with even bigger errors there. Therefore the red point is actually multinomial-sampled from amongst all leapfrog steps in that iteration, weighted by the log of the energy error produced in that step and by the posterior density. I’m not sure off the top of my head whether I have any other correction terms. I’ll do a detailed writeup once I’m satisfied with my testing on more diverse and realistic models.

These red points (one for each iteration) are then further filtered, to more clearly separate the problematic from the unproblematic parts. If the model has a large number of divergences it’s as simple as filtering out points from iterations without divergences. If the number of divergences is low or zero (e.g. models with treedepth problems), then I’m filtering by the size of the energy error produced in the leapfrog steps which the red points represent.
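A minimal sketch of that selection step (made-up inputs: q is a matrix with the start position of each leapfrog step as rows, H the energy before and after each step, lp the log posterior density at each start position; how exactly the two weights are combined here is a guess on my part):

pick_red_point <- function(q, H, lp) {
  step_err <- abs(diff(H))                        # energy error of each leapfrog step
  w <- log1p(step_err) * exp(lp - max(lp))        # log1p keeps the weights non-negative
  q[sample(nrow(q), 1, prob = w), , drop = FALSE] # multinomial draw of one start point
}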

The model was not designed to be error-free. I think for the purposes of this test it’s fine.
But I don’t think it is easy to take draws from the model, because it approaches the limits of numerics. My test system at least wasn’t able to, as the images show. I didn’t test with Stan itself, though.

Note: I made at least one mistake in the summary I tried to provide yesterday off the top of my head, and all of this is of course still subject to change anyway (in fact, I just introduced a big change). I’ll provide an accurate and more detailed description when it is more stable.
