Are outputs with divergent transitions not at all useful?

Hi all,

I ran a simulation to test a model’s performance. I predetermined the “true” values of all parameters and simulated the data-collection process with observation error imposed. I then fit the model to the simulated data to estimate the parameters and compared the estimates with the predetermined “true” values. The accuracy and precision of the parameter estimates were acceptable (pretty close to the “true” values). However, there were large numbers of divergent transitions, which indicates the chains encountered regions of high curvature in the target distribution. Can I still trust the model and apply it to real data (which also produces divergent transitions), given that the estimates are close to the “true” values even with divergent transitions?
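
To make the workflow concrete, here is a minimal sketch of this kind of simulate-and-recover check in R with rstan; the normal-mean model, parameter names, and values below are just placeholders standing in for the actual model.

```r
# Minimal sketch of a simulate-and-recover check (toy model, not the real one):
# fix a "true" parameter, simulate noisy observations, refit, and compare the
# posterior to the truth while watching for divergences.
library(rstan)

stan_code <- "
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 2);
  y ~ normal(mu, sigma);
}
"

set.seed(1)
mu_true    <- 1.5        # predetermined "true" value
sigma_true <- 0.7        # observation-error scale
N <- 50
y_sim <- rnorm(N, mu_true, sigma_true)   # simulated data collection

fit <- stan(model_code = stan_code, data = list(N = N, y = y_sim),
            chains = 4, iter = 2000, refresh = 0)

print(fit, pars = c("mu", "sigma"))   # compare estimates with the "true" values
check_hmc_diagnostics(fit)            # reports divergent transitions, treedepth, E-BFMI
```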

1 Like

In general, the way divergent transitions can lead to really bad inference is if the true parameter values sit in (or “behind”) the region where the divergences start cropping up. For example, if you have a funnel geometry, then the way for your inference to be badly wrong is if the true parameter values actually sit down in the narrow part of the funnel where you cannot explore. The divergences are telling you that your data doesn’t rule out the possibility that the true parameter values sit in a part of the posterior that isn’t getting explored. In your particular case you know the true parameter values and presumably they are sitting in a part of the posterior that HMC fitting is able to access, but in a real-world setting you wouldn’t know this.
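
For concreteness, Neal’s funnel is a minimal example of this kind of geometry; under the centered parameterization below the sampler typically cannot reach the narrow neck (small tau) and reports divergences.

```r
# Illustration of the funnel geometry mentioned above (Neal's funnel).
# The centered parameterization typically produces divergences because the
# sampler cannot explore the narrow neck of the funnel.
library(rstan)

funnel_code <- "
parameters {
  real tau;
  vector[9] theta;
}
model {
  tau ~ normal(0, 3);
  theta ~ normal(0, exp(tau / 2));
}
"

fit_funnel <- stan(model_code = funnel_code, chains = 4, iter = 2000,
                   refresh = 0, seed = 1)
get_num_divergent(fit_funnel)   # typically > 0: the neck (small tau) goes unexplored
```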

2 Likes

This is good practice, but it is superseded by a more rigorous approach called Simulation-Based Calibration (SBC). See here for code to run SBC with R & Stan.
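
A bare-bones sketch of the SBC rank check for one parameter, reusing the toy model from above, might look like the following; for a real analysis you’d want the linked R & Stan code rather than this hand-rolled loop.

```r
# Bare-bones sketch of the SBC rank check for a single parameter (mu in the
# toy model above). Draw the "true" value from its prior, simulate data,
# refit, and record the rank of the true value among thinned posterior draws;
# over many replications the ranks should be roughly uniform.
library(rstan)

compiled <- stan_model(model_code = stan_code)  # stan_code from the sketch above
n_reps <- 100
L      <- 100                                   # thinned posterior draws per replication
ranks  <- integer(n_reps)

for (r in seq_len(n_reps)) {
  mu_true    <- rnorm(1, 0, 5)                  # draw from the prior on mu
  sigma_true <- abs(rnorm(1, 0, 2))             # draw from the half-normal prior on sigma
  y_sim <- rnorm(50, mu_true, sigma_true)
  fit   <- sampling(compiled, data = list(N = 50, y = y_sim),
                    chains = 4, iter = 2000, refresh = 0)
  mu_draws <- extract(fit, pars = "mu")$mu
  idx <- round(seq(1, length(mu_draws), length.out = L))  # thin to L draws
  ranks[r] <- sum(mu_draws[idx] < mu_true)      # rank statistic in 0..L
}

hist(ranks, breaks = 20)                        # should look roughly uniform
```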

3 Likes

Just clarifying my own understanding: it can also be the case that the true values get sampled, but the sampler doesn’t explore the neighborhood of the true values in a way that generates samples reflecting properly propagated uncertainty. So your posterior distributions may have their modes centered on the true values, but their spread will not reflect the proper Bayesian update from the priors given the data. Correct?

Not quite. Posterior distributions will not in general have their modes at the true values; if the modes were on the true values, we wouldn’t need all this uncertainty stuff. If the true value were guaranteed to reside within the well-explored region, it would even be desirable to trim off the rest. The problem is that we don’t know whether that’s true.

Edit: or maybe you’re exactly correct, I’m not sure :P. The point is that parts of the posterior are getting poorly explored, and because these parts are actually being encountered by the post-warmup exploration, we have no confidence that the true values aren’t in them.

4 Likes

Thanks, this is super clear. In my case, the range of the parameter concerned in the real world is known, and the simulation samples the “true” value of the parameter from that known range, repeating the process to ensure ergodicity. Does this imply (though not guarantee) that the true value in the real world is unlikely to sit in or “behind” the region where the divergences occur?

Thanks. This is super helpful.

the range of the parameter concerned in the real world is known

Then you should encode this knowledge via the prior. If this suppresses all divergences, then you’re fine. If you’re still seeing divergences, that means that the nasty geometry overlaps regions where your prior and data collectively think that the true value might reside.
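
A trivial sketch of what encoding that knowledge can look like (phi is a placeholder parameter name; the bounds, priors, and data below are made up), followed by a re-check of the divergence count:

```r
# Hypothetical sketch of encoding a known real-world range for a parameter
# (phi is a placeholder name; bounds, priors, and data are made up) and then
# re-checking the divergence count.
library(rstan)

bounded_code <- "
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=10> phi;   // hard bounds when the range is truly known
  real<lower=0> sigma;
}
model {
  phi   ~ normal(5, 2);          // softer information within those bounds
  sigma ~ normal(0, 1);
  y ~ normal(phi, sigma);
}
"

fit_bounded <- stan(model_code = bounded_code,
                    data = list(N = 30, y = rnorm(30, 4, 1)),
                    chains = 4, refresh = 0)
get_num_divergent(fit_bounded)   # 0 is reassuring; > 0 means the nasty geometry
                                 # overlaps the region the prior and data find plausible
```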

Thanks. Yes, I do use a prior that reflects the realistic range of the parameter. However, the divergences are still present.

1 Like

Just wanted to expand upon @jsocolar’s answer a bit.

In Bayesian inference we aim to construct a posterior distribution that quantifies how compatible each model configuration is with both the observed data and whatever domain expertise we encoded in the prior model. The hope is that this compatibility is a reasonable proxy for how close the model configurations are to the true model configuration (if the model contains the true model configuration) or how well each model configuration approximates the true model configuration (in the more realistic case where the model does not contain the true model configuration). In many cases it is a good proxy, but to ensure robust analyses we have to verify that with, for example, simulation studies.

This is all well and good in theory, but in practice we can only estimate the compatibility encoded in a posterior distribution. In other words, a computational method that we run to quantify the posterior distribution might not find all of the model configurations compatible with the observed data and the prior model. It might even include extraneous model configurations that aren’t actually all that compatible. In this case, even if the exact posterior covers useful model configurations, our computational approximation might not.

Divergences are a diagnostic for Hamiltonian Monte Carlo that indicates incomplete quantification of the posterior distribution. In other words, if we see divergences then we know that at least some compatible model configurations are not being included. This exclusion might be small, in which case the incomplete posterior quantification might still be useful, or it might be large, in which case it will provide very misleading insights about the exact posterior behavior. The problem is that in practice we cannot distinguish between these possibilities until we identify what kind of behavior is causing those divergences. This follow-up investigation is discussed, for example, in Identity Crisis.
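
A rough sketch of that kind of follow-up investigation, using the funnel fit from earlier in the thread as a stand-in and the bayesplot package, is to pull out the post-warmup divergent iterations and look at where they concentrate in parameter space:

```r
# Rough sketch of a follow-up investigation: find the post-warmup divergent
# iterations and see where they concentrate in parameter space (fit_funnel is
# the funnel fit from the earlier sketch; any stanfit object would do).
library(rstan)
library(bayesplot)

sp  <- get_sampler_params(fit_funnel, inc_warmup = FALSE)
div <- unlist(lapply(sp, function(x) x[, "divergent__"]))
mean(div)                              # fraction of post-warmup iterations that diverged

# Scatter plot with divergent iterations highlighted; if they cluster in one
# region (e.g. the funnel's neck at small tau), that is the part of the
# posterior the sampler is failing to explore.
mcmc_scatter(as.array(fit_funnel), pars = c("theta[1]", "tau"),
             np = nuts_params(fit_funnel))
```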

The most robust way to take advantage of posterior uncertainty quantification is to ensure accurate computation. Sometimes inaccurate computation can be useful, but in practice that’s a tricky game to play and personally I’ve seen it fail far more often than it succeeds.

3 Likes

Thank you for your explanation, @betanalpha. If a simulation finds that the estimates with divergences are close enough to the “true” values, does that mean the “exclusion of some compatible model configurations” you mention might be less important for the estimates of that particular parameter? The parameter estimates would thus still be useful, especially when we have strong confidence that the model structure is right. In addition, we found that the divergences stem from the quantity of data: they disappear if we increase it (e.g., observe the state once every year instead of once every two years).

If you can get rid of divergences with better data, then by all means use the better data and feel good about the model! But that does not mean that you can trust the model output when fitting to lower-quality datasets that yield divergences.

It means that it wasn’t so important in the particular simulation you did. But when you go back to the real world, fitting to real data, you won’t know whether or not the estimates are close enough to the true value, because you won’t know the true value. The simulations tell you that for some datasets realized under some true parameter values, the divergences don’t substantially impede estimation (if anything, they’re desirable, as the portion of the posterior that they are “trimming off” doesn’t contain the true value). But hitting divergences post-warmup in the real world tells you that the true value really might sit in the region where the algorithm has failed to explore. Thus, simulation results like these should not increase your confidence in the reliability of the computation in a real-world setting whatsoever.

Edit: put another way, divergences tell you that you shouldn’t be confident in your posterior inference. That doesn’t guarantee that the inference is wrong, and so it shouldn’t be a surprise to see that the inference can be substantially right in some simulations. But this is not evidence that the inference is substantially right all or even most of the time. The tendency for the inference to be substantially right despite divergences will depend on the true values of the parameters. These are unknown; they are what you are trying to estimate. It’s not particularly useful to make statements like “if the true parameter value is x, then the computation is ok”, because if you know that the true parameter value really is x, then you don’t need a model to estimate x.

1 Like