@nhuurre mentioned some of the important points, but let me follow up on a few of them and explain why Stan behaves the way it does.
“Divergence” does not describe a point but rather an unstable numerical trajectory. In practice this is defined as a numerical trajectory that includes a point where the error in the Hamiltonian function exceeds 1000; the Hamiltonian is constant along exact trajectories, so along stable numerical trajectories the error should only oscillate within a bounded range. Technically this might result in false positives – in very high dimensions the magnitude of the stable error oscillations can in principle exceed 1000 – but I’ve never been able to construct an explicit false positive in practice.
When Stan encounters a diverging numerical trajectory it rejects the current trajectory expansion and then samples from the pre-existing trajectory. This ensures that the Markov transition still asymptotically converges to the target distribution. Asymptotic convergence doesn’t mean much without fast enough preasymptotic convergence, which the appearance of divergences calls into question, but it’s the safest failure mode. In any case the Stan output should not be interpreted as “a point that diverged” but rather “a point that comes from a trajectory that diverged somewhere”!
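To make those mechanics concrete, here’s a heavily simplified sketch in Python – definitely not Stan’s actual implementation, which expands trajectories randomly backwards and forwards in time with a more careful multinomial scheme – of both the Hamiltonian error check and the “reject the expansion, sample from the pre-existing trajectory” behavior, assuming a one-dimensional standard normal target purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(q):
    # Gradient of the potential energy for a standard normal target.
    return q

def hamiltonian(q, p):
    return 0.5 * q * q + 0.5 * p * p

def leapfrog(q, p, eps):
    p -= 0.5 * eps * grad_U(q)
    q += eps * p
    p -= 0.5 * eps * grad_U(q)
    return q, p

def transition(q0, eps, max_depth=10, max_delta_H=1000.0):
    p0 = rng.normal()
    H0 = hamiltonian(q0, p0)
    trajectory = [(q0, p0)]

    for depth in range(max_depth):
        # Try to double the trajectory length with 2 ** depth more steps.
        q, p = trajectory[-1]
        expansion = []
        diverged = False
        for _ in range(2 ** depth):
            q, p = leapfrog(q, p, eps)
            # Divergence criterion: the Hamiltonian error exceeds the threshold.
            if hamiltonian(q, p) - H0 > max_delta_H:
                diverged = True
                break
            expansion.append((q, p))
        if diverged:
            break                      # discard the entire diverging expansion
        trajectory.extend(expansion)   # otherwise keep it and continue doubling

    # Sample the next state from the *retained* trajectory, weighted by the
    # canonical density exp(-H); the divergent point itself is never a candidate.
    log_w = np.array([-(hamiltonian(q, p) - H0) for q, p in trajectory])
    w = np.exp(log_w - log_w.max())
    idx = rng.choice(len(trajectory), p=w / w.sum())
    return trajectory[idx][0]
```

With a small step size, say `eps = 0.1`, the expansions stay stable; with a step size above the integrator’s stability limit the error quickly blows past the threshold, the diverging expansion is thrown away, and the returned state comes from whatever short trajectory had accumulated beforehand.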
Because of this the “points from divergent trajectories” will not always fall into the problematic region of the posterior distribution. If the pre-existing trajectory is long enough then many of the states that can be sampled will actually be very far away from the problematic region. That said, the pre-existing trajectories from many divergent transitions will tend to overlap near the problematic region, and hence the samples from those trajectories will tend to concentrate near it. It’s the ensemble behavior of divergent transitions that provides the most diagnostic information.
As an example consider a posterior distribution that manifests a “pinch” or “spike”, especially in high dimensions. A numerical trajectory that starts far away from the problematic region can be very long before encountering the problematic region and diverging, and consequently a point sampled from that trajectory can be far away from the problematic region. If, however, you look at many trajectories that start far away but eventually fall into the problematic region and diverge, then you’ll see them concentrate more and more as they get closer to the problematic region. If you sample from all of these trajectories then the sampled points will tend to be closer to the problematic region than far from it. Moreover, once you start building a trajectory in the problematic region then most trajectories will quickly start to diverge in both directions so that all sampled points are close to the problematic region; only rarely will the sampled momentum be oriented just right so that one end of the numerical trajectory can escape into more well-behaved neighborhoods and pull the sampled states away. This circumstance would trigger the tree depth heuristic @Raoul-Kima mentioned.
I’ve looked at many fits that exhibit divergences, sometimes going through hundreds of pairs plots, and I have yet to see an example where the divergences don’t actually concentrate. Almost always the problem is either not enough divergences to resolve any ensemble behavior – in which case one can just run longer Markov chains, or even more Markov chains – or more likely that I’m just not looking at the right pairs plot. See for example the discussion in Section 4.1 of Identity Crisis.
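To show what that ensemble behavior looks like in practice, here’s one possible workflow in Python with cmdstanpy and ArviZ, using a centered funnel model as a stand-in for a posterior with a “pinch”; the model and file names are just placeholders for this illustration, and with Stan’s default settings this fit will typically produce at least some divergences.

```python
import arviz as az
from cmdstanpy import CmdStanModel

# A centered funnel-like model whose "pinch" at small tau typically triggers
# divergences when fit with Stan's default settings.
with open("funnel.stan", "w") as f:
    f.write("""
parameters {
  real tau;
  vector[9] theta;
}
model {
  tau ~ normal(0, 3);
  theta ~ normal(0, exp(tau / 2));
}
""")

model = CmdStanModel(stan_file="funnel.stan")
fit = model.sample(chains=4, iter_warmup=1000, iter_sampling=1000, seed=1)

idata = az.from_cmdstanpy(posterior=fit)
print("divergent transitions:", int(idata.sample_stats.diverging.sum()))

# Pairs plots with divergent draws highlighted; the ensemble of divergences
# should concentrate towards the narrow end of the funnel (small tau).
az.plot_pair(
    idata,
    var_names=["tau", "theta"],
    coords={"theta_dim_0": [0, 1]},   # a few components at a time
    divergences=True,
)
```

If the highlighted points don’t obviously concentrate then the first things to try are running longer or additional Markov chains to accumulate more divergences, or cycling through other pairs of parameters until the right projection reveals the concentration.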
But let’s say that we do want more information from these divergent trajectories. What could we do?
Firstly the architecture of Stan itself drastically limits the possibilities. The Stan Markov chain Monte Carlo library, from input/output to diagnostics, was designed to be modular and support any Markov chain Monte Carlo algorithm. In particular everything is built around a sequence of states from the constrained parameter space augmented with only scalar diagnostic information. Passing along more states would require substantial redesigns in multiple places. Keep in mind that back in the day we didn’t know what might compete with Hamiltonian Monte Carlo; it’s only with a decade of experience that we’ve found nothing else that was even worth implementing. With this hindsight designing around just Hamiltonian Monte Carlo would make more sense, but that is another discussion in and of itself.
If we did have a diagnostic output stream available then what might we put there? Remember that a divergence isn’t a point. The divergence diagnostic indicates when a numerical trajectory has passed through a problematic region and become unstable, but not what point passed through the problematic region. In fact because unstable numerical trajectories are so explosive the terminal states can be quite far from the problematic region by the time the error hits 1000.
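Here’s a toy numerical illustration of that overshoot – my own construction in Python, not anything Stan reports – using a one-dimensional normal target with a very small standard deviation as a stand-in for a narrow problematic region, integrated with a leapfrog step size above the stability limit.

```python
import numpy as np

# The narrow region has scale sigma; a leapfrog step size above roughly
# 2 * sigma makes the integrator unstable there.
sigma = 0.01
step_size = 0.05

def grad_U(q):
    return q / sigma**2

def hamiltonian(q, p):
    return 0.5 * (q / sigma)**2 + 0.5 * p**2

q, p = 0.5 * sigma, 1.0   # start inside the narrow region
H0 = hamiltonian(q, p)

for step in range(1, 101):
    p -= 0.5 * step_size * grad_U(q)
    q += step_size * p
    p -= 0.5 * step_size * grad_U(q)
    if hamiltonian(q, p) - H0 > 1000:
        print(f"error exceeded 1000 at step {step}; |q| is now "
              f"{abs(q) / sigma:.0f} standard deviations from the narrow region")
        break
```

By the time the error finally crosses the threshold the state has already been flung hundreds of standard deviations away from the region that actually caused the instability.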
Consequently once a trajectory has triggered the divergence diagnostic we’d want to look at previous states in the trajectory. Unfortunately Stan’s Hamiltonian Monte Carlo sampler doesn’t hold the entire numerical trajectory in memory at any given time, precisely to keep the memory burden as low as possible. To recover those previous states we’d have to rewind the numerical integration, or start again from the initial point, either way requiring more computation.
Even then, what point would we take? Once a numerical trajectory starts to diverge we expect the Hamiltonian error to increase monotonically, but that doesn’t mean that we can just rewind until the error stops decreasing and take that state. In general, working backwards, the transition from unstable to stable behavior will happen before the error bottoms out, and the state with the locally minimal error could be nontrivially far away from the problematic region.
Ultimately the best diagnostic power is not in any single point but rather the entire numerical trajectory, or at least some subset of the trajectory near the divergent end. There are numerous heuristics that would select reasonable subtrajectories, but we would then confront the input/output problem. How should entire subtrajectories be stored? When will storing multiple subtrajectories, in addition to the standard sampler output, introduce problematic memory burdens? If too many states have to be stored at once should the additional diagnostic output be saved only conditionally, and if so then what should the user-interface be for those checks?
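For concreteness, one such subtrajectory-selecting heuristic – purely hypothetical, nothing like it exists in Stan – might keep only the tail of a recorded trajectory where the Hamiltonian error exceeds some fraction of its eventual maximum:

```python
# Hypothetical heuristic, assuming the sampler had recorded each state along
# with its Hamiltonian error in trajectory order, divergent end last.

def divergent_tail(states, errors, fraction=0.1):
    """Return the trailing states whose Hamiltonian error exceeds `fraction`
    of the maximum error observed along the trajectory."""
    threshold = fraction * max(errors)
    # Walk backwards from the divergent end until the error drops below threshold.
    start = len(errors)
    while start > 0 and errors[start - 1] > threshold:
        start -= 1
    return states[start:]
```

Even a simple rule like this immediately runs into the storage questions above: the length of the returned subtrajectory is unpredictable, and saving one for every divergent transition could add up quickly.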
Long story short, the jump from the current behavior to enhanced diagnostics might sound straightforward, but there are numerous theoretical and implementation issues that have to be addressed. And then one has to ask just how much more diagnostic information one would actually obtain over the current diagnostics. In my opinion it’s not as much as many think, especially when the current diagnostics are used carefully.