What (exactly) happens for only piecewise continuous posterior densities, gradients, or wrong gradients?

Hey everyone,

I’m looking for scientific/mathematical references addressing the above question. Specifically, I’m wondering whether it’s a priori clear what happens in the asymptotic limit. Does HMC/“NUTS” (e.g. as implemented in Stan) still sample from the distribution defined through the log posterior density that is used to determine the acceptance rate, even if

  • that distribution’s PDF has discontinuities?
  • the gradient of that distribution’s (log) PDF has discontinuities?
  • the vector field used to simulate the Hamiltonian dynamics is (potentially markedly) different from the gradient field of the distribution’s (negative log) PDF?

I found this prior topic where @betanalpha shared some thoughts. I was wondering whether he or someone else knows of a reliable reference which discusses such questions. Neal, for example, claims (without reference or proof):

HMC can be used to sample only from continuous distributions on R^d for which the density function can be evaluated (perhaps up to an unknown normalizing constant). For the moment, I will also assume that the density is nonzero everywhere (but this is relaxed in Section 5.5.1). We must also be able to compute the partial derivatives of the log of the density function. These derivatives must therefore exist, except perhaps on a set of points with probability zero, for which some arbitrary value could be returned.


Verifying asymptotic Markov chain Monte Carlo behavior is more complicated than is typically presented. Based on your comments I believe that you are referring to the invariance of a given target distribution, but there are other technical conditions that also have to be satisfied for asymptotic convergence, let alone useful preasymptotic convergence. For a recent review see [2110.07032] A Short Review of Ergodicity and Convergence of Markov chain Monte Carlo Estimators or the references therein.

I mention these because, while some of the necessary conditions can be guaranteed by the construction of a Markov transition (for example, aperiodicity is guaranteed if the transition has a finite probability of staying at the initial state), some are just assumed to hold (namely irreducibility and Harris recurrence). These are typically not problematic for most Markov transitions with continuous stationary density functions, but bad enough discontinuities can wreak havoc here.

Anyways, let’s presume that those technical conditions hold and that we’re mostly interested in the invariance of a given target distribution. Technically most Hamiltonian transitions – whether the “hybrid” schemes that use the end of a numerical Hamiltonian trajectory as a Metropolis proposal or the schemes that sample states from the entire numerical Hamiltonian trajectory – will preserve the target distribution even if it contains discontinuities. Assuming the previous technical conditions, these Hamiltonian transitions will eventually reach all points in the ambient space an infinite number of times, even when they have to navigate around discontinuities.

Unfortunately this doesn’t mean much for practical convergence. Even if asymptotic convergence is guaranteed in the presence of discontinuities, the behavior of Markov chain Monte Carlo estimators will suffer dramatically because of them.

The problem is that the practical performance of Hamiltonian Monte Carlo depends on how well the numerical Hamiltonian trajectories can explore the target distribution. When the target density function is smooth, the symplectic integrators that generate these numerical trajectories guarantee all kinds of amazing behaviors. In particular, so long as the symplectic integrator remains stable (i.e. no divergences), the numerical trajectories will stay close to regions of high target probability even after long integration times, so that states in the numerical trajectory far away from the initial state have a high probability of being selected in each transition.
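To make this concrete, here is a minimal sketch of the smooth case (toy code, not Stan's implementation; the names `leapfrog` and `grad_U` and the standard normal target are illustrative assumptions). Even after a thousand leapfrog steps the energy error stays small and bounded, which is exactly what keeps distant states plausible:

```python
import numpy as np

def grad_U(q):
    return q  # gradient of U(q) = q^2 / 2, i.e. a standard normal target

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

def leapfrog(q, p, step_size, n_steps):
    # Standard leapfrog: half momentum step, alternating full steps, half step.
    p = p - 0.5 * step_size * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + step_size * p
        p = p - step_size * grad_U(q)
    q = q + step_size * p
    p = p - 0.5 * step_size * grad_U(q)
    return q, p

q0, p0 = 1.0, 0.5
H0 = hamiltonian(q0, p0)
q, p = leapfrog(q0, p0, step_size=0.1, n_steps=1000)

# The energy error oscillates but does not drift, even over a long trajectory.
energy_error = abs(hamiltonian(q, p) - H0)
```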

Symplectic integrators, however, can only do so much. In the presence of discontinuities (or bias in the gradients) these numerical integrators don’t enjoy the same robust performance. The numerical Hamiltonian trajectories tend to decouple from regions of high posterior probability, so that the longer we integrate the less useful the new states are. In this case the Hamiltonian transitions will do the right thing and avoid those later states, but that just means that they will default to states near the initial state, and the exploration of the resulting Markov chain will slow to a crawl. Even worse, all of the computation spent on generating those distant states will be wasted.
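A toy illustration of this decoupling (again an assumption-laden sketch, not Stan's code): give the negative log density a jump that the gradient never sees. A trajectory crossing the jump picks up a large energy error, so the state past the discontinuity is almost always rejected and the chain stays put:

```python
import numpy as np

JUMP = 3.0  # illustrative jump in the negative log density at q = 0

def U(q):
    return 0.5 * q**2 + (JUMP if q >= 0 else 0.0)

def grad_U(q):
    return q  # the gradient never "sees" the jump at q = 0

def hamiltonian(q, p):
    return U(q) + 0.5 * p**2

def leapfrog(q, p, eps, n):
    p = p - 0.5 * eps * grad_U(q)
    for _ in range(n - 1):
        q = q + eps * p
        p = p - eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

# Start left of the jump with rightward momentum so the trajectory crosses it.
q0, p0 = -1.0, 1.5
H0 = hamiltonian(q0, p0)
q1, p1 = leapfrog(q0, p0, eps=0.1, n=20)

# The integrator tracks only the smooth part of the energy, but the
# Hamiltonian used for acceptance includes the jump, so the acceptance
# probability for states past the discontinuity collapses to about exp(-JUMP).
accept_prob = min(1.0, np.exp(H0 - hamiltonian(q1, p1)))
```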

Really, it’s the practical performance that suffers in the presence of discontinuities and gradient errors. For an example of the latter see [1502.01510] The Fundamental Incompatibility of Hamiltonian Monte Carlo and Data Subsampling, which includes both theoretical and empirical analyses.

Another consequence of these issues is that the assumptions behind Stan’s integrator step size adaptation no longer hold. This adaptation is based on a calculation for well-behaved symplectic integrators that relates the “acceptance statistic” to the step size. Increasing the step size should reduce the adaptation statistic in expectation and vice versa, so that tweaking the step size should allow for convergence towards any given expected adaptation statistic value. Unfortunately this relationship can break in the presence of gradient errors (for example an infinite gradient at a discontinuity, or a persistent bias), where decreasing the step size no longer increases the expected value of the adaptation statistic. This comes up, for example, in complex ordinary differential equation models where the autodiff’d gradients can suffer from substantial numerical error, and the step size adaptation gets stuck in a bad loop where it keeps trying to decrease the step size, thinking that this should eventually improve the adaptation statistic, but it never does.
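The monotone relationship the adaptation relies on is easy to see empirically in the well-behaved case. The sketch below (hypothetical names, standard normal target, exact gradients) estimates the expected acceptance statistic for a coarse and a fine step size; it is precisely this ordering that biased gradients can destroy:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(q):
    return q  # standard normal target, U(q) = q^2 / 2

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

def leapfrog(q, p, eps, n):
    p = p - 0.5 * eps * grad_U(q)
    for _ in range(n - 1):
        q = q + eps * p
        p = p - eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

def mean_accept(eps, n_steps=10, n_draws=2000):
    # Average Metropolis acceptance over exact draws from the target
    # with freshly resampled momenta.
    qs = rng.normal(size=n_draws)
    ps = rng.normal(size=n_draws)
    total = 0.0
    for q, p in zip(qs, ps):
        H0 = hamiltonian(q, p)
        q1, p1 = leapfrog(q, p, eps, n_steps)
        total += min(1.0, np.exp(H0 - hamiltonian(q1, p1)))
    return total / n_draws

# With exact gradients, shrinking the step size raises the expected
# acceptance statistic -- the monotone relationship adaptation depends on.
acc_coarse = mean_accept(eps=1.5)
acc_fine = mean_accept(eps=0.2)
```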


Thank you for responding! Yes, let’s assume that all other “technicalities” except for invariance of the target distribution hold.

I was indeed worried specifically about the case of models involving adaptive ODE-solvers, for which potentially all of my above three concerns are relevant. I.e. generally, for adaptive ODE-solvers,

  • the target PDF and its gradient have discontinuities and
  • the vector field used in (Stan’s or anyone’s) HMC/NUTS to accelerate the fictitious particle is generally
    • not the gradient of the target PDF and
    • not (guaranteed to be) even locally a gradient field.

I’ve been wondering whether, with all of these issues, we can actually “guarantee” (with the usual caveats) that inference gives reasonable estimates as long as Stan appears to sample efficiently. I’m guessing I’ll have to do some more research; thank you for the pointers!

In case anyone has the same questions: what is still missing is a proper reference which shows that, if the leapfrog integrator is used (as e.g. by Stan), reversibility and volume preservation are not lost due to “gradient mismatches”, i.e. if the “vector field used in (Stan’s or anyone’s) HMC/NUTS to accelerate the fictitious particle” is neither the gradient of the target PDF nor even a gradient field.

AFAIK, reversibility and volume preservation are the main ingredients needed to show detailed balance for HMC (and, I presume, also for NUTS), but I have not looked too carefully at this myself.

Discontinuities in the PDF or in its gradient field/the vector field should also not matter for reversibility and volume preservation, as the discretized dynamics don’t see these discontinuities.
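This much is easy to check numerically. The sketch below (illustrative names; the linear vector field `v` is deliberately chosen to be non-gradient, since its Jacobian is not symmetric) runs leapfrog with a force that is not any function's gradient and verifies both reversibility and volume preservation:

```python
import numpy as np

def v(q):
    # Not a gradient: d v[0]/d q[1] = 1 but d v[1]/d q[0] = -1.
    return np.array([q[0] + q[1], -q[0] + q[1]])

def leapfrog(q, p, eps, n):
    q = np.array(q, dtype=float)
    p = np.array(p, dtype=float)
    p -= 0.5 * eps * v(q)
    for _ in range(n - 1):
        q += eps * p
        p -= eps * v(q)
    q += eps * p
    p -= 0.5 * eps * v(q)
    return q, p

q0 = np.array([0.3, -0.7])
p0 = np.array([1.1, 0.2])
q1, p1 = leapfrog(q0, p0, 0.1, 25)

# Reversibility: flip the momentum, integrate again, flip back.
q2, p2 = leapfrog(q1, -p1, 0.1, 25)
reversibility_error = np.max(np.abs(q2 - q0)) + np.max(np.abs(-p2 - p0))

# Volume preservation: the Jacobian determinant of the map
# (q0, p0) -> (q1, p1) should be 1 (estimated by central differences).
def flow(z):
    q, p = leapfrog(z[:2], z[2:], 0.1, 25)
    return np.concatenate([q, p])

z0, h = np.concatenate([q0, p0]), 1e-5
J = np.column_stack([(flow(z0 + h * e) - flow(z0 - h * e)) / (2 * h)
                     for e in np.eye(4)])
jacobian_det = np.linalg.det(J)
```

Both properties hold because each leapfrog sub-step is a shear with unit Jacobian and is its own time-reversed inverse, regardless of whether the force field is a gradient.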

There might be some other subtlety that I have overlooked so far, I’ll update this thread if I decide to look more closely and find something.

No, one cannot.

Symplectic integrators guarantee reversibility and volume preservation (at least provided that the gradients, however they may be defined, are deterministic along a numerical trajectory). Reversibility and volume preservation then enable state selection procedures that use only the value of the Hamiltonian (for example a Metropolis acceptance procedure between the initial and final states in Hybrid Monte Carlo, or the multinomial scheme of Stan’s dynamic Hamiltonian Monte Carlo sampler). But this only ensures that the target distribution is one stationary distribution of the resulting Markov transition distribution.
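As a sketch of such state selection procedures (toy code, not Stan's actual implementation; names are illustrative), both the Hybrid Monte Carlo accept/reject step and a multinomial selection over the whole trajectory consume only the Hamiltonian values along the trajectory:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_U(q):
    return q  # standard normal target

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

def trajectory(q, p, eps, n):
    # Record every intermediate leapfrog state, not just the endpoint.
    states = [(q, p)]
    for _ in range(n):
        p_half = p - 0.5 * eps * grad_U(q)
        q = q + eps * p_half
        p = p_half - 0.5 * eps * grad_U(q)
        states.append((q, p))
    return states

states = trajectory(1.0, rng.normal(), eps=0.2, n=16)
H = np.array([hamiltonian(q, p) for q, p in states])

# Hybrid Monte Carlo: Metropolis accept/reject between first and last state.
accept_prob = min(1.0, np.exp(H[0] - H[-1]))

# Multinomial scheme: sample a state from the whole trajectory with
# weight proportional to exp(-H).
weights = np.exp(-(H - H.min()))
weights /= weights.sum()
index = rng.choice(len(states), p=weights)
```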

That is not enough to ensure that Markov chains converge to that stationary distribution asymptotically, let alone preasymptotically. As I discussed above, the preasymptotic performance of Hamiltonian Monte Carlo that is so celebrated depends on how well the gradients match the Hamiltonian used in the state selection scheme.

Again, messing around with the gradients deterministically does not obstruct the right stationary distribution, but it does obstruct the finite-time performance of Hamiltonian Monte Carlo. In order to track the posterior the sampler needs to be able to account for these discontinuities.

Approximate gradients based on numerical solvers can be used successfully with Hamiltonian Monte Carlo when those approximations are sufficiently accurate, which cannot be taken for granted.