We’re currently tuning an LPDF where calculating the (manually coded) derivatives is quite computationally expensive (it involves numerical integration). So far I haven’t noticed a real difference in the results when setting the integration precision to 1e-2 compared to 1e-5, only in the computation time. In particular, the acceptance rate and convergence look fine.
Are there any non-obvious tradeoffs to using less precise derivatives? My broad understanding of HMC is that the gradients are merely used to guide the sampler; is this correct? Is it then correct to assume that computing them less precisely might only degrade this guiding mechanism (perhaps resulting in a lower acceptance rate), or can it also affect the validity of the results (for example, by affecting ergodicity)?
AFAIK, as long as Stan/HMC/NUTS samples efficiently you should not have to worry more than usual. However, I’m not 100% sure about that!
I’d also presume that one could come up with some example where Stan’s diagnostics do not complain (high average acceptance rate, no divergences, low rhat, high ESS etc), but your inference is “invalid” (e.g. biased) due in part to the inexact gradients.
Maybe this thread and the references linked therein can help you as well: What (exactly) happens for only piecewise continuous posterior densities, gradients or wrong gradients?
That’s right. The way to understand HMC is that it’s a Gibbs step updating the momentum followed by a Metropolis step updating both position and momentum. That second Metropolis step makes a deterministic proposal using the leapfrog algorithm to simulate the Hamiltonian dynamics for a fixed integration time. The leapfrog algorithm is really good at preserving the Hamiltonian over long trajectories, so it’s well suited for making proposals with high acceptance rates. Perhaps counterintuitively, it doesn’t provide great solutions to Hamilton’s ODE.
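To make that concrete, here’s a minimal sketch (plain Python, not Stan’s implementation) of one leapfrog trajectory for a standard-normal target, with U(q) = q²/2 and Hamiltonian H(q, p) = U(q) + p²/2. The step size and step count are arbitrary illustration values; the point is that the Hamiltonian barely drifts even over a fairly long trajectory.

```python
# Toy leapfrog integrator for a 1-D standard-normal target.
# U(q) = q^2 / 2, so grad_U(q) = q; H(q, p) = U(q) + p^2 / 2.

def grad_U(q):
    return q  # gradient of the negative log density of N(0, 1)

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

def leapfrog(q, p, eps, n_steps):
    p -= 0.5 * eps * grad_U(q)       # initial half step for momentum
    for _ in range(n_steps - 1):
        q += eps * p                 # full step for position
        p -= eps * grad_U(q)         # full step for momentum
    q += eps * p                     # last full position step
    p -= 0.5 * eps * grad_U(q)       # final half step for momentum
    return q, p

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, eps=0.1, n_steps=50)
energy_error = abs(hamiltonian(q1, p1) - hamiltonian(q0, p0))
# (q1, p1) is far from (q0, p0), but energy_error stays tiny.
```

The proposal moves a long way through (q, p) space, yet the energy error at the endpoint remains small — which is exactly what keeps the acceptance rate high.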
Given that it’s just Metropolis, you have a lot of latitude in formulating the proposal. The leapfrog update is reversible, so there’s no Hastings-style correction needed.
All of our gradients are imprecise to some extent because they’re being evaluated with floating point arithmetic. The question is really just how much imprecision you can get away with in the trajectory while still preserving the Hamiltonian approximately. If that gets too bad, the acceptance rate will be too low to be useful.
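A tiny numeric illustration of that last point (not Stan code): the Metropolis correction at the end of a trajectory accepts the proposal with probability min(1, exp(−ΔH)), where ΔH is the energy error accumulated by the integrator, including whatever the gradient imprecision contributes. Small errors cost almost nothing; large ones collapse the acceptance rate.

```python
import math

def accept_prob(delta_H):
    """Metropolis acceptance probability for an energy change of delta_H."""
    return min(1.0, math.exp(-delta_H))

for dH in (0.001, 0.1, 1.0, 5.0):
    print(f"energy error {dH:5.3f} -> acceptance probability {accept_prob(dH):.3f}")
```

An energy error of 0.001 is essentially free, while an error of 5 drives the acceptance probability below 1%.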
One has to be careful when discussing performance and validity. For most implementations of Hamiltonian Monte Carlo, reasonable errors in the gradient evaluations will not affect the asymptotic performance of Markov chain Monte Carlo (what happens after an infinite number of iterations), but they can very strongly affect the preasymptotic performance (what happens after a finite number of iterations, which is what we’re limited to in practice).
The preasymptotic performance of Hamiltonian Monte Carlo relies on the consistency between the target density function and its gradient. This is what allows for long numerical Hamiltonian trajectories that rapidly explore the target distribution without straying away into unimportant neighborhoods. How consistent the target density function and its gradient need to be will depend on the specific details of any application.
Fortunately there is an empirical diagnostic, albeit a subtle one, that you can check to see if inaccurate gradient evaluations might be causing problems. Stan’s adaptation of the integrator step size assumes that the gradient and the target density are compatible. If everything is playing well together then decreasing the integrator step size should increase the average adaptation statistic (it’s not actually a Metropolis acceptance rate because Stan’s Hamiltonian Monte Carlo sampler is not based on the Metropolis method).
Erroneous gradient evaluations typically manifest in a situation where decreasing the integrator step size does not increase the average adaptation statistic. This often leaves Stan’s adaptation in a weird state where the step size is forced to a really small value, because the adaptation keeps trying to go smaller and smaller in the expectation that the adaptation statistic will eventually start to increase again. These problems can be much harder to catch in other Hamiltonian Monte Carlo tools that don’t use this kind of step size adaptation.
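Here’s a hypothetical 1-D illustration of that diagnostic (toy Python, not Stan’s adaptation): with a gradient that is consistent with the density, shrinking the step size drives the average acceptance statistic toward 1, whereas with a systematically wrong gradient it plateaus below 1 no matter how small the step size gets. The target, integration time, and the particular “wrong” gradient are all made up for the sketch.

```python
import math
import numpy as np

def avg_accept(grad, eps, T=2.0, n_draws=200, seed=1):
    """Average Metropolis acceptance over draws from a N(0,1) target,
    using leapfrog with the supplied gradient and a fixed integration time T."""
    rng = np.random.default_rng(seed)
    n_steps = max(1, int(round(T / eps)))
    total = 0.0
    for _ in range(n_draws):
        q = rng.normal()              # position drawn from the N(0,1) target
        p = rng.normal()              # Gibbs momentum refresh
        H0 = 0.5 * q**2 + 0.5 * p**2
        p -= 0.5 * eps * grad(q)      # leapfrog trajectory
        for _ in range(n_steps - 1):
            q += eps * p
            p -= eps * grad(q)
        q += eps * p
        p -= 0.5 * eps * grad(q)
        H1 = 0.5 * q**2 + 0.5 * p**2
        total += min(1.0, math.exp(H0 - H1))
    return total / n_draws

exact = lambda q: q        # correct gradient of -log N(0,1) density
wrong = lambda q: 0.5 * q  # systematically inconsistent gradient

# With `exact`, acceptance climbs toward 1 as eps shrinks; with `wrong`,
# the trajectory follows the wrong flow, so the energy error (and hence
# the acceptance shortfall) persists even as eps -> 0.
```

With the wrong gradient, no step size is small enough: the leftover energy error comes from the mismatch between the density and the gradient, not from the discretization, which is exactly the signature described above.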
If you don’t see this kind of behavior then your approximate gradient may be sufficient for your problem. That said, make sure to keep an eye on all of the other sampler diagnostics.