Relative delta in finite difference approximations of derivatives?

I’m using finite differences to automatically test autodiff vs. double-based implementations. I’m running into problems with the functional stan::math::finite_diff_gradient (in stan/math/prim/mat/functor/finite_diff_gradient.hpp) when inputs are small or large.

Our finite differences algorithm uses a default epsilon of 1e-3 and evaluates f(x) at x, x +/- epsilon, x +/- 2 * epsilon, and x +/- 3 * epsilon. I can configure the epsilon per call, but I’d rather have something more automatic so that the tests remain simple.

QUESTION 1: Would it make sense to use an epsilon for finite differences that is defined relative to the input value x, say something like epsilon * abs(x)?

QUESTION 2: Would it make sense to build that directly into the finite differences functionals (there’s also the Hessian and gradient of Hessian), or should I do it from the test framework? There’s the issue of backward compatibility if the meaning of epsilon changes from absolute to relative.

QUESTION 3: (Extra credit) Any hints about evaluating at x = 0? What I do now is just default to an absolute error test; all the other tests use relative error to compare autodiff gradients and finite difference gradients.

No, otherwise the error would be a function increasing with |x|.

Why x=0 deserves a special treatment? Can’t we also compare autodiff against finite diff?

At any x the minimum step size has to be at least such that both x+\epsilon and f(x+\epsilon) are representable. There is a function std::nextafter which “Returns the next representable value of from in the direction of to.”. This doesn’t yet guarantee that intermediate computations don’t fail due to the lack pf precision. Maybe the largest value from nextafter for x and f(x) multiplied by something to avoid lack of precision in intermediate steps?

Wikipedia has an answer to all the questions

Practically 1e-7 usually works well for not extreme values.

1 Like

In practice I have always used 1e-6 which I found to be necessary and sufficient in most cases, in line with the advice in the Wikipedia article.

Regarding the error analysis, what we really want to know is the error in the finite difference approximation, which for a second-order central difference I believe scales as the third-derivative of the target function,

\widehat{\frac{ \partial f}{\partial x}(x_{0})} - \frac{ \partial f}{\partial x}(x_{0}) = \frac{ \partial^{3} f}{\partial x^{3}}(x_{0})

The difficulty is then in estimating the value of that fourth-derivative at the input point, which is often nothing like the value of the input itself especially for inputs near zero.

One heuristic would be to add an additive offset to better approximate the fourth derivative,

\frac{ \partial^{3} f}{\partial x^{3}}(x_{0}) \approx \alpha \vert\vert x_{0} \vert\vert + \beta

Another would be to use a finite-difference approximation of the third derivative itself. The interesting thing about this approach is that it gives you some handle on the scaling – by changing \epsilon and comparing to the third derivative estimate you can see if you’re getting a more precise estimate (which is good) or a flat estimate (which indicates floating point problems).

Sadly my numerical methods notes are at my parents’ place so I can’t speak to a few there tricks that I vaguely remember but can’t find references for.

That’s actually what I want. I’m evaluating relative error

\textrm{err}(u, v) = \frac{\displaystyle u - v}{\displaystyle 0.5 \cdot (|u| + |v|)}.

If x = 10^{10}, I want different error than x = 10^{-10}.

Thanks! They need to consolidate the Wikipedia pages around this!

I found the finite difference page approximation page, which as a finite difference derivative approximation section that doesn’t link to the page @Daniel_Simpson found.

The simple recommendation is to evaluate f(x) and f(x + h) with h = \sqrt{\epsilon} \, x, where \epsilon is the machine precision, whihc is roughly \epsilon \approx 10^{-16} for double-precision, so \sqrt \epsilon \approx 10^{-8}. The more complex technique uses second derivatives to choose a better h, but I don’t think we need to be that extreme.

Obviously, this won’t work when x = 0, hence my question 3.

I would do something like max(epsilon, epsilon*|x|)

I was using x == 0 ? epsilon : epsilon * fabs(x).

Specifically, if x = 1e-20, I want the difference used to be 1e-28, I think. It’s just that when it’s zero, I can’t make it zero. I could make it really small, though, like 1e-20, but then I’d worry about overflow/underflow issues.

Oh, I thought you were referring the error in finite difference formula…

I don’t think you want the difference to be 1e-28 for most functions. If f'(x) < C for some x in a neighbhourhood of zero (which will basically always happen for functions that we care about in Stan), then |f(1e-28) - f(0)| \leq C * 1e-28 which will almost definitely lead to underflow.

I looked at what numerical recipes in C says about what to do near zero and it suggest s replacing x with a “characteristic scale”. Basically, the argument is that
h \approx \sqrt{\epsilon_m\frac{f(x)}{f''(x)}} \approx \sqrt{\epsilon_m} x_c
where x_c = (f(x)/f''(x))^{1/2} \approx x where that last approximation is more for convenience than anything else, but holds esxactly when f(x) = x^2 is quadratic (which is locally true). Near zero it is suggested that x is replaced with a “characteristic scale” over which the function changes. So the option above isn’t great. 1e3 or 1e6 as a default minimum characteristic scale makes sense to me, whereas x == 0 ? epsilon : epsilon * fabs(x) is probably asking for underflow

I asked and I received. Dan was right. My original version underflows for small x. No wonder all the numerical analysis people are so worked up about subtraction! It was worth dusting off Numerical Recipes as they have a useful decomposition of error into rounding and approximation error. Dan already included all the relevant conclusions.

I wound up using h = \epsilon^{1/3} * \max(1, |x_n|) for the stepsize for differencing for input dimension x_n (where \epsilon is the machine precision, or about 10^{-16} for the double-precision floating point we’re using). I took the cube root because it was recommended for the symmetric case of evaluating f(x + h) and f(x - h).

The de Levie paper has a section on choosing step size which seems to look like what Numerical Recipes is saying, but I’m not a native speaker of applied math(s).