I’m using finite differences to automatically test autodiff vs. double-based implementations. I’m running into problems with the functional stan::math::finite_diff_gradient (in stan/math/prim/mat/functor/finite_diff_gradient.hpp) when inputs are small or large.
Our finite differences algorithm uses a default epsilon of 1e-3 and evaluates f(x) at x, x +/- epsilon, x +/- 2 * epsilon, and x +/- 3 * epsilon. I can configure the epsilon per call, but I’d rather have something more automatic so that the tests remain simple.
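Presumably that corresponds to the usual sixth-order central stencil; I haven’t double-checked the coefficients in the header, so treat this as a sketch rather than the actual code in finite_diff_gradient.hpp:

```cpp
#include <functional>

// Sixth-order central-difference approximation to f'(x), using the
// evaluation points x +/- h, x +/- 2h, x +/- 3h described above.
// Sketch only; not the code in finite_diff_gradient.hpp.
double central_diff_6(const std::function<double(double)>& f, double x,
                      double h = 1e-3) {
  return (-f(x - 3 * h) + 9 * f(x - 2 * h) - 45 * f(x - h)
          + 45 * f(x + h) - 9 * f(x + 2 * h) + f(x + 3 * h))
         / (60 * h);
}
```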
QUESTION 1: Would it make sense to use an epsilon for finite differences that is defined relative to the input value x, say something like epsilon * abs(x)?
QUESTION 2: Would it make sense to build that directly into the finite differences functionals (there’s also the Hessian and gradient of Hessian), or should I do it from the test framework? There’s the issue of backward compatibility if the meaning of epsilon changes from absolute to relative.
QUESTION 3: (Extra credit) Any hints about evaluating at x = 0? What I do now is just default to an absolute error test; all the other tests use relative error to compare autodiff gradients and finite difference gradients.
At any x, the minimum step size has to be at least large enough that both x + \epsilon and f(x + \epsilon) are representable (and distinct from x and f(x)). There is a function std::nextafter which “Returns the next representable value of from in the direction of to.” That still doesn’t guarantee that intermediate computations don’t fail due to lack of precision. Maybe take the larger of the nextafter steps for x and f(x), multiplied by something, to leave room for lost precision in intermediate steps?
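For concreteness, the one-ULP step that std::nextafter gives you near a small input is tiny, far smaller than any sensible finite-difference step (a standalone check, not Stan code):

```cpp
#include <cmath>
#include <iostream>

int main() {
  double x = 1e-20;
  // Distance to the next representable double above x: one ULP at x.
  double one_ulp = std::nextafter(x, 1.0) - x;
  std::cout << one_ulp << "\n";  // roughly 1.5e-36 for x = 1e-20
  // Any finite-difference step h has to be much larger than this, otherwise
  // x + h rounds back to x and the difference quotient is 0.
}
```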
In practice I have always used 1e-6, which I have found to be necessary and sufficient in most cases, in line with the advice in the Wikipedia article.
Regarding the error analysis, what we really want to know is the error in the finite difference approximation, which for a second-order central difference I believe scales as the third derivative of the target function: the truncation error is roughly \frac{h^2}{6} f'''(x).
The difficulty is then in estimating the value of that third derivative at the input point, which is often nothing like the value of the input itself, especially for inputs near zero.
One heuristic would be to add an additive offset to better approximate the third derivative.
Another would be to use a finite-difference approximation of the third derivative itself. The interesting thing about this approach is that it gives you some handle on the scaling – by changing \epsilon and comparing to the third derivative estimate you can see if you’re getting a more precise estimate (which is good) or a flat estimate (which indicates floating point problems).
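A sketch of that second heuristic (my own code, not anything in Stan): estimate f''' with a central difference and watch how the estimate behaves as the step shrinks.

```cpp
#include <cmath>
#include <initializer_list>
#include <iostream>

// Central-difference estimate of the third derivative:
// f'''(x) ~ (f(x + 2h) - 2 f(x + h) + 2 f(x - h) - f(x - 2h)) / (2 h^3)
template <typename F>
double third_deriv_est(const F& f, double x, double h) {
  return (f(x + 2 * h) - 2 * f(x + h) + 2 * f(x - h) - f(x - 2 * h))
         / (2 * h * h * h);
}

int main() {
  auto f = [](double x) { return std::exp(x); };  // f''' = exp, easy to check
  // If shrinking h moves the estimate smoothly toward a limit, truncation
  // error dominates; if the estimates bounce around, rounding error dominates
  // and h is already too small.
  for (double h : {1e-1, 1e-2, 1e-3, 1e-4}) {
    std::cout << "h = " << h << ": " << third_deriv_est(f, 1.0, h) << "\n";
  }
}
```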
Sadly my numerical methods notes are at my parents’ place, so I can’t speak to a few other tricks that I vaguely remember but can’t find references for.
The simple recommendation is to evaluate f(x) and f(x + h) with h = \sqrt{\epsilon} \, x, where \epsilon is the machine precision, which is roughly \epsilon \approx 10^{-16} for double precision, so \sqrt{\epsilon} \approx 10^{-8}. The more complex technique uses second derivatives to choose a better h, but I don’t think we need to be that extreme.
Obviously, this won’t work when x = 0, hence my question 3.
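As a one-liner (a hypothetical helper, just to make the failure mode explicit):

```cpp
#include <cmath>
#include <limits>

// h = sqrt(machine epsilon) * x, i.e. roughly 1e-8 * x in double precision.
inline double step_size(double x) {
  return std::sqrt(std::numeric_limits<double>::epsilon()) * x;
}
// step_size(0.0) == 0.0, which is exactly the x = 0 problem from question 3.
```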
Specifically, if x = 1e-20, I want the difference used to be 1e-28, I think. It’s just that when x is zero, I can’t make the step zero. I could make it really small, though, like 1e-20, but then I’d worry about overflow/underflow issues.
I don’t think you want the difference to be 1e-28 for most functions. If |f'(x)| < C for all x in a neighbourhood of zero (which will basically always happen for functions that we care about in Stan), then |f(1e-28) - f(0)| \leq C * 1e-28, which will almost certainly lead to underflow.
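A concrete instance of that (my example, using exp, not from the discussion above):

```cpp
#include <cmath>
#include <iostream>

int main() {
  // Forward difference of exp at x = 0 with h = 1e-28.
  double h = 1e-28;
  // 1 + 1e-28 is not representable in double (machine epsilon ~ 2.2e-16),
  // so std::exp(h) rounds to exactly 1.0 and the numerator is exactly 0.
  double fd = (std::exp(h) - std::exp(0.0)) / h;
  std::cout << fd << "\n";  // prints 0; the true derivative is 1
}
```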
I looked at what Numerical Recipes in C says about what to do near zero, and it suggests replacing x with a “characteristic scale”. Basically, the argument is that h \approx \sqrt{\epsilon_m \frac{f(x)}{f''(x)}} \approx \sqrt{\epsilon_m} \, x_c,
where x_c = (f(x)/f''(x))^{1/2} \approx x. That last approximation is more for convenience than anything else, but it holds (up to a constant) when f(x) = x^2 is quadratic, which is locally true. Near zero, where x itself is useless, the suggestion is to use a “characteristic scale” over which the function changes, so the option above isn’t great. 1e3 or 1e6 as a default minimum characteristic scale makes sense to me, whereas x == 0 ? epsilon : epsilon * fabs(x) is probably asking for underflow.
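In code the heuristic would look something like this (a sketch; x_c_min is a made-up name for whatever default characteristic-scale floor we settle on):

```cpp
#include <cmath>
#include <limits>

// Numerical-Recipes-style step size: sqrt(machine epsilon) times a
// characteristic scale, floored at x_c_min so it never collapses near zero.
// Sketch only, not proposed code.
inline double nr_step_size(double x, double x_c_min) {
  return std::sqrt(std::numeric_limits<double>::epsilon())
         * std::fmax(x_c_min, std::fabs(x));
}
```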
I asked and I received. Dan was right. My original version underflows for small x. No wonder all the numerical analysis people are so worked up about subtraction! It was worth dusting off Numerical Recipes as they have a useful decomposition of error into rounding and approximation error. Dan already included all the relevant conclusions.
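Spelling that decomposition out (for a plain second-order central difference; the constants differ for the six-point stencil but the scaling doesn’t):

\mathrm{err}(h) \approx \frac{\epsilon_m |f(x)|}{h} + \frac{h^2}{6} |f'''(x)|, \qquad h_{\mathrm{opt}} = \left( \frac{3 \, \epsilon_m |f(x)|}{|f'''(x)|} \right)^{1/3} \propto \epsilon_m^{1/3},

which is where the cube root below comes from.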
I wound up using h = \epsilon^{1/3} * \max(1, |x_n|) for the stepsize for differencing for input dimension x_n (where \epsilon is the machine precision, or about 10^{-16} for the double-precision floating point we’re using). I took the cube root because it was recommended for the symmetric case of evaluating f(x + h) and f(x - h).
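In code that’s roughly (a sketch of the rule, not the actual implementation):

```cpp
#include <cmath>
#include <limits>

// Per-dimension step size: h_n = epsilon^(1/3) * max(1, |x_n|),
// with epsilon the double-precision machine precision.
inline double stepsize(double x_n) {
  return std::cbrt(std::numeric_limits<double>::epsilon())
         * std::fmax(1.0, std::fabs(x_n));
}
// cbrt(2.2e-16) is about 6e-6, so h is ~6e-6 for |x_n| <= 1 and scales
// linearly with |x_n| above that.
```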
The de Levie paper has a section on choosing step size which seems to look like what Numerical Recipes is saying, but I’m not a native speaker of applied math(s).
Really fascinating idea: two adjacent floats, when interpreted as integers, are also adjacent integers. This means you can essentially cast[0] a float to an int and subtract the two to see how many floats lie between two numbers. That count is the number of units in the last place (“ULPs”) by which they differ, which seems like a great way to think about floating-point error. There’s a lot more detail in that post, highly recommended for folks dealing with this.
[0] I don’t think an actual cast works because the compiler will convert the value rather than reinterpret the bits. They use a trick with a C union for this.
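For example (using memcpy instead of the union trick, since that’s the well-defined way to reinterpret bits in C++; same idea):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <iostream>

// Number of representable doubles between a and b.  Assumes both are finite
// and have the same sign; the post linked above handles the general cases.
std::int64_t ulp_distance(double a, double b) {
  std::int64_t ia, ib;
  std::memcpy(&ia, &a, sizeof a);  // reinterpret the bit patterns as integers
  std::memcpy(&ib, &b, sizeof b);
  return ib > ia ? ib - ia : ia - ib;
}

int main() {
  // Adjacent doubles differ by exactly one ULP.
  std::cout << ulp_distance(1.0, std::nextafter(1.0, 2.0)) << "\n";  // 1
}
```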
But if you continue to the conclusion, the recommendation is to use ULP or relative checks for values away from zero and absolute checks near zero, so I don’t know whether introducing ULPs improves anything beyond the relative/absolute strategy the conversation had already converged on.
He actually says that comparing with 0 is a special case, not “values near zero”:
If you are comparing against zero, then relative epsilons and ULPs based comparisons are usually meaningless. You’ll need to use an absolute epsilon, whose value might be some small multiple of FLT_EPSILON and the inputs to your calculation. Maybe.
If you are comparing against a non-zero number then relative epsilons or ULPs based comparisons are probably what you want. You’ll probably want some small multiple of FLT_EPSILON for your relative epsilon, or some small number of ULPs. An absolute epsilon could be used if you knew exactly what number you were comparing against.
He literally says “near zero”: “It turns out that the entire idea of relative epsilons breaks down near zero.”
Following his arguments the problem of comparing to zero is similar to the problem of comparing the differences of two small numbers, and the need for absolute checks remains.
This technique is necessary any time you are expecting an answer of zero due to subtraction.
I guess with finite differences, we expect answers of literally zero sometimes, so yeah I agree that we need a mix of an absolute and relative check. I came into this thinking that a ULP check is obviously a lot cleaner than the h = \epsilon^{1/3} * \max(1, |x_n|) heuristic, but I didn’t mean to be contentious with that. Happy either way.
This is exactly what I’m doing in expect_near_rel. I came to this the hard way then learned it was the industry standard :-)
Technically that’s right. The case where one argument is zero degenerates. And that’s how I implemented it. I provide arguments in terms of relative difference, but that reduces to the ULP scale once you rescale for relative diff.
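In other words, something like this (a sketch of the idea, not the actual expect_near_rel code; the tolerance value is a placeholder):

```cpp
#include <cmath>

// Mixed comparison: relative tolerance when the values are away from zero,
// falling back to an absolute tolerance when both are essentially zero.
// Sketch only; not the real expect_near_rel.
bool near_rel(double a, double b, double tol = 1e-8) {
  double scale = std::fmax(std::fabs(a), std::fabs(b));
  if (scale < tol)                          // both (near) zero: absolute check
    return std::fabs(a - b) < tol;
  return std::fabs(a - b) / scale < tol;    // otherwise: relative check
}
```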
But I’m honestly not sure how much we should care about relative error for numbers near but not exactly at zero.
Exactly my experience. :-)
Only in linear functions :-) Otherwise, we’re approximating curvature with a step and don’t expect finite diffs and real derivatives to be exact. We only expect about half machine precision with finite diffs.