When I hear about adding derivatives to stan-math it’s typically for speed. Is it also the case that implementing derivatives for functions can be more stable than autodiff?
Yes. See, for example, the “GLM” density functions like
bernoulli_logit. These functions compose a probability mass function with the inverse link function, which allows various cancellations of large intermediate terms that would otherwise risk floating point overflow and underflow, both in the evaluation of the function and of its gradients.
This isn’t so much particular to autodiff, however, as it is a general benefit of implementing function compositions directly.
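To make the bernoulli_logit point concrete, here is a small sketch in Python rather than Stan’s C++ (the function names are mine, not Stan’s; Stan’s actual implementation lives in stan-math). The naive version materializes the sigmoid as an intermediate value and loses the tail; the fused version cancels the large terms analytically:

```python
import math

def log_sigmoid_naive(x):
    # Compose the pieces separately: evaluate sigmoid(x), then take its log.
    # The intermediate sigmoid(x) rounds to exactly 1.0 for moderately large
    # x, and exp(-x) overflows once x drops below about -745.
    return math.log(1.0 / (1.0 + math.exp(-x)))

def bernoulli_logit_lpmf(y, x):
    # Fused form of log Bernoulli(y | sigmoid(x)): the sigmoid is never
    # materialized, so the large intermediate terms cancel on paper
    # instead of in floating point.
    z = x if y == 1 else -x
    if z > 0.0:
        return -math.log1p(math.exp(-z))
    # Rewritten branch: -log1p(exp(-z)) == z - log1p(exp(z)),
    # which avoids overflow in exp(-z) for very negative z.
    return z - math.log1p(math.exp(z))

# For y = 1, x = 40 the naive composition returns exactly 0.0 (the tail
# probability is rounded away), while the fused form preserves it.
print(log_sigmoid_naive(40.0))        # 0.0
print(bernoulli_logit_lpmf(1, 40.0))  # about -4.25e-18
```

The same structure helps the gradient: d/dx log sigmoid(x) is sigmoid(-x), which a hand-written derivative can evaluate stably in one step, whereas autodiff through the naive composition differentiates each overflow-prone intermediate.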
I found this blog post - introduction to automatic differentiation - from 2013 that says the same thing as you in more words.
Disclaimer: I have not actually personally seen any literature doing numerical error analysis of programs produced by automatic differentiation. The technique does avoid the gross problem of catastrophic cancellation when subtracting f(x) from f(x+dx), but may do poorly in more subtle situations. Consider, for example, summing a set of numbers of varying magnitudes. Generally, the answer is most accurate if one adds the small numbers together first, before adding the result to the big ones, rather than adding small numbers to large haphazardly. But what if the sizes of the perturbations in one’s computation did not correlate with the sizes of the primals? Then AD of a program that does the right thing with the primals will do the wrong thing with the perturbations.
It would be better, in this case, to define the summation function as a primitive (for AD’s purposes). The derivative is also summation, but with the freedom to add the perturbations to each other in a different order. Doing this scalably, however, is an outstanding issue with AD.