F: R^N -> R^M - Jacobian for M >> N - is forward-mode more efficient?


#1

Following up on the question posted here (I can’t reply to that post, because apparently my registration here is too new): http://discourse.mc-stan.org/t/repeated-ad-of-function-with-varying-arguments/105?u=akuz

The question was about computing the gradient of f: R_N to R_1 at many input points and reusing the stack. The answer to that was that it’s not possible.

However, we can also treat the evaluation at many input points M as a multivariate function from R_N to R_M. According to Wikipedia, when M >> N, then forward mode autodiff would be more efficient. https://en.m.wikipedia.org/wiki/Automatic_differentiation

So, is it correct that for the case of computing the gradient at many input points (however, with the same values of parameters wrt which we are differentiating), it would be more efficient to use forward-mode autodiff?


#2

Edit: the scaling described in this post is incorrect. Rather than describing a function f : R^N -> R^M, I was describing f : R^(M x N) -> R^M. Please see @betanalpha’s comment

This will depend on the cost of the reverse mode evaluation and the forward mode evaluation. The cost of computing the gradients using reverse-mode is O(1) of the cost to evaluate the function. So is the cost to computing a single directional derivative using forward-mode.

So, in broad strokes

  • using reverse mode: computing the N x M gradient terms will be O(M) (M passes of reverse mode).
  • using forward mode: computing the N x M gradient terms will be O(N x M).

Coefficients matter in practice, so it’ll depend on what functions you call.


#3

Hi Daniel, thanks for your reply. I’m relatively new to autodiff. Won’t forward-mode complexity be O(N)? I’m trying to compute a derivative of the kernel Gramm matrix in a Gaussian Process wrt a single parameter of the kernel (which has several parameters). So effectively I have the same function (kernel) evaluated at all combinations of two different data points. I thought about treating the kernel as acting not to R, but into R_M where M equals all possible combinations of the data points.


#4

I think you’re thinking about the reverse-mode complexity. Perhaps a little more concretely, suppose we have a three-argument function: f(x, y, z).

In reverse mode, we get this out for O(1) all in one pass:

  • the function value, f(x, y, z)
  • df(x, y, z) / dx
  • df(x, y, z) / dy
  • df(x, y, z) / dz

In forward-mode, we get the function value and a single directional derivative. So this is O(1):

  • the function value, f(x, y, z)
  • df(x, y, z) / dx

In order to also compute df(x, y, z) / dy, you will need to run forward-mode again, which will incur another O(1) cost. And the same for df /dz.


#5

That part I understand :) Would you be able to make a similar complexity estimate when f(x, y, x) -> R_M, where say M=1000?


#6

I think you’re trying to ask what happens when you call the function 1000 times? If so, M = 1000, N = 3.

In reverse-mode, it’ll be 1000 x the cost of computing the reverse-mode gradients. So O(M).
In forward-mode, it’ll be 1000 x 3 x the cost of computing one forward-mode directional derivative, so O(M x N).

Both the cost of computing a forward-mode directional derivative and reverse-mode gradient vector is both O(1), but with different constants out in front.


#7

I think you’ll have to determine empirically whether reverse-mode for computing the gradient vector is faster or slower than 3x computing a forward-mode directional derivative to make that assessment in practice.


#8

Thank you, that’s a very good reply. That’s how i was thinking about it too, but I was puzzled why on the Wikipedia page (linked in my question), they mention that forward mode is more efficient for f: R_N -> R_M when M>>N (at the end of the section called “Forward accumulation”). I know that Wikipedia is often wrong, and I’m just reading on the topic now. But is it possible that in forward-mode for f: R_N -> R_M, the computation would not be like you think it would be? I.e. when we compute one directional gradient w.r.t. first input, we already get the whole row of the Jacobian matrix? Foregive me if what I say doesn’t make sense :)


#9

@syclik is incorrect in the relative performance for forward mode. For a general function f: R^{N} -> R^{M} the cost to compute the full Jacobian will be O(M) for reverse mode and O(N) for forward mode. In other words, for reverse mode you have to sweep backwards M times to handle all of the outputs where as in forward mode you have to sweep forwards N times to handle all of the inputs.

There are also differences in the coefficients. In particular, forward mode doesn’t need to save the entire expression graph in memory like reverse mode does, so it will be a little bit faster per sweep. Another way of thinking about it is that reverse mode actually requires one forward sweep to propagate the function values so that partials can be computed, where as forward mode can do that at the same time it’s propagating the tangents.


#10

Thank you! Is forward-mode now fully implemented in Stan? I’ve seen a ticket on GitHub saying that people are working towards it, but not sure it’s working already or not?


#11

@betanalpha is that true in general? I can see it holding true when the expression graph is static, which should be most realistic cases. I’m probably missing something obvious.


#12

Nope. But we’re happy to take contributions. It’s actually not too difficult to get started with forward-mode implementations of one function at a time.


#13

Yes. It has nothing to do with the structure of the function and hence the internal structure of the expression graph – just the exposed nodes.


#14

Let me give more context to what I’m doing. The f: R^N -> R^M function will be defined as follows. Imagine there is a function g(x_1, …, x_N, z): R^{N+1} -> R, and I have a finite number of values for z: {z_1, …, z_M}. Think about x as parameters, and z as data points. I’m going to construct the function f in such a way, that the output in each dimension m (1<=m<=M) is equal to the corresponding value of g with z fixed to z_m. This would be a way to compute the gradient at many input points at the same time. Which was the original question asked in the other post linked in my original post above.


#15

One other very important detail to keep in mind here: when we talk about relative performance we are talking in terms of the cost of computing the target function. So when we say that reverse mode gradients of a function f:R^{N} -> R cost O(N) we really mean “the cost of computing the reverse mode gradient will be about N function evaluations”. In general there is a constant, so we could more accurately say something like C = alpha * N * C_f, where C_f is the cost of one function evaluation.

This means that there is no free lunch in your case, where you have f: R \times X -> R and what to compute the gradient for various difficult values of the input data x \in X. If you used reverse mode the cost would be C_R = alpha_R * N * C_f = alpha_R * C_f and C_F = alpha_F * M * C_f = alpha_F * C_f for every datum and hence alpha_R * I * C_f and alpha_F * I * C_f for I inputs. What happens for your proposed trick of creating a function f' : R -> R^{I}? Well now the cost will be C_R = alpha_R * I * C_f’ and C_F = alpha_F * I * C_f’. But now C_f’ = I * C_f, so the costs are actually C_R = alpha_R * I * C_f and C_F = alpha_F * I * C_f.

The cost for forwards mode in both cases is the same, and you’ve just made reverse mode worse because of wasted computation. Now in practice alpha_F <~ alpha_R so you might prefer forward mode for these 1-1 functions, but just do forward mode I times instead of trying to cram everything into one function.


#16

See my last post!


#17

Thanks for the detailed explanation. The “trick” I was suggesting doesn’t result in 1-1 functions, because there it more than one parameter. Instead, as described in my previous post, the f: R^N -> R^M function is in fact a collection of M functions R^N -> R, all of which depend on the same parameters in R^N, wrt which I’m differentiating. If I want to treat each of these functions separately, to compute the gradient, and then collect them into Jacobian, then obviously I should use reverse-mode for each of them, because it’s faster than forward-mode per each function R^N -> R. But if I treat them all together as function R^N -> R^M, then presumably the forward-mode would be faster. That was my original question, but now I’m not sure :)


#18

I took 1-1 function from

but it doesn’t matter. The costs will be almost exactly the same if you do M reverse mode gradients on $f : R^{N} -> R$ or one forward mode gradient on $f: R^{N} -> R^{M}$ (not that you can actually do one forward mode – in practice you compute that Jacobian with N forward mode sweeps) once you account for the cost of the function computation itself.


#19

Thank you!!!


#20

+1. Got it.

I don’t think I’m mistaken for computing the same function over multiple inputs, though, which is what I think was asked? Not sure any more.