F: R^N -> R^M - Jacobian for M >> N - is forward-mode more efficient?

I think you’re thinking about the reverse-mode complexity. Perhaps a little more concretely, suppose we have a three-argument function: f(x, y, z).

In reverse mode, we get all of this at O(1) cost, in a single pass:

  • the function value, f(x, y, z)
  • df(x, y, z) / dx
  • df(x, y, z) / dy
  • df(x, y, z) / dz

In forward-mode, we get the function value and a single directional derivative. So this is O(1):

  • the function value, f(x, y, z)
  • df(x, y, z) / dx

In order to also compute df(x, y, z) / dy, you will need to run forward mode again, which will incur another O(1) cost. And the same for df/dz.
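
A minimal sketch of that difference, using Stan Math's var (reverse mode) and fvar (forward mode) scalar types; the example function, the values, and the include path are mine and may need adjusting for your version of Stan Math:

```cpp
// Sketch only: one reverse sweep vs. one forward sweep.
// Header layout varies across Stan Math versions.
#include <stan/math/mix.hpp>
#include <iostream>

int main() {
  using stan::math::var;
  using stan::math::fvar;

  // Reverse mode: a single sweep yields all three partials of f(x, y, z).
  var x = 1.0, y = 2.0, z = 3.0;
  var f = x * y + stan::math::sin(z);  // example: f(x, y, z) = x * y + sin(z)
  f.grad();                            // one reverse sweep
  std::cout << "df/dx = " << x.adj()   // y = 2
            << ", df/dy = " << y.adj() // x = 1
            << ", df/dz = " << z.adj() // cos(3)
            << "\n";
  stan::math::recover_memory();

  // Forward mode: a single sweep yields one directional derivative only.
  fvar<double> xf(1.0, 1.0);  // tangent 1: differentiate w.r.t. x
  fvar<double> yf(2.0, 0.0);
  fvar<double> zf(3.0, 0.0);
  fvar<double> ff = xf * yf + stan::math::sin(zf);
  std::cout << "df/dx = " << ff.d_ << "\n";  // for df/dy, df/dz: reseed and rerun
  return 0;
}
```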

That part I understand :) Would you be able to make a similar complexity estimate when f(x, y, z) maps to R^M, where, say, M = 1000?

I think you’re trying to ask what happens when you call the function 1000 times? If so, M = 1000, N = 3.

In reverse-mode, it’ll be 1000 x the cost of computing the reverse-mode gradients. So O(M).
In forward-mode, it’ll be 1000 x 3 x the cost of computing one forward-mode directional derivative, so O(M x N).

The cost of computing a forward-mode directional derivative and the cost of computing a reverse-mode gradient vector are both O(1), but with different constants out in front.

I think you’ll have to determine empirically whether computing the gradient vector with reverse mode is faster or slower than computing three forward-mode directional derivatives to make that assessment in practice.

Thank you, that’s a very good reply. That’s how I was thinking about it too, but I was puzzled why on the Wikipedia page (linked in my question) they mention that forward mode is more efficient for f: R^N -> R^M when M >> N (at the end of the section called “Forward accumulation”). I know that Wikipedia is often wrong, and I’m just reading up on the topic now. But is it possible that in forward mode for f: R^N -> R^M, the computation would not work the way you describe? I.e. when we compute one directional derivative w.r.t. the first input, do we already get a whole column of the Jacobian matrix, i.e. the derivative of every output with respect to that input? Forgive me if what I say doesn’t make sense :)

@syclik is incorrect about the relative performance of forward mode. For a general function f: R^{N} -> R^{M} the cost to compute the full Jacobian will be O(M) for reverse mode and O(N) for forward mode. In other words, for reverse mode you have to sweep backwards M times to handle all of the outputs, whereas in forward mode you have to sweep forwards N times to handle all of the inputs.

There are also differences in the coefficients. In particular, forward mode doesn’t need to save the entire expression graph in memory like reverse mode does, so it will be a little bit faster per sweep. Another way of thinking about it is that reverse mode actually requires one forward sweep to propagate the function values so that the partials can be computed, whereas forward mode can do that at the same time it’s propagating the tangents.
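
To illustrate the O(N) forward-mode scaling concretely, here is a rough sketch (the example function is hypothetical, and Stan Math's fvar header is assumed as above) that builds the full M x N Jacobian column by column, with one forward sweep per input:

```cpp
// Sketch: full M x N Jacobian via N forward sweeps, one column per sweep.
// Hypothetical example function; assumes Stan Math's fvar<double>.
#include <stan/math/fwd.hpp>
#include <iostream>

// Example f: R^3 -> R^2 (N = 3 inputs, M = 2 outputs).
template <typename T>
Eigen::Matrix<T, Eigen::Dynamic, 1> f(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) {
  Eigen::Matrix<T, Eigen::Dynamic, 1> out(2);
  out << x(0) * x(1), stan::math::sin(x(2)) + x(0);
  return out;
}

int main() {
  using stan::math::fvar;
  const int N = 3, M = 2;
  Eigen::VectorXd x0(N);
  x0 << 1.0, 2.0, 3.0;

  Eigen::MatrixXd J(M, N);
  for (int n = 0; n < N; ++n) {  // one forward sweep per input ...
    Eigen::Matrix<fvar<double>, Eigen::Dynamic, 1> x(N);
    for (int i = 0; i < N; ++i)
      x(i) = fvar<double>(x0(i), i == n ? 1.0 : 0.0);  // seed the n-th tangent
    Eigen::Matrix<fvar<double>, Eigen::Dynamic, 1> fx = f(x);
    for (int m = 0; m < M; ++m)
      J(m, n) = fx(m).d_;  // ... fills an entire column (all M outputs at once)
  }
  std::cout << J << std::endl;  // reverse mode would instead need M sweeps, one per row
  return 0;
}
```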

Thank you! Is forward-mode now fully implemented in Stan? I’ve seen a ticket on GitHub saying that people are working towards it, but I’m not sure whether it’s working yet?

@betanalpha is that true in general? I can see it holding true when the expression graph is static, which should be most realistic cases. I’m probably missing something obvious.

Nope. But we’re happy to take contributions. It’s actually not too difficult to get started with forward-mode implementations of one function at a time.

Yes. It has nothing to do with the structure of the function and hence the internal structure of the expression graph – just the exposed nodes.

Let me give more context to what I’m doing. The f: R^N -> R^M function will be defined as follows. Imagine there is a function g(x_1, …, x_N, z): R^{N+1} -> R, and I have a finite number of values for z: {z_1, …, z_M}. Think of x as parameters and z as data points. I’m going to construct the function f in such a way that its output in each dimension m (1 <= m <= M) is equal to the corresponding value of g with z fixed to z_m. This would be a way to compute the gradient at many data points at the same time, which was the original question asked in the other post linked in my original post above.
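
In case it helps, a bare-bones sketch of that construction (the particular g here is just a placeholder; the idea is that with Stan Math you would instantiate T as var or fvar<double> to differentiate it):

```cpp
// Sketch of the construction above: f_m(x) = g(x, z_m), so f: R^N -> R^M.
// g is a placeholder; T would be double, stan::math::var, or stan::math::fvar<double>.
#include <Eigen/Dense>
#include <vector>

// Hypothetical per-datum function g(x, z): R^{N+1} -> R (x: parameters, z: one datum).
template <typename T>
T g(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x, double z) {
  return x(0) + x(1) * z;  // placeholder for the real model
}

// f stacks g(x, z_m) over the data points z_1, ..., z_M.
template <typename T>
Eigen::Matrix<T, Eigen::Dynamic, 1> f(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x,
                                      const std::vector<double>& z) {
  Eigen::Matrix<T, Eigen::Dynamic, 1> out(z.size());
  for (int m = 0; m < static_cast<int>(z.size()); ++m)
    out(m) = g(x, z[m]);
  return out;
}
```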

One other very important detail to keep in mind here: when we talk about relative performance we are talking in terms of the cost of computing the target function. So when we say that forward-mode gradients of a function f: R^{N} -> R cost O(N), we really mean “the cost of computing the forward-mode gradient will be about N function evaluations”. In general there is a constant, so we could more accurately say something like C = alpha * N * C_f, where C_f is the cost of one function evaluation.

This means that there is no free lunch in your case, where you have f: R \times X -> R and want to compute the gradient for various different values of the input data x \in X. If you used reverse mode the cost per datum would be C_R = alpha_R * C_f (one output, so one reverse sweep), and with forward mode it would be C_F = alpha_F * C_f (one parameter, so one forward sweep), hence alpha_R * I * C_f and alpha_F * I * C_f for I data points. What happens for your proposed trick of creating a function f' : R -> R^{I}? Well, now the costs are C_R = alpha_R * I * C_f' (I outputs, so I reverse sweeps) and C_F = alpha_F * C_f' (still a single input). But C_f' = I * C_f, so the costs are actually C_R = alpha_R * I^2 * C_f and C_F = alpha_F * I * C_f.

The cost for forward mode in both cases is the same, and you’ve just made reverse mode worse because of the wasted computation. Now in practice alpha_F <~ alpha_R, so you might prefer forward mode for these 1-1 functions, but just do forward mode I times instead of trying to cram everything into one function.

See my last post!

Thanks for the detailed explanation. The “trick” I was suggesting doesn’t result in 1-1 functions, because there is more than one parameter. Instead, as described in my previous post, the f: R^N -> R^M function is in fact a collection of M functions R^N -> R, all of which depend on the same parameters in R^N, with respect to which I’m differentiating. If I want to treat each of these functions separately, compute its gradient, and then collect the gradients into the Jacobian, then obviously I should use reverse mode for each of them, because it’s faster than forward mode for a single function R^N -> R. But if I treat them all together as one function R^N -> R^M, then presumably forward mode would be faster. That was my original question, but now I’m not sure :)

I took 1-1 function from

but it doesn’t matter. The costs will be almost exactly the same whether you do M reverse-mode gradients on f: R^{N} -> R or one forward-mode Jacobian of f: R^{N} -> R^{M} (not that you can actually do it in one forward-mode sweep – in practice you compute that Jacobian with N forward-mode sweeps), once you account for the cost of the function computation itself.

Thank you!!!

+1. Got it.

I don’t think I’m mistaken about computing the same function over multiple inputs, though, which is what I think was asked? Not sure any more.

Thanks guys, all clear for now

That’s not what my experiments showed. I profiled, and with the lazy way a lot of the gradients work in Stan, the reverse sweep took about 80% of the computation and the forward pass about 20%. The evaluation’s all done with double values; then the reverse mode is essentially doing interpreted arithmetic. The balance will also vary depending on how heavy the forward computation is. If it involves a single hairy computation that leads to a simple derivative, then the forward cost will be relatively higher.

Stan implements two Jacobian functionals, one that uses reverse mode and one that uses forward mode.

One of the reasons we haven’t been promoting forward mode is that it’s still not fully tested to our satisfaction.
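
For reference, a rough usage sketch of the reverse-mode Jacobian functional; the signature is my reading of stan/math/rev/functor/jacobian.hpp and may differ between versions, and the forward-mode functional has an analogous interface:

```cpp
// Rough usage sketch of stan::math::jacobian (reverse mode); check the header
// in your Stan Math version for the exact signature and the orientation of J.
#include <stan/math.hpp>
#include <iostream>

// The functor must be callable with an Eigen vector of autodiff scalars.
struct my_f {
  template <typename T>
  Eigen::Matrix<T, Eigen::Dynamic, 1>
  operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
    Eigen::Matrix<T, Eigen::Dynamic, 1> out(2);
    out << x(0) * x(1), x(0) + x(2);  // example f: R^3 -> R^2
    return out;
  }
};

int main() {
  Eigen::VectorXd x(3);
  x << 1.0, 2.0, 3.0;
  Eigen::VectorXd fx;
  Eigen::MatrixXd J;
  stan::math::jacobian(my_f(), x, fx, J);  // reverse mode: one sweep per output
  std::cout << "f(x) =\n" << fx << "\nJ =\n" << J << std::endl;
  return 0;
}
```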

I absorbed all of that into alpha_R and alpha_F (where I noted that alpha_F tends to be less than alpha_R in practice) so that I could go back to emphasizing the overall scaling.