F: R^N -> R^M - Jacobian for M >> N - is forward-mode more efficient?

akuz · June 27, 2017, 7:23pm

Thanks guys, all clear for now

Bob_Carpenter · June 27, 2017, 9:43pm

That’s not what my experiments showed. When I profiled, given the lazy way a lot of the gradients work in Stan, the reverse sweep took about 80% of the computation and the forward pass about 20%. The evaluation’s all done with double values, then the reverse mode is essentially doing interpreted arithmetic. The balance will also vary depending on how heavy the forward computation is. If it involves a single hairy computation that leads to a simple derivative, then the forward cost will be relatively higher.

Stan implements two jacobian functionals, one that uses reverse mode and one that uses forward mode.

One of the reasons we haven’t been promoting forward mode is that it’s still not fully tested to our satisfaction.

betanalpha · June 27, 2017, 10:42pm

I absorbed all of that into alpha_R and alpha_F (where I noted that alpha_F tends to be less than alpha_R in practice) so that I could go back to emphasizing the overall scaling.

Bob_Carpenter · June 28, 2017, 4:53am

Yup, that’s the right way to multiply it out. I think the overall message got lost though. If I have a function f : R^N -> R^M, I can compute the Jacobian row-wise or column-wise using either

M reverse-mode passes (one row per pass)
N forward-mode passes (one column per pass)
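To make the pass counting concrete, here is a minimal sketch of forward mode built on dual numbers. It is not how Stan implements it: the `Dual` class, `jacobian_forward`, and the test function `f` are all hypothetical names made up for illustration. Each of the N passes seeds one input with tangent 1 and reads off one column of the Jacobian.

```python
from dataclasses import dataclass


@dataclass
class Dual:
    """Hypothetical dual number: a value plus one tangent (directional derivative)."""
    val: float
    tan: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

    __rmul__ = __mul__


def _lift(x):
    """Promote plain numbers to Dual with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def jacobian_forward(f, x):
    """Build the M x N Jacobian of f at x using N forward passes."""
    n = len(x)
    columns = []
    for j in range(n):  # one forward pass per input dimension
        seeded = [Dual(xi, 1.0 if i == j else 0.0) for i, xi in enumerate(x)]
        columns.append([_lift(y).tan for y in f(seeded)])
    m = len(columns[0])
    # Transpose the list of columns into a list of rows.
    return [[columns[j][i] for j in range(n)] for i in range(m)]


def f(x):
    """Illustrative function R^2 -> R^3 (an assumption, not from the thread)."""
    a, b = x
    return [a * b, a + b, a * a]


J = jacobian_forward(f, [2.0, 3.0])
for row in J:
    print(row)
```

Here N = 2 and M = 3, so the Jacobian takes two forward passes; a reverse-mode version would need three passes, one per output.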

A single forward-mode pass should be faster than a single reverse-mode pass, but how much faster depends on the problem. So the choice of which to use depends on the problem and the relative size of M and N.

Reverse mode is faster only if M << N. If they’re roughly the same size or M is larger, forward-mode should be more efficient. The exact breakeven point will depend on the function being evaluated.
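The breakeven argument can be sketched as a crude cost model, assuming each pass costs a fixed multiple of one function evaluation. The function name `cheaper_mode` is made up, and `alpha_f` and `alpha_r` stand in for the per-pass factors alpha_F and alpha_R mentioned above; real values depend on the function and would have to be measured.

```python
def cheaper_mode(n_inputs, m_outputs, alpha_f=1.0, alpha_r=1.0):
    """Pick the mode with the lower modeled cost for a full Jacobian.

    Assumed model: forward cost ~ alpha_f * N, reverse cost ~ alpha_r * M,
    both in units of one evaluation of f. The alphas are placeholders.
    """
    forward_cost = alpha_f * n_inputs
    reverse_cost = alpha_r * m_outputs
    return "forward" if forward_cost <= reverse_cost else "reverse"


# M >> N: forward mode wins under this model.
print(cheaper_mode(n_inputs=10, m_outputs=1000))
# M << N (e.g. a log density with M = 1): reverse mode wins.
print(cheaper_mode(n_inputs=1000, m_outputs=1))
# M == N: the per-pass factors decide; alpha_r = 3.0 is a made-up value.
print(cheaper_mode(n_inputs=10, m_outputs=10, alpha_f=1.0, alpha_r=3.0))
```

With equal dimensions the outcome hinges entirely on the ratio of the two factors, which is why the exact breakeven point is problem-dependent.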

Forward mode should also use much less memory, so there’s also that consideration. And there’s no thread contention with forward mode, as there’s no global shared object.

syclik · June 29, 2017, 7:57am

I just reread the conversation. I see why I was confused and @betanalpha and @Bob_Carpenter are correct in describing what happens with a function f : R^N -> R^M.

I think what you were describing, using that notation, is evaluating the same function M times, which is a function f: R^(M x N) -> R^M. I was describing the scaling for that function, not what was actually written down.

Sorry about the confusion; I’ll put an edit in the earlier post.

Bob_Carpenter · June 30, 2017, 10:06pm

Same thing holds for doing the same function multiple times. Stan doesn’t give you a way to save a “taped” version and reevaluate. That’s probably for the better because running the tape is like running interpreted code whereas building the tape again is all compiled. The memory doesn’t seem to be such high overhead.
