Thanks guys, all clear for now
That’s not what my experiments showed. When I profiled, given the lazy way a lot of the gradients work in Stan, the reverse sweep took about 80% of the computation and the forward pass about 20%. The evaluation’s all done with double values; the reverse sweep is then essentially doing interpreted arithmetic. The balance will also vary depending on how heavy the forward computation is. If it involves a single hairy computation that leads to a simple derivative, then the forward cost will be relatively higher.
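To make the "forward pass with doubles, reverse sweep as interpreted arithmetic" point concrete, here’s a minimal sketch of tape-based reverse mode. This is a toy, not Stan’s implementation: the `Var` class and `grad` function are hypothetical names for illustration. The forward pass computes ordinary `float` arithmetic while recording (parent, partial) edges; the reverse sweep then walks that record back, which is the interpreted part.

```python
import math

class Var:
    """Toy reverse-mode variable (illustration only, not Stan's var)."""
    def __init__(self, value, parents=()):
        self.value = value       # forward pass: plain double arithmetic
        self.parents = parents   # recorded edges: [(parent, local partial), ...]
        self.adjoint = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def sin(x):
    # Local partial of sin is cos, evaluated with doubles on the way forward.
    return Var(math.sin(x.value), [(x, math.cos(x.value))])

def grad(output, inputs):
    # Topologically order the recorded graph, then run the reverse sweep
    # so each node's adjoint is final before it propagates to its parents.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, partial in node.parents:
            parent.adjoint += node.adjoint * partial
    return [v.adjoint for v in inputs]

x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)           # forward pass with doubles, tape recorded
print(grad(f, [x, y]))       # [y + cos(x), x] = [3 + cos(2), 2]
```

Everything expensive in the forward direction (here, `math.sin` and `math.cos`) runs as compiled double arithmetic; the reverse sweep is just a loop over recorded multiply-adds, which is why it can dominate the profile.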
Stan implements two jacobian functionals, one that uses reverse mode and one that uses forward mode.
One of the reasons we haven’t been promoting forward mode is that it’s still not fully tested to our satisfaction.
I absorbed all of that into alpha_R and alpha_F (where I noted that alpha_F tends to be less than alpha_R in practice) so that I could go back to emphasizing the overall scaling.
Yup, that’s the right way to multiply it out. I think the overall message got lost, though. If I have a function f : R^N -> R^M, I can compute the Jacobian column-wise using forward mode or row-wise using reverse mode. Forward mode should be faster here, but how much faster depends on the problem. So the choice of which to use depends on the problem and the relative sizes of N and M. Reverse mode is faster only if M << N. If they’re roughly the same size or M is larger, forward mode should be more efficient. The exact breakeven point will depend on the function being evaluated.
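The column-wise direction can be sketched with dual numbers. This is a toy, not Stan’s fvar: one forward sweep seeded with the n-th unit tangent produces column n of the Jacobian, so the full Jacobian of f : R^N -> R^M costs N forward sweeps, versus M reverse sweeps row-wise.

```python
class Dual:
    """Toy forward-mode dual number (illustration only)."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, o):
        return Dual(self.value + o.value, self.deriv + o.deriv)
    def __mul__(self, o):
        # Product rule carried along with the value.
        return Dual(self.value * o.value,
                    self.deriv * o.value + self.value * o.deriv)

def f(x):  # example f : R^3 -> R^2
    return [x[0] * x[1], x[1] + x[2]]

def jacobian_fwd(f, x):
    # One sweep per input: seed input n's tangent to 1 to get column n.
    cols = []
    for n in range(len(x)):
        duals = [Dual(v, 1.0 if i == n else 0.0) for i, v in enumerate(x)]
        cols.append([y.deriv for y in f(duals)])
    # Transpose the N columns into M rows.
    return [list(row) for row in zip(*cols)]

print(jacobian_fwd(f, [2.0, 3.0, 4.0]))  # [[3.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
```

Here N = 3 sweeps produce the 2 x 3 Jacobian; reverse mode would do it in M = 2 sweeps, which is why M << N is the regime where reverse wins even with its higher per-sweep constant.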
Forward mode should also use much less memory, so there’s that consideration, too. And there’s no thread contention with forward mode, as there’s no global shared object.
I think what you were describing, using that notation, is evaluating the same function M times, which is a function f : R^(M x N) -> R^M. I was describing the scaling for that function, not what was actually written down.
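A sketch of that mapped function, with hypothetical names: applying the same scalar f : R^N -> R to each of the M rows gives g : R^(M x N) -> R^M, and its Jacobian is block-diagonal, since output i depends only on row i. Finite differences are enough to see the structure here.

```python
def f(row):    # example scalar function, f : R^2 -> R
    return row[0] * row[0] + row[1]

def g(X):      # g : R^(M x 2) -> R^M, one evaluation of f per row
    return [f(row) for row in X]

def fd_jacobian(g, X, h=1e-6):
    # Finite-difference Jacobian over the flattened M*N inputs.
    base = g(X)
    M, N = len(X), len(X[0])
    J = [[0.0] * (M * N) for _ in range(M)]
    for j in range(M):
        for n in range(N):
            Xp = [row[:] for row in X]
            Xp[j][n] += h
            bumped = g(Xp)
            for i in range(M):
                J[i][j * N + n] = (bumped[i] - base[i]) / h
    return J

X = [[1.0, 2.0], [3.0, 4.0]]
J = fd_jacobian(g, X)
# Off-diagonal blocks are (numerically) zero: output i only sees row i.
```

Because of that block structure, the M reverse sweeps each touch an independent slice of the expression graph, which is where the scaling I described comes from.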
Sorry about the confusion; I’ll put an edit in the earlier post.
Same thing holds for doing the same function multiple times. Stan doesn’t give you a way to save a “taped” version and reevaluate it. That’s probably for the better, because running the tape is like running interpreted code, whereas rebuilding the tape is all compiled code. The memory overhead doesn’t seem to be that high.