I am trying to find ways to speed up ODE integration. One bottleneck is the need to do AD on the user-supplied ODE functor really often. Right now this is done using reverse mode AD and I was wondering if forward mode is a good candidate to be faster here. Bear in mind that the ODE functors are usually darn simple algebraic expressions.
Since J_y * y + J_p is needed, forward mode would need to be run a few times in order to get the gradient with respect to each parameter; so forward mode would need to be quite a bit faster than reverse to make up for that.
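(To spell out the shorthand: for dy/dt = f(y, theta, t), the sensitivities s_m = dy/dtheta_m obey the standard forward sensitivity ODE ds_m/dt = J_y * s_m + df/dtheta_m, with J_y = df/dy and m = 1..M; the J_y-times-state products and the J_p columns are what the coupled system has to evaluate at every RHS call.)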
Does someone have a guess here?
Best,
Sebastian
(if this gives a speed bump, the next question is if we can use forward mode on ODE functors any time soon)
If you have N states and M parameters, I think you only need to run forward mode M + 1 times to get the coupled system (M for all the J_p terms and then 1 for the J_y y term).
You need N reverse mode autodiffs to get all this. So I think it depends on your problem.
Typically M is similar to or greater than N in my applications. Anyway, I recall you said that no virtual function calls are involved in forward mode, which makes it faster. This is why I was hoping to get speedups even if we do some extra work.
Maybe I should just try it out once.
EDIT: but there are certainly N >> M cases that I know of. Maybe we should actually make a runtime decision about which one we run (given we find convincing examples).
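Something like this is what I have in mind for the runtime decision (purely hypothetical sketch, names made up):

bool use_forward_mode(int N, int M) {
  // forward mode needs roughly M + 1 sweeps of the ODE RHS,
  // reverse mode needs roughly N, so pick whichever is cheaper
  return M + 1 < N;
}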
Gogogogogo. Where the variable names say var, I actually mean appropriately sized and typed fvar things, lol.
J_y * y (one forward sweep, seeding the tangents with y itself; y_var is assumed to already hold the values of y):
for (int i = 0; i < y.size(); i++) {
  y_var[i].d_ = y[i]; // EDIT ALERT: I added an [i] here on the right-hand side
}
vector<fvar<double> > dy_dt_var = f_(t, y_var, thetavar, x_, x_int_, msgs_);
for (int j = 0; j < y.size(); j++) {
  Jy_times_y(j) = dy_dt_var[j].tangent(); // tangent holds the directional derivative (J_y * y)_j
}
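The same idea should give the J_p columns, one forward sweep per parameter (untested sketch; Jp is just a placeholder for wherever those columns end up, and I reuse the variable names from above):

for (int i = 0; i < y.size(); i++)
  y_var[i].d_ = 0; // zero the state tangents so that only the parameter direction is active
for (int m = 0; m < thetavar.size(); m++) {
  for (int k = 0; k < thetavar.size(); k++)
    thetavar[k].d_ = (k == m) ? 1.0 : 0.0; // seed parameter m only
  vector<fvar<double> > dy_dt_var = f_(t, y_var, thetavar, x_, x_int_, msgs_);
  for (int j = 0; j < y.size(); j++)
    Jp(j, m) = dy_dt_var[j].tangent(); // tangent = d f_j / d theta_m, i.e. column m of J_p
}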
I also mentioned that in reality, our forward mode may be slower than reverse mode in many cases until we put some work into writing more efficient versions of complicated functions.
Forward mode may be slower… but ODE RHS functions are usually very simple algebraic expressions, so I am not sure we lose too much there in terms of speed.
… I am a bit buried in stuff which is why I haven’t tried yet.
If the expressions are simple, then forward should be a lot faster. At least it will be after we get rid of checks for NaN. I haven’t gotten around to that because it’s tied into all the testing.
Almost all tests for the ODE stuff work correctly. The only failures are the error conditions when parameters are wrongly passed to the integrate functions, which do not throw properly; apart from that, the sensitivities appear to be calculated correctly.
This is a very first version and I must say that forward mode is super elegant for these sensitivity calculations. Have a look.
Next I am going to benchmark this thing. I am quite curious about the results (though I am not that experienced with writing fast forward mode code).
The ode stress test from the wiki shows a 6% performance increase with the forward mode code when compared to the current develop. This example has 2 states and 4 parameters. Next I will run a 2cmt PK model (3 states, 5 parameters).
Ok, so here is my quick take (example is a 2cmt model with 3 states + 5 parameters):
- forward mode, the way I have implemented it, is up to 2x slower than what we currently have! The issue, I think, is that for forward mode we evaluate the ODE RHS 3+5 = 8 times while calculating the directional derivatives. Not good.
- the current code is really fast, I must say. It respects memory locality and it implements the Jacobian * adjoint product efficiently, in a way such that the Jacobian is never stored (just each row is stored, one at a time; rough sketch of the pattern after this list). This is also the reason why breaking out the Jacobian calculation is not a good idea: it causes the Jacobian to be allocated, which appears to cost ~20% performance or so. Really amazing… but @Linas… I have coded up a version of coupled_ode_system which should make it easy to plug in analytic Jacobians. I would still like to find a better way of doing it so that we can have a single implementation - but this will be hard.
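Roughly the row-at-a-time pattern I mean (my reading of it, not the actual coupled_ode_system code; N, M, S and dz_dt are placeholder names for the state count, parameter count, current sensitivities and the sensitivity part of the coupled RHS):

stan::math::start_nested();
vector<stan::math::var> y_var(y.begin(), y.end());
vector<stan::math::var> theta_var(theta.begin(), theta.end());
// build the AD tape once by evaluating the RHS with var inputs
vector<stan::math::var> dy_dt_var = f_(t, y_var, theta_var, x_, x_int_, msgs_);
for (int j = 0; j < N; j++) {
  stan::math::set_zero_all_adjoints_nested();
  dy_dt_var[j].grad(); // one reverse sweep: adjoints now hold row j of the Jacobian
  for (int m = 0; m < M; m++) {
    double temp = theta_var[m].adj(); // d f_j / d theta_m
    for (int i = 0; i < N; i++)
      temp += y_var[i].adj() * S(i, m); // (J_y * S)(j, m) without ever storing J_y
    dz_dt(j, m) = temp; // sensitivity part of the coupled RHS
  }
}
stan::math::recover_memory_nested();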
Still, staring a bit at the ODE code I was able to shave off about 3-5% of execution time by avoiding allocation of additional temporaries. Not huge, but still. I will file this as a PR soon.
Maybe my first forward mode implementation was not ideal. Obviously there are no huge gains to be made (at least for the cases I looked at). However, maybe we can still tweak it. The thing is that the reverse mode AD version evaluates the function once; the AD tree is then built and recycled for all the other derivatives. The question is whether we can do this kind of recycling for forward mode as well. I am not yet familiar with forward mode, but recycling whatever AD structure is built seems to be important.
That’s great because our forward mode isn’t very well optimized as it’s written. We need to fix the inefficient constructors and write some custom derivatives that are more efficient than the ones we have.
On the ODE stress test example I am even getting 10% speed gains. Feel free to review:
To make forward mode work well we need to be able to reuse the expression graph, if that's possible. Reuse of the established AD tree is a key advantage of reverse mode (it avoids repeated ODE RHS evaluations).
Forward mode doesn’t compute an expression graph, so there’s nothing to reuse. Overall, the reuse of the expression graph may not be such a big win in other cases, given that the forward pass of reverse mode (which builds the graph) is usually faster than the reverse pass because it’s compiled rather than interpreted (but it does pay the memory overhead).
Also, we don’t want to roll out anything relying on forward mode until we test it more thoroughly. @bbbales2 has a pending math issue I need to review that’s the first step toward a proper autodiff testing framework.