I am trying to find ways to speed up ODE integration. One bottleneck is the need to do AD on the user-supplied ODE functor really often. Right now this is done using reverse mode AD and I was wondering if forward mode is a good candidate to be faster here. Bear in mind that the ODE functors are usually darn simple algebraic expressions.
Since J_y * y + J_p is needed, forward mode would need to be run a few times in order to get the gradient with respect to each parameter; so forward mode would need to be quite a bit faster than reverse to make up for that.
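(To spell out the shorthand: for dy/dt = f(y, theta, t), the sensitivities s_m = dy/dtheta_m obey the standard forward sensitivity ODE ds_m/dt = J_y * s_m + df/dtheta_m, with J_y = df/dy and m = 1..M; the J_y-times-state products and the J_p columns are what the coupled system has to evaluate at every RHS call.)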
Does someone have a guess here?
Best,
Sebastian
(if this gives a speed bump, the next question is if we can use forward mode on ODE functors any time soon)
If you have N states and M parameters, I think you only need to run forward mode M + 1 times to get the coupled system (M for all the J_p terms and then 1 for the J_y y term).
You need N reverse mode autodiffs to get all this. So I think it depends on your problem.
Typically M is similar to or greater than N in my applications. Anyway, I recall you said that no virtual function calls are involved in forward mode, which makes it faster. This is why I was hoping to get speedups even if we do some extra work.
Maybe I should just try it out once.
EDIT: but there are certainly N >> M cases that I know of. Maybe we should actually make a runtime decision about which one we run (given we find convincing examples).
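Something like this is what I have in mind for the runtime decision (purely hypothetical sketch, names made up):

bool use_forward_mode(int N, int M) {
  // forward mode needs roughly M + 1 sweeps of the ODE RHS,
  // reverse mode needs roughly N, so pick whichever is cheaper
  return M + 1 < N;
}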
Gogogogogo. Where the variable names say var, I actually mean appropriately sized and typed fvar things, lol.
J_y * y (one forward sweep, seeding the tangents with y itself; y_var is assumed to already hold the values of y):
for (int i = 0; i < y.size(); i++) {
  y_var[i].d_ = y[i]; // EDIT ALERT: I added an [i] here on the right-hand side
}
vector<fvar<double> > dy_dt_var = f_(t, y_var, thetavar, x_, x_int_, msgs_);
for (int j = 0; j < y.size(); j++) {
  Jy_times_y(j) = dy_dt_var[j].tangent(); // tangent holds the directional derivative (J_y * y)_j
}
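The same idea should give the J_p columns, one forward sweep per parameter (untested sketch; Jp is just a placeholder for wherever those columns end up, and I reuse the variable names from above):

for (int i = 0; i < y.size(); i++)
  y_var[i].d_ = 0; // zero the state tangents so that only the parameter direction is active
for (int m = 0; m < thetavar.size(); m++) {
  for (int k = 0; k < thetavar.size(); k++)
    thetavar[k].d_ = (k == m) ? 1.0 : 0.0; // seed parameter m only
  vector<fvar<double> > dy_dt_var = f_(t, y_var, thetavar, x_, x_int_, msgs_);
  for (int j = 0; j < y.size(); j++)
    Jp(j, m) = dy_dt_var[j].tangent(); // tangent = d f_j / d theta_m, i.e. column m of J_p
}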
I also mentioned that in reality, our forward mode may be slower than reverse mode in many cases until we put some work into writing more efficient versions of complicated functions.
Forward mode may be slower… but ODE RHS functions are usually very simple algebraic expressions, so I am not sure we lose too much there in terms of speed.
… I am a bit buried in stuff which is why I haven’t tried yet.
If the expressions are simple, then forward should be a lot faster. At least it will be after we get rid of checks for NaN. I haven’t gotten around to that because it’s tied into all the testing.
Almost all tests for the ODE stuff work correctly. The only failures are the error conditions when parameters are wrongly passed to the integrate functions, which do not throw properly; apart from that, the sensitivities appear to be calculated correctly.
This is a very first version and I must say that forward mode is super elegant for these sensitivity calculations. Have a look.
Next I am going to benchmark this thing. I am quite curious about the results (though I am not that experienced with writing fast forward mode code).
The ode stress test from the wiki shows a 6% performance increase with the forward mode code when compared to the current develop. This example has 2 states and 4 parameters. Next I will run a 2cmt PK model (3 states, 5 parameters).
Ok, so here is my quick take (example is a 2cmt model with 3 states + 5 parameters):
- forward mode, the way I have implemented it, is up to 2x slower than what we currently have! The issue, I think, is that for forward mode we evaluate the ODE RHS 3+5 = 8 times while calculating the directional derivatives. Not good.
- the current code is really fast, I must say. It respects memory locality and it implements the Jacobian * adjoint product efficiently, in a way such that the Jacobian is never stored (just each row is stored, one at a time; rough sketch of the pattern after this list). This is also the reason why breaking out the Jacobian calculation is not a good idea: it causes the Jacobian to be allocated, which appears to cost ~20% performance or so. Really amazing… but @Linas… I have coded up a version of coupled_ode_system which should make it easy to plug in analytic Jacobians. I would still like to find a better way of doing it so that we can have a single implementation - but this will be hard.
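Roughly the row-at-a-time pattern I mean (my reading of it, not the actual coupled_ode_system code; N, M, S and dz_dt are placeholder names for the state count, parameter count, current sensitivities and the sensitivity part of the coupled RHS):

stan::math::start_nested();
vector<stan::math::var> y_var(y.begin(), y.end());
vector<stan::math::var> theta_var(theta.begin(), theta.end());
// build the AD tape once by evaluating the RHS with var inputs
vector<stan::math::var> dy_dt_var = f_(t, y_var, theta_var, x_, x_int_, msgs_);
for (int j = 0; j < N; j++) {
  stan::math::set_zero_all_adjoints_nested();
  dy_dt_var[j].grad(); // one reverse sweep: adjoints now hold row j of the Jacobian
  for (int m = 0; m < M; m++) {
    double temp = theta_var[m].adj(); // d f_j / d theta_m
    for (int i = 0; i < N; i++)
      temp += y_var[i].adj() * S(i, m); // (J_y * S)(j, m) without ever storing J_y
    dz_dt(j, m) = temp; // sensitivity part of the coupled RHS
  }
}
stan::math::recover_memory_nested();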
Still, staring a bit at the ODE code I was able to shave off about 3-5% of execution time by avoiding allocation of additional temporaries. Not huge, but still. I will file this as a PR soon.
Maybe my first forward mode implementation was not ideal. Obviously there are no huge gains to be made (at least for the cases I looked at). However, maybe we can still tweak it. The thing is that the reverse mode AD version evaluates the function once; the AD tree is then built and recycled for all the other derivatives. The question is whether we can do this kind of recycling for forward mode as well. I am not yet familiar with forward mode, but recycling whatever AD structure is built seems to be important.
That’s great because our forward mode isn’t very well optimized as it’s written. We need to fix the inefficient constructors and write some custom derivatives that are more efficient than the ones we have.
On the ODE stress test example I am even getting 10% speed gains. Feel free to review:
To make forward mode work well we need to be able to reuse the expression graph, if that's possible. Reuse of the established AD tree is a key advantage of reverse mode (it avoids repeated ODE RHS evaluations).
Forward mode doesn’t compute an expression graph, so there’s nothing to reuse. Overall, the reuse of the expression graph may not be such a big win in other cases, given that the forward pass of reverse mode (which builds the graph) is usually faster than the reverse pass because it’s compiled rather than interpreted (but it does pay the memory overhead).
Also, we don’t want to roll out anything relying on forward mode until we test it more thoroughly. @bbbales2 has a pending math issue I need to review that’s the first step toward a proper autodiff testing framework.