Repeated AD of function with varying arguments?


#1

Hi!

I was wondering if it would make sense (or is possible) to reuse the AD tree of a function when I change the argument values?

As I understand from the arXiv paper it is much more efficient to reuse the AD tree for a function when I want to get the Jacobian and as such do AD for different outputs while I always have the same inputs.

But shouldn’t it be possible to extend this recycling principle to the case when I want to get the Jacobian of the same function (hence same tree), but for different inputs?

I was looking through the code and found that val_ has the type const double. So is my thought wrong here?

If this could work, then we could potentially get some ODE speedups since we evaluate the Jacobian of the very same function very often. I guess, this could benefit others parts of Stan as well…

Best,
Sebastian

PS: Do such posts belong here? This is more of a stan-math users question…


F: R^N -> R^M - Jacobian for M >> N - is forward-mode more efficient?
#2

wds15 Developer
October 17
Hi!

I was wondering if it would make sense (or is possible) to reuse the AD tree of a function when I change the argument values?

As I understand from the arXiv paper it is much more efficient to reuse the AD tree for a function when I want to get the Jacobian and as such do AD for different outputs while I always have the same inputs.

But shouldn’t it be possible to extend this recycling principle to the case when I want to get the Jacobian of the same function (hence same tree), but for different inputs?

No. The values get built up along with the vari instances
in the forward pass. No way to reuse them. But even if there
was, it’d be less efficient than just building everything again if
all the calls have to be virtual.

I was looking through the code and found that val_ has the type const double. So is my thought wrong here?

Just removing the const isn’t enough—look at how they’re computed.

  • Bob

If this could work, then we could potentially get some ODE speedups since we evaluate the Jacobian of the very same function very often. I guess, this could benefit others parts of Stan as well…

What would be more efficient would be to code generate the function
and derivatives so neither were virtual functions and they could be
reused. If you do that, you have to prohibit any branching behavior
based on parameter values. Otherwise, the expression graph may be
different for different inputs.

Best,
Sebastian

PS: Do such posts belong here? This is more of a stan-math users question…

Either list (users or dev) is OK.

  • Bob

#3

I do not know the Stan reverse AD implementation details, but I guess it is not possible.

Most of the time, reverse AD libraries use operator overloading to record elementary operations g1, g2, g3…

Hence your code to compute a function f:R^n->R^m is equivalent to

f = … g3 ° g2 ° g1.

Using chain rule the differential is

df =… dg3 ° dg2 ° dg1.

However to compute gradients we work with the differential adjoint

df^t = dg1^t ° dg2^t ° dg3^t ° …

The reason is that gradient grad f_j of your function f_j:R^n->R is the j-column of df^t.

Hence using adjoint allows you to compute this vector with the simple matrix-vector product

df^t.e_j

where e_j[k]=1 if k=j, 0 otherwise.

We reuse this for j=1…m to compute all your grad f_j.

However you can not reuse this for different input value x_i.

A typical example is when branching occurs:

if(x1<x2) {
// A
f1 = x1+x2;
f2 = x1;
f3 = x2;
}
else {
// B
f1 = x1*x2;
f2 = x1-x1;
f3 = x2+x2;
}

if during the direct pass you have x1<x2 you will record operations of branch A, this is ok to compute your 3 gradients grad f1, grad f2, grad f3 but you can not change values of x1 and x2 without redoing a global direct pass to create a new record. (imagine that your new values are now x2>x1 -> branch B).

Also note that using adjoint is the reason why we must record all operations:
df^t = dg1^t ° dg2^t…
( dg1 is computed first, but dg1^t is the last operation in df^t, hence it must be recorded)

I hope the explanation is clear enough, not easy to explain shortly with my approximate English.

Vincent


#4

Hi!

Thanks much for the explanation!

Being naive to a topic is a good thing sometimes, but here I should have clearly read a bit more in details first…

Thanks much… and your language was completely fine for me.

Sebastian


#5

vincent-picaud Developer
October 17
I do not know the Stan reverse AD implementation details, but I guess it is not possible.

See: https://arxiv.org/abs/1509.07164

Most of the time, reverse AD libraries use operator overloading to record elementary operations g1, g2, g3…

That’s also what Stan does.

Hence your code to compute a function f:R^n->R^m is equivalent to

f = … g3 ° g2 ° g1.

Using chain rule the differential is

df =… dg3 ° dg2 ° dg1.

However to compute gradients we work with the differential adjoint

df^t = dg1^t ° dg2^t ° dg3^t ° …

The reason is that gradient grad f_j of your function f_j:R^n->R is the j-column of df^t.

Hence using adjoint allows you to compute this vector with the simple matrix-vector product

df^t.e_j

where e_j[k]=1 if k=j, 0 otherwise.

We reuse this for j=1…m to compute all your grad f_j.

However you can not reuse this for different input value x_i.

A typical example is when branching occurs:

if(x1<x2) {
// A
f1 = x1+x2;
f2 = x1;
f3 = x2;
}
else {
// B
f1 = x1*x2;
f2 = x1-x1;
f3 = x2+x2;
}

if during the direct pass you have x1<x2 you will record operations of branch A, this is ok to compute your 3 gradients grad f1, grad f2, grad f3 but you can not change values of x1 and x2 without redoing a global direct pass to create a new record. (imagine that your new values are now x2>x1 -> branch B).

Also note that using adjoint is the reason why we must record all operations:
df^t = dg1^t ° dg2^t…
( dg1 is computed first, but dg1^t is the last operation in df^t, hence it must be recorded)

I hope the explanation is clear enough, not easy to explain shortly with my approximate English.

Not only clear, but correct! Thanks.

  • Bob