Soliciting syntax ideas for user defined gradients and user defined transformations

I'm pretty curious about the syntax here and would love to get the ball rolling on a design doc that adds something for user-defined gradients and transformations - it's one of the first steps we need before we can start writing more of Stan in Stan. Here are a couple of loose ideas I've seen floating around.

@bbbales2, we talked about using annotations to do something like

@grad_of(foo)
def foo_grad(real y) { ... }
def foo(real y) { ...forward value computation}

Other options include exposing something like operands_and_partials,

def foo(real a, real b) {
  real value = a * b * b;
  a.gradient = b * b;
  b.gradient = 2 * a * b;
  return value;
}

Or we could do something less special-purpose using tuples or records (briefly mentioned in this other thread: Stanc3 type system overhaul), but I think those are a worse fit for our audience…

Anyone else have other ideas, or know of examples from other languages? I feel like there must have been threads about this already, but Discourse search and I are so bad together that we can't find any.

//cc @Bob_Carpenter @wds15 @yizhang @avehtari @James_Savage @aaronjg @andre.pfeuffer @enetsee @Matthijs


What I like about the operands-and-partials thing you put up is that the user can reuse expressions which have already been calculated. That's a pretty big thing which you can't do with the annotation approach.

It would be cool if we could grab the sparsity structure of the Jacobian for functions with multiple outputs… if that's in scope…


I have two desiderata for this.

  1. If we have to represent an explicit dense Jacobian for functions with multivariate output, we're toast. Just consider something like matrix inverse—that's an N^4 Jacobian for an N \times N matrix (see the identity spelled out just after this list). Therefore, an approach like operands-and-partials can be at most a partial solution, at least until it can be made sparse as @wds15 is suggesting.

  2. If we have to compute derivatives with respect to non-parameter inputs, we're losing a lot of efficiency. At the same time, if we compute them separately without sharing any computation, we're also toast.
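
To spell out the matrix-inverse example in (1)—this is just the standard identity, added here for concreteness—for an N \times N matrix A,

\frac{\partial \left(A^{-1}\right)_{ij}}{\partial A_{k\ell}} \ = \ -\left(A^{-1}\right)_{ik} \left(A^{-1}\right)_{\ell j},

so the Jacobian has N^2 \times N^2 = N^4 entries even though it is completely determined by the N^2 entries of A^{-1}.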

Reverse mode

So I've been thinking we need something along the lines of an adjoint-Jacobian product approach. Suppose we have a function of type

T0 foo(T1 x1, ..., TN xN)

for which we’d like to define derivatives.

Conceptually, for reverse mode, what we need is to be able to compute the function value (using double inputs) in the forward pass. Then in the reverse pass, when we hit the result y = foo(x1, ..., xN), for each non-constant input xn we need to compute the product of the adjoint of y with the Jacobian of y with respect to xn and add the result to the adjoint of xn. All of these operations only involve double (and int) values.
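
In symbols (just restating the above, with the bar notation for adjoints used later in the thread), the per-input update is

\bar{x}_n \ \textrm{+=} \ \left( \frac{\partial y}{\partial x_n} \right)^{\top} \bar{y},

and the point is that this product can usually be computed without ever forming the Jacobian explicitly.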

One way to do that without fully satisfying desideratum (2) is to write paired functions for the value and the adjoint-Jacobian product with respect to each input.

T0 foo(T1 x1, ...., TN xN) { ... }

T1 foo_adj_jac_prod_1(T0 y, T0 y_adj, T1 x1, ..., TN xN) { ... }

...
TN foo_adj_jac_prod_N(T0 y, T0 y_adj, T1 x1, ..., TN xN) { ... }

The first function foo defines the operation of the function itself as usual. That function gets executed in the forward pass of reverse mode with double arguments. We only need to save a reference to the output and each of the inputs as those will contain the relevant pieces of info we need in the reverse pass.

At the point in the reverse pass where we need to execute the chain() operation on the output y, we pull out the value of y, all of the inputs x1, ..., xN, and the adjoint y_adj as double values, then pass them into the adjoint-Jacobian function for each non-constant input xn. The result is then added to the adjoint of xn.
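
For reference, here's a minimal C++ sketch of what this looks like with the existing Stan Math machinery, applied to the running example foo(a, b) = a * b * b from above. The names example_foo_vari and example_foo are made up for illustration; vari, var, val_, adj_, and chain() are the existing reverse-mode pieces.

#include <stan/math/rev/core.hpp>

// Forward pass: compute the value from double inputs and keep pointers
// to the operands so the reverse pass can see their values and adjoints.
struct example_foo_vari : public stan::math::vari {
  stan::math::vari* a_;
  stan::math::vari* b_;
  example_foo_vari(stan::math::vari* a, stan::math::vari* b)
      : stan::math::vari(a->val_ * b->val_ * b->val_), a_(a), b_(b) {}
  // Reverse pass: add the adjoint-Jacobian products into each operand.
  void chain() {
    a_->adj_ += adj_ * b_->val_ * b_->val_;         // d foo / d a = b^2
    b_->adj_ += adj_ * 2.0 * a_->val_ * b_->val_;   // d foo / d b = 2ab
  }
};

inline stan::math::var example_foo(const stan::math::var& a,
                                   const stan::math::var& b) {
  return stan::math::var(new example_foo_vari(a.vi_, b.vi_));
}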

Forward mode

We need to multiply the input’s tangent value by the Jacobian of the output w.r.t. that input to get a term in the output’s tangent. So we need functions with signatures like this for each input

T0 foo_jacobian_tan_1(T1 x1, ..., TN xN, T1 x1_tan) { ... }

...
T0 foo_jacobian_tan_N(T1 x1, ..., TN xN, TN xN_tan) { ... }

Then we just apply that to all the non-const inputs and add the results into the tangent of y. We might need to include the other tangents as arguments, but in that case, I don’t know how to deal with ones that aren’t defined because the input’s a constant.
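
For comparison, here's a rough sketch of the same example propagating tangents with Stan Math's existing fvar type (val_ holds the value, d_ the tangent); the function name is again made up, and both inputs' contributions are summed in one place rather than split into one function per input as proposed above.

#include <stan/math/fwd/core.hpp>

template <typename T>
stan::math::fvar<T> example_foo(const stan::math::fvar<T>& a,
                                const stan::math::fvar<T>& b) {
  T val = a.val_ * b.val_ * b.val_;
  // Output tangent = sum over inputs of (Jacobian w.r.t. input) * (input tangent).
  T tan = a.d_ * b.val_ * b.val_ + b.d_ * 2 * a.val_ * b.val_;
  return stan::math::fvar<T>(val, tan);
}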

Jacobians as a shortcut

If we have the Jacobian w.r.t. each input fully expressed as a templated function, we could use it to implement all of the above or just use it directly.


I talked to @Bob_Carpenter about this and he mentioned that the adjoint functions might also need access to the original arguments.

Given that, I kinda think we might need functor-like functions in Stan that have local storage to make this work. Maybe something that isn't even too far from how the adjoint-Jacobian stuff is implemented in C++ now. This one makes use of temporary storage: https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/fun/simplex_constrain.hpp .

For the linear algebra stuff these temporary values are non-trivial.

So something more monolithic like you are suggesting here: Stanc3 type system overhaul - #10 by seantalts

+1
There are a bunch of examples like GPs and Kalman filters, where the output is low dimensional (so the Jacobian is also small), but there are many internal variables/states and a lot of computation which can be re-used.

Doesn't an operands-and-partials approach where one implements a Jacobian-adjoint product (with the adjoint passed in the signature) instead of the components of the gradient satisfy the desire to reuse local expressions and avoid dense Jacobians?


operands_and_partials is so named because that’s what it stores after the forward pass. I wouldn’t want to use that name for something with different behavior. But in this case, we already have the cleverly named adj_jac_apply function that does something like operands_and_partials, only lazily.

Let me stress again that the problem with calculating derivatives for all inputs at once is that you only need that if all inputs are variables (not constants). Usually, a subset of the inputs are double-based and don't need derivatives. So I'm not sure what the best thing to do here is—we don't want to compute derivatives we don't use, and we don't want to have to write 2^N derivative versions for N-ary functions.

I still think we can augment code analysis to do this for us. Certainly simple cases are easy; if a is a double in our example, we just remove the assignment to a.gradient and let dead code removal take over.

Regarding adj_jac_apply - I see no reason to not allow multiple ways of specifying a user-defined gradient in the Stan language, as there are different performance considerations leading one to prefer different formulations of the gradient. So for densities and other functionals (functions from \mathbb{R}^n \to \mathbb{R}) we would be better off using a single function or something like operands and partials, but for large multivariate functions we'd be better off using something like adj_jac_apply. Alternatively, we could plan ahead and hope to use the imminent sparse matrix support for efficient multivariate gradients using something like operands and partials. Capturing sparsity structure should be doable, I think - maybe @stevebronder can comment.

That’d be super helpful.

The case I'm wondering about is where I build an intermediate A because it's useful in the gradients for both B and C. But if I only need the gradient for B, I might not want to build A. That might not be so common, though.

We only need two ways to specify derivatives at root: the adjoint method (reverse mode) and the tangent method (forward mode). That's the abstraction on which they're built—the chain() method on a vari implements the adjoint-Jacobian sum operation. We can layer things like operands-and-partials on top of those, but it bottoms out in those definitions.

Sparse matrix structures are terribly inefficient, whereas writing custom adjoint-Jacobians is pretty trivial. And remember, it’s not sparse matrices we need, but sparse structured tensors. If we have an N \times N matrix x and compute its inverse x^{-1}, the Jacobian is (N \times N) \times (N \times N) and heavily structured.

Just consider something like addition. The Jacobian is the identity, so the adjoint-Jacobian product is just the adjoint. To code that using operands-and-partials, we'd code up a unit matrix and get a ton of indexing and multiply-by-1 operations. Now we could template that out, but it gets a lot more complicated to do that generically. Whereas in the implicit adjoint-Jacobian case we'd just use the identity function, which should be faster.
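
Spelled out: for y = a + b, the Jacobians \partial y / \partial a = \partial y / \partial b = I, so the reverse-mode chain() step is just

\bar{a} \ \textrm{+=} \ \bar{y}, \qquad \bar{b} \ \textrm{+=} \ \bar{y},

with no identity matrix ever being formed or multiplied.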

For more complicated cases, check out Giles’s matrix autodiff paper or the Matrix Cookbook (though the latter involves more finagling mathematically for autodiff).

Let me try to clarify why we can’t have anything relying on Jacobians, sparse or otherwise, as our basis. Let’s take a matrix operation from Giles’s paper that also happens to be a key function in Stan.

C = A \, \backslash \, B = A^{-1} \, B.

where all of A, B, C are N \times N matrices with adjoints \bar{A}, \bar{B}, \bar{C} (i.e., Matrix<var, -1, -1> types in C++). The Jacobian is going to be dense and have 2N^4 entries! On the other hand, the chain() method for C need only do this:

\bar{B} \ \textrm{+=} \ \left( A^{-1} \right)^{\top} \, \bar{C}
\bar{A} \ \textrm{+=} \ -\bar{B} \, C^{\top}.

Those operations on the right encapsulate an efficient calculation of the Jacobian-adjoint product, which is then added to each adjoint. For numerical stability, we'd presumably compute the update for \bar{B} using another division rather than as written; if we did compute it as written, we'd have to store the intermediate A^{-1}, which is only N^2.
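
For illustration, here's roughly what that chain() body could compute, written with plain Eigen doubles; the vari bookkeeping that gathers and scatters the adjoints is omitted, and the function name is made up.

#include <Eigen/Dense>

// Adjoint updates for C = A \ B, following the formulas above.
void example_mdivide_left_chain(const Eigen::MatrixXd& A,
                                const Eigen::MatrixXd& C,      // saved result A^{-1} B
                                const Eigen::MatrixXd& C_adj,  // adjoint of the result
                                Eigen::MatrixXd& A_adj,
                                Eigen::MatrixXd& B_adj) {
  // B_bar += A^{-T} C_bar, computed as a solve rather than forming A^{-1}.
  Eigen::PartialPivLU<Eigen::MatrixXd> lu(A.transpose());
  Eigen::MatrixXd B_bar = lu.solve(C_adj);
  B_adj += B_bar;
  // A_bar += -B_bar C^T.
  A_adj -= B_bar * C.transpose();
}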