Adj_jac_apply

Well… getting adjoint ODE working would be just awesome. If we can get that to work then this is absolutely the preferred thing to do.

This just seems like a mega project to me, and sticking the Jacobians in as proposed will work out of the box (and “my” problems just involve a handful of states… so selfish me is fine). But if I can be of help with the adjoint stuff, I am happy to contribute, though my time is somewhat limited right now.

EDIT: and right, forward mode could be useful, but only for half of the cake, so I am not sure what it will buy us (and it’s not yet safe to use anyway).

Hmm, once I go back to school in a couple weeks I gotta be 100% in graduating, non-Stan-dev mode so I don’t want to take on a big project. I think overall adjoint is going to be less trouble than custom C++ery though.

I can probably get this in a proof-of-concept state without too much trouble. I know how to fix the memory issue now which was the biggest thing holding me up. Would you be willing to take over the pull req. if I can get something basic up (and it looks like it’ll be useful)?

Yeah, it’s a half-cake for sure haha.


Graduation… who needs that? Knocking down Stan PRs is a much better profession, no? … what I want to say is: your intern was/is/will be productive…

The adjoint stuff is super cool, yes; the C++ stuff I am suggesting would even go well along with it.

I can do that, yes. I can’t promise any deadlines, and I will probably need some time to digest your stuff, but if we can get this into Stan, it would be just amazing. So assuming we can get this to a proof-of-concept state that we can declare successful, then yes, I can take it on as the main person (though some help will likely be needed from time to time).

I’ll try to get it in a benchmark-able state sometime in the next few days. This is probably going to have some interactions with the MPI stuff and there could very well be numerical issues or other weirdnesses popping up too that take the edge off it.

Hard to be sure without trying it though haha.

Great!

MPI just needs nested AD; each job is executed as a nested AD operation. I hope that is not a problem?

And here’s a follow up post with a more interesting adj_jac_apply performance example.

In this case I compared a prim version of simplex_constrain (https://github.com/stan-dev/math/blob/develop/stan/math/prim/mat/fun/simplex_constrain.hpp#L28) to one implemented with adj_jac_apply (https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/fun/simplex_constrain.hpp#L15).

The benchmark code is here: https://gist.github.com/bbbales2/a1689764f0fda6df561e858026f4e8d9#file-adj_jac_apply_simplex_benchmark_test-cpp

This problem has a triangular Jacobian, but the stuff you need can be computed recursively, so again you’re better off never forming that Jacobian, or even explicitly multiplying by it.
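To make that concrete with a toy case (cumulative sums, a made-up stand-in, not the actual simplex transform): f(x)_i = x_1 + … + x_i has a lower-triangular Jacobian of all ones, but the adjoint product you need for reverse mode is just a reverse running sum, so the N-by-N matrix never has to exist.

```cpp
#include <cassert>
#include <vector>

// Adjoint-Jacobian product for f(x)_i = sum of x_j for j <= i.
// The Jacobian is lower triangular (all ones below and on the
// diagonal), so (adj^T J)_j = sum of adj_i for i >= j: a reverse
// running sum, O(N), with no N x N Jacobian ever materialized.
std::vector<double> cumsum_adjoint(const std::vector<double>& adj) {
  std::vector<double> x_adj(adj.size());
  double running = 0;
  for (int j = static_cast<int>(adj.size()) - 1; j >= 0; --j) {
    running += adj[j];
    x_adj[j] = running;
  }
  return x_adj;
}
```

The naive route, building the triangular matrix and multiplying, would be O(N^2), which is the kind of scaling difference the benchmark lines show.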

It looks like a healthy 4x speedup, which is nice. Via a visual inspection both of those lines are straight (though it’s hard to tell for the teal one).

The good news there is that regular autodiff is again scaling really nicely. The bad news is that this means we’re gonna need to be pretty careful with any custom-derivatives stuff not to end up slower or scaling worse :P.


uhm… just going over your A*x example again… was it a typo to pass in A and x by value and not by reference?

or is that on purpose?

Mistake for sure, they should be const refs. Fixed. Thanks!

Sure… one more question: do you recommend coding up the functor as lean as possible in terms of memory, or is it OK to store some fixed-size arrays in there? Storing fixed-size arrays in the functor makes the code easier to read, I find, but it requires that your implementation avoids unnecessary copies.

Fixed arrays are nicer than allocating stuff on the AD stack.

If you know the sizes of the arrays at compile time, then yeah, go for it, use all the fixed-size arrays. The adj_jac_apply varis get allocated in the autodiff memory arena, so the functor ends up allocated there anyway, but the arrays will be local to the object and you won’t have to deal with pointers.

I want to make the AD arena memory easier to use for these functors. It’s kinda clunky now.
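To sketch the difference (a hypothetical three-state functor; the names, sizes, and the plain-array interface are all made up, and the real Stan Math functor signatures differ): the saved forward-pass values sit in a fixed-size member array instead of behind a pointer into the arena.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 3-state functor, not real Stan Math code. Since
// adj_jac_apply places the whole functor object in the autodiff
// memory arena, the fixed-size array lands there too, but stays
// local to the object: no separate allocation, no pointer chasing.
struct SquareFunctor {
  static constexpr std::size_t N = 3;
  std::array<double, N> x_;  // forward-pass values saved for reverse

  // forward pass: f(x)_i = x_i * x_i; stash x for the reverse pass
  std::array<double, N> operator()(const std::array<double, N>& x) {
    x_ = x;
    std::array<double, N> out;
    for (std::size_t i = 0; i < N; ++i)
      out[i] = x[i] * x[i];
    return out;
  }

  // reverse pass: the Jacobian is diagonal with entries 2 * x_i,
  // so the adjoint product never needs the full N x N matrix
  std::array<double, N> multiply_adjoint_jacobian(
      const std::array<double, N>& adj) const {
    std::array<double, N> x_adj;
    for (std::size_t i = 0; i < N; ++i)
      x_adj[i] = adj[i] * 2 * x_[i];
    return x_adj;
  }
};
```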

Ok. And if I am reading your design right, then the user must still define the non-var (plain double) version of the function. The adj_jac_apply is then used to define an overload, right?

Sorry for all those Q… I am closing in on analytic Jacobians for my ODE…

The magic happens in something like this: https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/fun/simplex_constrain.hpp#L95

You pass the functor as a template argument to adj_jac_apply and the autodiff arguments as regular arguments. It hooks everything in with autodiff and returns you some vars.
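The call shape, as a stripped-down mock (plain C++; the real stan::math::adj_jac_apply additionally wires the functor into the autodiff stack and returns vars, which this skips entirely, and the functor name here is invented):

```cpp
#include <cassert>
#include <utility>

// Toy mock of the calling pattern only: the functor type is a
// template argument, the autodiff arguments are ordinary arguments.
template <typename F, typename... Args>
auto mock_adj_jac_apply(Args&&... args) {
  F f;  // the functor is constructed internally, not passed in
  return f(std::forward<Args>(args)...);
}

// hypothetical functor for f(a, x) = a * x
struct AxFunctor {
  double operator()(double a, double x) { return a * x; }
};
```

So a call looks like `mock_adj_jac_apply<AxFunctor>(2.0, 3.0)`: functor as the template argument, autodiff arguments as regular arguments.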

Sure, but in addition there is always a function defined for the double-only case. At least that is the case for the softmax… and the function you refer to always returns a var type of thing, so it cannot just return an Eigen::VectorXd.

Oh oh sorry, yes, you’re right. You still need to define the prim version (mostly cause that’ll also handle fvars and fvar<var>s and whatever else)

Oh no… I just realized that, if you have a Jacobian already, then using adj_jac_apply as a way to inject the Jacobian into the ODE solver is gonna be really inefficient.

The issue is that you already have the Jacobian, you supply it to adj_jac_apply, and then the jacobian function in the ODE solver reconstructs it one matrix-vector product at a time (where only one element of each vector is non-zero).
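To see the waste concretely, here’s a toy sketch (plain C++, not the actual solver code) of what that reconstruction amounts to: if all you expose is an adjoint-Jacobian product, recovering the full matrix takes one product per row, each with a unit vector that has a single non-zero entry.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Rebuild an n x m Jacobian through an adjoint-product interface:
// n products v^T J with v = e_0, ..., e_{n-1}, each just reading off
// one row. That's n full passes to recover a matrix you already had.
Matrix rebuild_jacobian(const Matrix& J) {
  std::size_t n = J.size();
  std::size_t m = J[0].size();
  Matrix out(n, std::vector<double>(m, 0.0));
  for (std::size_t i = 0; i < n; ++i) {
    std::vector<double> e(n, 0.0);
    e[i] = 1.0;  // only one non-zero element in the probe vector
    for (std::size_t j = 0; j < m; ++j)
      for (std::size_t k = 0; k < n; ++k)
        out[i][j] += e[k] * J[k][j];  // v^T J picks out row i
  }
  return out;
}
```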

precomputed_gradients will be a better option for what you’re doing.

but the adj_jac_apply facility looks far more convenient to code up, as I don’t have to deal as much with the var stuff. Moreover, if there are a ton of zeros, I can take advantage of that.

Let’s see how this will come along.

+1. As soon as we get it tested, it’s totally the way to calculate directional derivatives. I don’t know how well that’ll plug into the ODE solvers, but it’ll be faster even for systems with as many outputs as inputs (or even slightly fewer given the lower constant factors).

One issue is that we haven’t optimized any of our forward-mode derivatives. So the gain is in theory and should hold for simple things, but I wouldn’t trust it for complicated matrix stuff (though even then, plugging in the custom types should be reasonable).
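For reference, the directional-derivative idea can be sketched with a minimal dual-number type (a toy analogue of fvar, not Stan’s actual implementation): seed the tangents with a direction v, and a single forward sweep yields the directional derivative grad(f) . v.

```cpp
#include <cassert>

// Minimal dual number: val carries the value, tan the tangent.
// Seeding tan with a direction v makes one evaluation of f return
// both f and its directional derivative along v.
struct Dual {
  double val, tan;
};

Dual operator*(Dual a, Dual b) {
  // product rule on the tangent component
  return {a.val * b.val, a.val * b.tan + a.tan * b.val};
}

Dual operator+(Dual a, Dual b) {
  return {a.val + b.val, a.tan + b.tan};
}

// example function f(x, y) = x * y + x
Dual f(Dual x, Dual y) { return x * y + x; }
```

At (x, y) = (2, 3) with direction v = (1, 0), one sweep gives f = 8 and the partial df/dx = y + 1 = 4, with no reverse pass needed.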

You certainly won’t need it for programming, but the union card helps for jobs at universities and research labs.

A 10% speedup is “nice”. 4x is at least “great”, if not “fantastic”. It helps to log scale the y axis for performance reports so we can read off the speedup and see how it depends on N.

Yes. The design is set up to be mutable after the functor call, then immutable until the adjoint call, so you can save intermediate values for use in the Jacobian. You just need to use variables on our autodiff stack.

Yes, indeed. Zeros in double-based matrix arithmetic are no problem, as they don’t incur autodiff overhead. At the same time, the adjoint-Jacobian-apply thing isn’t really set up for sparsity of the whole adjoint vector, as it assumes each adjoint needs to be incremented.

The ideal is to let the implementation of the adjoint-Jacobian product be customizable so that if a function has a sparse Jacobian then its autodiff implementation can take advantage of that, right?

Yurp, that’s right.

Right—it can deal with sparse Jacobians. It just can’t deal with sparseness in the adjoint-Jacobian product result. All those entries get filled in and incremented.
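A toy sketch of that distinction (a made-up function, not Stan code): the Jacobian can be as sparse as you like and the implementation can exploit it, but the returned adjoint vector is dense, since every entry gets written and incremented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Made-up example: f(x) = (x_1 + ... + x_n, 0, 0). Every Jacobian
// row except the first is zero, and the implementation exploits that
// sparsity: it only ever reads adj[0]. But the *result* is dense;
// every input adjoint entry still gets a value to be incremented.
std::vector<double> sum_head_adjoint(const std::vector<double>& adj,
                                     std::size_t n_inputs) {
  return std::vector<double>(n_inputs, adj[0]);
}
```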