I keep talking about it in meetings, but I’ve given very little info on exactly how the adj_jac_apply thing @Bob_Carpenter and I have been working on.

I’m gonna start a thread to pull the covers off on that one a bit and start showing it off. There’s various people I want to brainwash to start using it. Some people know I want to brainwash them (@yizhang and @anon75146577), and some people do not (@anon79882417), but I’ll keep @ ing people as this develops and gets fancier.

It’s still sitting as a pull: https://github.com/stan-dev/math/pull/964 , so it isn’t totally ready to use yet, but here is an example of using it to re-implement solving Ax = b: https://gist.github.com/bbbales2/523db7848df48117dd39788815dfe1a2

What makes adj_jac_apply different from the other ways of implementing autodiff:

1. You don’t need to compute the Jacobian of your operator, just a vector^T Jacobian product (which if your Jacobian is sparse can be faster than doing the full Jacobian)
2. Each adj_jac_apply will only create one vari on the chaining autodiff stack, no matter how many vars your function outputs
3. This automatically handles input types that are mixes of vars and double types

The major limitation of adj_jac_apply is if you want to save variables between the forward pass (operator()) and the reverse pass (chain), you gotta do that using the autodiff memory arena. This means you gotta manage the raw memory of your Eigen types (this is shown in the solve example above).

It also only works with reverse mode. No fvars or higher order autodiff.

5 Likes

Would it make sense to switch over stuff which outputs a lot of things and anyway only work in the rev mode to this approach (once all the types are supported)?

For example, the ode integrators output per output time-point and per state of the ODE a var. So whenever we have lot’s of output, then the AD tree will grow a lot. If I get you right, then we can avoid this with a single var when we use this approach. Currently the ODE outputs are nested arrays; not sure if this facility will deal with this at some point.

The same thing holds for map_rect, for example. Not sure how much performance this buys us, but for large models it may be worth it.

Having coded up myself poisson_lpmf with this, I think that this is super cool, but we probably need to make facilities available to take from users away the burden to mess with low-level pointers and memory allocation themselves. As a start having some examples will be important.

Huge! Possibly made what I was doing a lot easier. Going to drop what I was doing and give this tests on some kernels… I can get back may be tomorrow on what I find

I’ve been a bit sloppy with how I’m saying it. If there are N outputs, there are still N vars and N varis, but only one of those varis gets put on the autodiff stack and has its chain method called. It’s an implementation detail that can have a big impact.

Yeah it’s totally possible. I’ll keep the nested arrays in mind when I work on the outputs

Yeah, I agree. I’ve been thinking I’ll code up at least some simple container objects for the Eigen types. I’ll put something out soon on that. People coding with this always have to stay conscious of how the arena memory works, but we can make it more convenient at least.

Haha, don’t drop it yet. What you’re doing needs std::vector<Eigen::Matrix<T, R, C>> input types as well as Eigen::Matrix outputs. It’ll be along soon though!

edit: spelling correction

1 Like

@anon75146577 here’s the addition operator you ordered! @Bob_Carpenter – you’ll be interested in the benchmarks at the end.

struct AddFunctor {
template <std::size_t size>
Eigen::VectorXd operator()(const std::array<bool, size>& needs_adj, const Eigen::VectorXd& x1, const Eigen::VectorXd& x2) {
return x1 + x2;
}

template <std::size_t size>
}
};


This is compilable from the current stan-dev/math develop branch.

This is equivalent to the prim implementation:

auto AddFunctorAutodiffed = [](auto& x1, auto& x2) {
return (x1 + x2).eval();
};


And in the spirit of Checking That Things We Write Actually Work, I ran some benchmarks. I compared the prim implementation above to the adj_jac_apply implementation. I also coded up an “inefficient” adj_jac_apply that computes a full Jacobian as a comparison.

I expected the autodiff and efficient adj_jac_apply would both be fast, and the adj_jac_apply faster for large vectors (cause there’s way less chain calls). Turns out adj_jac_apply is about 20% slower than the purely prim implementation. I guess that means my processor is better at virtual function calls than I gave it credit for, or shufflying around the double-datatypes in adj_jac_apply is more expensive than I thought. These are the numbers:

The inefficient implementation is, of course, bad. And I guess this should just be a Warning that unless this stuff is used craftily, you can still end up slowing your code down:

I’m going to compare a prim vs. adj_jac_apply implementation of a more complicated function to get a handle on this (simplex_constrain, but I’ll get to that later). Looks like we’re gonna need to be careful when using this to make sure the complexity of our adj_jac_apply doesn’t exceed the regular autodiff! It’s sneakily efficient it seems.

Full test benchmark code is here: https://gist.github.com/bbbales2/a1689764f0fda6df561e858026f4e8d9

1 Like

I reported on this in the arXiv paper. It turns out that basic matrix multiply with autodiff in Eigen (assuming template traits instantiated properly) is faster than our custom version (or at least it was—I think Rob may have since cleaned things up in our custom version). But our custom version is much more memory efficient, so it’s still a win. We don’t want to store out N^2 \times N^2 Jacobians when we can avoid it!

Addition is very efficient in basic autodiff the way I wrote it with a chain() method, as it’s just increments the operand’s adjoint with the result’s adjoint. No need to store the partial (constant 1) or multiply it. It’s the biggest win for the lazy approach built into our basic autodiff. Your mileage will vary tremendously elsewhere.

@bbbales2: What do you think about using adj_jac_apply to essentially plug analytic derivatives into the ODE solver? I mean, I really would like to try out to use your shiny new facility in order to define an ODE system functor with analytic Jacobians (the ODE system functor would just call another function which is defined using adj_jac_apply).

Have you thought of using it this way? Do you see why this would not be a good idea? I mean, there will still be minimal AD going on during ODE integration, but this approach should get us really close to getting rid of AD during ODE integration. I hope I find soon the time to code this idea up, but if it works, I will try to go one step further and resurrect my old R code which automatically spits out the analytic Jacobians for simple algebraic ODE system functors. Thus, we are really close to having a principled way to provide the Jacobian for the purpose of ODE integration… your 50 states system you mentioned a while ago should massively gain from such an approach, I would guess.

@bgoodri Is it a good thought to plug this code generation into rstan (this would not be restricted to ODE system functors, but rather to what R can symbolically differentiate)?

I’d prefer to get the ODE autodiff scaling the right way before doing custom Jacobians. That means adjoint sensitivity :P. If you wanna work on that, some very shaky pieces are here: https://github.com/bbbales2/math/pull/2/files . I’m down to Skype about it, but I gotta finish up the other stuff I’ve started before I work on something like that.

To answer your question though, you could use this for custom Jacobians, sure.

I’d also be curious to see what would happen if the current sensitivity stuff way re-engineered to use forward mode autodiff. Forward mode is pretty fast cause it doesn’t have to deal with a stack.

The sensitivity eqs. are:

s(t) = \frac{d y(t)}{dp}

\frac{\partial s}{\partial t} = f_y s + f_p

Forward mode can compute J v easily, so you can use it to compute f_y s with one pass. Reverse mode does v^T J, which doesn’t show up there. If you have M parameters, you’d still need M passes to get all the f_p entries you need.

Well… getting adjoint ODE working would be just awesome. If we can get that to work then this is absolutely the preferred thing to do.

This just seems like a mega project to me and sticking the Jacobians in like proposed will work out of the box (and “my” problems just involve a handful of states… so my selfish me is good). But if I can be of help in the adjoint stuff, I am happy to contribute; though my time is somewhat limited right now.

EDIT: and right forward mode could be useful, but only for half of the cake; so I am not sure what it will buy us (and it’s anyway not yet safe to use).

Hmm, once I go back to school in a couple weeks I gotta be 100% in graduating, non-Stan-dev mode so I don’t want to take on a big project. I think overall adjoint is going to be less trouble than custom C++ery though.

I can probably get this in a proof-of-concept state without too much trouble. I know how to fix the memory issue now which was the biggest thing holding me up. Would you be willing to take over the pull req. if I can get something basic up (and it looks like it’ll be useful)?

Yeah, it’s a half-cake for sure haha.

1 Like

Graduation… who needs that? Knocking Stan’s PRs down is much better of a profession, no? … your intern was/is/will be productive is what I wanna say here…

The adjoint stuff is super cool, yes; the C++ stuff I am suggesting would even go well along with it.

I can do that, yes. I can’t promise on any deadlines or so - and I will probably need some time to digest your stuff, but if we can get this into Stan, this would be just amazing. So assuming we can get this in a PoC state, which we hopefully can declare as successful, then I can take that on as main person, yes (though some help will likely be needed from time to time).

I’ll try to get it in a benchmark-able state sometime in the next few days. This is probably going to have some interactions with the MPI stuff and there could very well be numerical issues or other weirdnesses popping up too that take the edge off it.

Hard to be sure without trying it though haha.

Great!

MPI just needs nested AD. For MPI each job is executed as a nested AD operation. I hope that is not a problem?

And here’s a follow up post with a more interesting adj_jac_apply performance example.

In this case I compared a prim version of simplex_constrain (https://github.com/stan-dev/math/blob/develop/stan/math/prim/mat/fun/simplex_constrain.hpp#L28) to one implemented with adj_jac_apply (https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/fun/simplex_constrain.hpp#L15).

This problem has a triangular Jacobian, but you can compute the stuff you need recursively so again you’re better off not computing that Jacobian or even multiplying by it implicitly.

It looks like a healthy 4x speedup, which is nice. Via a visual inspection both of those lines are straight (though it’s hard to tell for the teal one).

The good news to that is that again the regular autodiff is scaling really nicely. The bad news is that this means we’re gonna need to be pretty careful with any custom derivatives stuff to not end up slower or scaling worse :P.

2 Likes

uhm… just going over your A*x example again…was it a typo to pass in A and x by value and not by reference?

or is that on purpose?

Mistake for sure, they should be const refs. Fixed. Thanks!

Sure… one more question: Do you recommend to code up the functor as lean as possible in terms of memory or is it ok to store some fixed size arrays in there? Storing fixed size arrays in the functor makes the code easier to read as I find, but it would require that you are avoiding unnecessary copies in your implementation.

Fixed arrays are nicer than allocating stuff on the AD stack.

If you can know the sizes of the arrays before compile time, then yeah, go for it, use all the fixed size arrays. The adj_jac_apply varis get allocated in the autodiff memory arena so it’ll end up allocated there, but it’ll be local to the object and you won’t have to have pointers.

I want to make the AD arena memory easier to use for these functors. It’s kinda clunky now.

Ok. And if I am reading your design right, then the user must still define the the non-var (plain double) version of the function. The adj_jac_apply is then used to define an overload, right?

Sorry for all those Q… I am closing in on analytic Jacobians for my ODE…