Help write an autodiff handbook

I’m writing here to ask for help completing an introduction to autodiff with encyclopedic coverage of forward- and reverse-mode tangent and adjoint rules. I’ve put myself down as “editor”, but I’m going to manage the whole thing open source and give everyone who contributes credit, so that if lots of people get involved it’ll be like one of those bio or physics papers with a long author list.

People have been urging me to revise our autodiff paper for Stan and extend it to forward mode. Stan’s forward mode is simple and generic, so instead I decided to write a mini-textbook/encyclopedia on autodiff based on Mike Giles’s extended matrix autodiff paper. It’s a public repo on GitHub with code licensed under BSD:

But I need help completing the tangent and adjoint rules for further matrix results, diff eq solvers, algebraic equation solvers, 1D integrators, etc.—all the fun stuff we support in Stan.

The draft is written in bookdown. There are makefiles and build scripts, or you can just go into RStudio, load bookdown, and run render("index.Rmd").

In addition to expanded derivations following Giles, there’s

  • intro to forward mode
  • intro to reverse mode
  • intro to mixed mode
  • HMM derivatives (Qin, Auerbach, and Sachs 2000)
  • worked examples
  • working C++ reference code

The reverse-mode code is based on the continuation-based reverse mode I defined as an exercise on Discourse a while ago. The forward-mode code and functionals are based on Stan’s math library.
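If you haven’t seen that exercise, here’s the idea in a stripped-down C++ sketch (illustrative only; the names and structure are not the handbook’s actual code): every operation computes its value eagerly and pushes a continuation onto a tape, and running the continuations in reverse propagates adjoints.

#include <cmath>
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// tape of continuations, run in reverse after the forward pass
std::vector<std::function<void()>> tape;

struct var {
  double val;                   // value from the forward pass
  std::shared_ptr<double> adj;  // adjoint accumulated in the reverse pass
  explicit var(double v) : val(v), adj(std::make_shared<double>(0.0)) {}
};

var operator*(const var& a, const var& b) {
  var c(a.val * b.val);
  // continuation: push c's adjoint back to a and b
  tape.push_back([a, b, c]() {
    *a.adj += b.val * *c.adj;
    *b.adj += a.val * *c.adj;
  });
  return c;
}

var log(const var& a) {
  var c(std::log(a.val));
  tape.push_back([a, c]() { *a.adj += *c.adj / a.val; });
  return c;
}

void grad(const var& y) {
  *y.adj = 1;  // seed dy/dy = 1
  for (auto it = tape.rbegin(); it != tape.rend(); ++it)
    (*it)();
}

int main() {
  var u(2), v(3);
  var y = log(u * v);  // forward pass records one continuation per node
  grad(y);             // reverse pass runs them newest-first
  std::cout << *u.adj << " " << *v.adj << "\n";  // 1/u = 0.5, 1/v = 0.333...
}

The stored lambdas are the continuations; all the reverse-mode work is deferred until grad() runs the tape backward.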

Feedback is also welcome, either here, through GitHub issues, or via email at carp@alias-i.com.

10 Likes

I would suggest incorporating whatever is new in section 4 of

and at least giving a shoutout to

http://www.matrixcalculus.org/

1 Like

@charlesm93 wrote a nice paper on this! https://arxiv.org/abs/1811.05031

I’m happy to read stuff but I don’t think I’ve got much to add (except for the eye of someone who is bad with expression trees)

2 Likes

Thanks, @bgoodri. I’ll check it out. I hadn’t seen matrixcalculus.org—that looks super cool. It’d be great if we could work out the derivatives for our transforms and code those more efficiently.

I read the section of Dougal Maclaurin’s Ph.D. thesis on Autograd, which was interesting. They went with immutable matrices. I think we might be able to go down that route with Stan if we allow matrices to be built from something like a real(int, int) function and a pair of sizes (int, int).
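Very roughly, something like this (a C++ sketch of the idea only; the names are made up and this is not a proposal for Stan’s actual types):

#include <functional>

// an immutable matrix as a pair of sizes plus an element function
struct immutable_matrix {
  int rows, cols;
  std::function<double(int, int)> at;  // (i, j) -> value
};

// transpose without copying: just swap the index roles
immutable_matrix transpose(const immutable_matrix& a) {
  return {a.cols, a.rows, [a](int i, int j) { return a.at(j, i); }};
}

// lazy product: each entry is computed on demand
immutable_matrix multiply(const immutable_matrix& a,
                          const immutable_matrix& b) {
  return {a.rows, b.cols, [a, b](int i, int j) {
            double sum = 0;
            for (int k = 0; k < a.cols; ++k)
              sum += a.at(i, k) * b.at(k, j);
            return sum;
          }};
}

Whether that kind of laziness pays for itself against eager Eigen matrices is a separate question.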

@anon75146577: No expression graph yet. Just a lot of adjoint and tangent math. And a C++ reference implementation. I was hoping I could convince @charlesm93 to write the diff eq chapter.

1 Like

The www.matrixcalculus.org people also have http://www.geno-project.org/ where you can differentiate an objective function with respect to matrices and vectors using a simple language. That is less useful for the handbook but maybe more useful for stuffing everything into a function that evaluates a log-kernel.

3 Likes

I was hoping I could convince @charlesm93 to write the diff eq chapter.

Heck yeah, let’s talk!

I have some new stuff on differentiating the Laplace approximation that really nails down several important concepts: the utility of forward and reverse mode, the benefits of analytical “super nodes”, and a pretty sweet plot twist that involves finding the right initial cotangent. All this work also relates, to some degree, to discussions @seantalts and I had on differentiating algebraic solvers, which I need to revisit.

@betanalpha, Vianey, and I have, as you know, things cooking for HMMs. This week, I’m writing it all up in Stan (based on some extensive C++ code produced by Michael). Michael and I – mostly Michael – rederived the ODE adjoint method, which is better than what we currently have in Stan.

I recommend some of the discussion by @betanalpha on higher-order autodiff: https://arxiv.org/pdf/1812.11592.pdf.

2 Likes

This specific bookdown setup requires a bunch of extra packages, so it doesn’t render out of the box. A requirements listing would be helpful if you’re not going to push the html and pdf to the repo for easy access.

In any case, I highly recommend taking the introduction strategy that @charlesm93 used in his review, https://arxiv.org/pdf/1811.05031.pdf. Getting past “autodiff is just implementing the chain rule” to “autodiff propagates differential information through functions” better prepares the reader for autodiff architectures and for more subtle concepts like the need to implement only Jacobian-vector products instead of full Jacobians.
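To make the propagation picture concrete: write $f = g \circ h$ with $y = h(x)$ and $z = g(y)$. The chain rule says $J_f(x) = J_g(y) \, J_h(x)$, but neither mode ever forms a Jacobian. Forward mode pushes a tangent along the flow of the computation,

$$\dot{y} = J_h(x) \, \dot{x}, \qquad \dot{z} = J_g(y) \, \dot{y},$$

and reverse mode pulls a cotangent against it,

$$\bar{y} = J_g(y)^T \, \bar{z}, \qquad \bar{x} = J_h(x)^T \, \bar{y},$$

so each node only ever needs a Jacobian-vector (or vector-Jacobian) product.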

1 Like

Expression graphs could be drawn, for example, using the diagram package (maybe you know better options, but it seems good to me). Here is how to do one for the log(uv) example currently in section 4.1:

library(diagram)

# adjacency matrix: nonzero A[i, j] draws an arrow between nodes j and i
A <- matrix(0, 5, 5)
colnames(A) <- c("log", "*", "e", "u", "v")
A[2:3, 1] <- 1  # link log with * and e
A[4:5, 2] <- 1  # link * with u and v

# (x, y) coordinates of the five nodes on the unit square
pos <- matrix(c(0.6, 0.4, 0.8, 0.2, 0.6, 0.8, 0.5, 0.5, 0.2, 0.2),
              5, 2, byrow = FALSE)

plotmat(A, pos = pos, curve = 0, arr.type = "triangle", arr.width = 0.5,
        box.lcol = c("firebrick", "black", "black", "steelblue", "steelblue"),
        cex.txt = 0, box.size = 0.06)

It renders this: [expression graph with log linked to * and e, and * linked to u and v]

3 Likes

@Bob_Carpenter I can have a go at writing the chapter on ODEs and differential algebraic equations. If you give me permission, I can create a branch on the GitHub repo.

Sorry for the long delay—I forgot about this thread over in publicity.

The requirements are the requirements for bookdown. Same as for our manuals.

I’m adding requirements as people let me know what they are, but I set this up for myself so long ago that I don’t have a list.

That is fairly subtle, but I think I see what you mean. The first framing erroneously implies (at least on some readings) that you just literally calculate the Jacobian and then multiply in some high-dimensional matrix-like space.

There’s definitely more than just bookdown as I originally tried to compile this not long after getting bookdown working for another project. I think you’re using optional themes at the very least.

Multiplying Jacobian matrices in arbitrary order would definitely lead to some pretty nasty performance issues. It would also be harder to implement because you wouldn’t have pure matrix multiplication when dealing with multidimensional inputs and the like.
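Concretely, for a scalar-valued composition $f = f_L \circ \cdots \circ f_1 : \mathbb{R}^n \to \mathbb{R}$, the chain rule gives

$$\nabla f(x)^T = J_{f_L} \cdots J_{f_1},$$

and multiplying from the left (reverse mode) keeps every intermediate result a row vector, whereas multiplying from the right carries a full $m_k \times n$ matrix through every stage.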

There’s also the conceptual issue, however, that we’re not computing some big matrix but rather carrying things along (and against) the flow of the computation. I think this conceptual issue is a burden for people first learning autodiff (the concepts being presented don’t align with the general structure of the calculations seen in examples) and obfuscates many important properties. For example, once one recognizes the flow/propagation nature of autodiff, general properties like “the cost of evaluating the gradient (or anything computable in an O(1) number of sweeps) is the same order as the cost of evaluating the function” become much more apparent.

Honestly, I’m not sure how well-formed most readers’ intuitions about calculus and the chain rule will be, so I think the intro design is less about meeting readers at some prerequisite knowledge and more about creating a smoother pedagogy.

2 Likes

That’s just the Tufte package. And that I did document. And there’s also pandoc-citeproc or something like that.

I found autodiff pretty easy to understand once I found a computational example. But then I had a lot of experience with lazy evaluation in general, and continuations in particular, before I started, and continuations are what’s really going on conceptually.

+1 to that!

I propose grouping the chapters on ODEs, algebraic equations, and integrals into one section on implicit functions. This is in part because, for all these methods, we can apply the adjoint method to construct differentiation algorithms.

The content would be as follows:

Automatic differentiation for implicit functions

  1. Algebraic equations: implicit function theorem, adjoint method (sketched just after this list), practical considerations
  2. Optimization problems: extending results from the algebraic equations, equality constraints, inequality constraints, conditions for existence of derivatives, and diagnostics
  3. Ordinary differential equations: full Jacobian, adjoint method, tricks for users
  4. Action of matrix exponential: adjoint method based on ODE case
  5. Differential algebraic equations: adjoint method
  6. Discrete sequences: adjoint method
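To give the flavor of item 1 (roughly, with signs and conventions to be nailed down in the chapter): if $f(u, \theta) = 0$ implicitly defines $u(\theta)$, the implicit function theorem gives

$$\frac{\partial u}{\partial \theta} = -\left(\frac{\partial f}{\partial u}\right)^{-1} \frac{\partial f}{\partial \theta},$$

so reverse mode increments the adjoint $\bar{\theta}$ by

$$-\left(\frac{\partial f}{\partial \theta}\right)^T \left(\frac{\partial f}{\partial u}\right)^{-T} \bar{u}.$$

Computationally, the point is to solve the linear system $\eta = (\partial f / \partial u)^{-T} \bar{u}$ first and then take a single vector-Jacobian product, so we never form a full Jacobian.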

We’ve already written a lot of this down; it’s just a matter of putting it together.

2 Likes

I just saw this reference that may be useful:

https://dl.acm.org/doi/abs/10.1145/3382191

It contains the adjoint formulas associated with the basic linear algebra subprograms (BLAS).

2 Likes

@charlesm93 So do you have something more written about 1.-6. than what is currently in the GitHub draft? I would be interested in reading it.

I haven’t made much progress. I have a draft written but it needs cleaning up and adjusting. Now that this is back on my radar, I’ll try posting some chapters.

Is there a specific section you’re interested in? I could point you to some good references or share relevant drafts.

Mostly I would be interested in whether you have some general formulation for derivatives of an implicit function. I guess this is related to the implicit function theorem. I’d be happy if you pointed me somewhere on that, or on the algebraic equations or optimization problems stuff.

The branch issue-7/implicit-functions has the chapter on algebraic equations with two derivations: one that uses the implicit function theorem and one that uses the more general adjoint method. You can check out this branch and then compile the AD handbook to read the chapter. The cool thing about this approach (pedagogically) is that the algebraic solver provides a very simple example of the adjoint method. The traditional case of ODEs is more complicated.

This post covers how we should differentiate the algebraic solver in Stan. @betanalpha, @vianeylb, and I review the ODE case in this manuscript and provide further references; I’d say the benefit of this review is that we do the derivations in great detail. Other papers I like:

I have work on constrained optimization, but I haven’t shared it yet. It still needs refinement.

5 Likes

Could you point me to some good references or share relevant drafts on 2. (optimization problems)?