Eigen-AD: Algorithmic Differentiation of the Eigen Library

This was posted on Eigen’s mailing list this morning. Looks promising.

---------- Forwarded message ---------
From: Patrick Peltzer
Date: Tue, Dec 3, 2019 at 7:56 AM
Subject: [eigen] Eigen-AD: Algorithmic Differentiation of the Eigen Library
To: eigen@lists.tuxfamily.org

Dear all,

We would like to draw your attention to a new fork of Eigen with support
for Algorithmic Differentiation (AD) in the context of AD by overloading
in C++. Important new features include highly efficient first and higher
derivatives of Eigen solvers as well as compatibility with a wide range
of AD software tools. See

[1911.12604] Eigen-AD: Algorithmic Differentiation of the Eigen Library

for details including benchmark results. Please contact
info@stce.rwth-aachen.de for access to the software.

We look forward to further discussions with potential users and other AD
software tool developers.

With kind regards

Patrick Peltzer, Johannes Lotz, Uwe Naumann
Informatik 12 (STCE), RWTH Aachen University, Germany

6 Likes

Neat! Gonna make some notes while I skim this over. I think this is written by the same folks who did the complex numbers in Stan talk at last StanCon.

Therefore, replacing the return type of the functors with the auto keyword allows the passing of expression types from the AD-O tool to the Eigen internals.

I think Eigen doesn’t require C++11, so I doubt this would get backported without some macro magic. This section has some nice gotchas with Eigen and auto that we have run into in the past.
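Something like this classic one, assuming nothing beyond stock Eigen: auto captures the expression template, not the evaluated result, so later mutation of an operand silently changes what the expression evaluates to.

#include <Eigen/Dense>

int main() {
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(3, 3);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(3, 3);
  // c is a CwiseBinaryOp expression holding references to a and b,
  // not an evaluated Matrix
  auto c = a + b;
  a.setZero();            // silently changes what c will evaluate to
  Eigen::MatrixXd d = c;  // evaluates 0 + b, not the original a + b
}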

template<typename _MatrixType, typename enable_if = void>

nooooo, not a good name!

Some of their solver stuff will be added to Eigen 3.4!

If they open source their actual implementation we could look at how they use the new eigen_adsolver_enable type in Eigen.

Oddly, they talk a lot in the beginning about the Eigen solver stuff but then in the benchmarks don’t do comparisons for Cholesky AD, which I feel like is sort of the baseline solver.

Was confused by this sentence:

dco/c++ provides a constant copy-overhead which is independent of the differentiation order; i.e. no matter what order, only a single copy operation to the passive data type is required.

Like (1) the initial data is only passed over once? Or (2) they somehow avoid making more than one copy across operations? For (1), yes, that’s cool and good. For (2), idk how you go about that.

Regarding the third point, the solution vector can simply be saved in the callback object. In order to provide the decomposition in the adjoint run, it would be possible to save the input matrix in the augmented primal run and then decompose it again in the adjoint run. However, this would introduce a run time overhead of O(n^3) in the adjoint run. Instead, it is beneficial to save the whole decomposition.

If I’m reading this correctly, we do the same thing in cholesky_decompose’s vari specializations, unless they went a step further and made the Eigen matrices stateful.

Since the AD implementations quickly exceed the amount of available RAM, they are not suited for higher input dimensions and no data is available for n > 1000.

Maybe I’m just dumb, but I also don’t understand this sentence. idt we do what would be considered symbolic diff, and we do pretty large matrices, so I am confused why they can’t do n > 1000. Heck, our OpenCL stuff doesn’t even kick in until you get to N, M > 1000.

Ah, okay on page 11 they start talking about us

However, it must be noted that the Stan Math Library does not offer an API for advanced AD users, e.g. lacking a callback system or the possibility to compute arbitrary adjoint directional derivatives in the adjoint mode, only offering a grad() functionality.

If their thing can do adjoint directional derivatives, this is the first place they mention it in the paper. Can’t we do directional derivatives in forward mode?
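We can, e.g. with the gradient_dot_vector functional from the mix functionals Bob lists below. A hedged sketch (header paths follow the era this thread is from, so they may differ in your version):

#include <stan/math/mix/mat.hpp>

// f(x) = x'x, written against a generic scalar so it instantiates with AD types
struct dot_self_fun {
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
    return stan::math::dot_self(x);
  }
};

int main() {
  Eigen::VectorXd x(2), v(2);
  x << 1, 2;
  v << 0, 1;  // direction to differentiate along
  double fx, grad_fx_dot_v;
  stan::math::gradient_dot_vector(dot_self_fun(), x, v, fx, grad_fx_dot_v);
  // grad f = 2x, so grad_fx_dot_v should be 2 * 2 * 1 = 4
}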

Although the Stan Math Library can be applied to arbitrary code, we see its advantages rather on the applications it was especially designed for, e.g. linear algebra using its own API and storage types from Eigen. dco/c++ on the other hand is aimed to be used for more generically written code and therefore requires less modifications or restrictions to be applied, while still providing a good performance. This can be inferred from Figure 10, where std::vector was used as the storage type for computing

Would be curious to see why we suck so much on that simple std::vector example. If the AD type they use builds up its own expressions, that would probably be enough to explain the difference.

Stan math is made for a very specialized purpose, but I don’t really see any of their reasons it’s not generic enough as terribly valid. That’s not to say there are not reasons; there are some very good ones it’s not generic, like idk having a stack allocator with a bushel of pointer chasing in the chain() methods, having to initialize the AD tape (which is kind of annoying), and never deleting stack-allocated memory. But those are pretty well-known tradeoffs between performance and generic-ness.

Stan provides a mdivide_left(A,b) function to solve a system of linear equations which will always use the ColPivHouseholderQR decomposition internally to solve the system. Therefore, the algorithmic and the dco/c++/eigen measurements displayed in Figure 13 utilize this solver class. Although Stan also evaluates the adjoints symbolically, the implementation of dco/c++/eigen presented in Section 3.2.1 is faster. The reason for that is that Stan performs another decomposition in the adjoint run, while the implementation of dco/c++/eigen keeps the decomposition from the augmented primal run in memory and reuses it as described in Section 3.3.

Huh, could we just add the precomputed decomposition to the vari specialization of mdivide_left so we don’t compute it twice?
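Something like this sketch, done for the SPD case with LLT so the transpose solve is trivial. Illustrative only, not our actual vari layout:

#include <Eigen/Dense>

// Illustrative, not Stan's actual vari class. The point: factor A once in the
// forward pass, keep the factors alive, and reuse them in chain().
struct mdivide_left_spd_vari {
  Eigen::LLT<Eigen::MatrixXd> llt_;  // Cholesky factors, computed once
  Eigen::VectorXd x_;                // primal solution x = A^{-1} b

  mdivide_left_spd_vari(const Eigen::MatrixXd& A, const Eigen::VectorXd& b)
      : llt_(A), x_(llt_.solve(b)) {}

  // reverse pass: for symmetric A, adj(b) = A^{-1} adj(x) and
  // adj(A) = -adj(b) x^T, both reusing the stored factors,
  // so no second O(n^3) decomposition
  void chain(const Eigen::VectorXd& x_adj, Eigen::MatrixXd& A_adj,
             Eigen::VectorXd& b_adj) const {
    Eigen::VectorXd b_bar = llt_.solve(x_adj);
    b_adj += b_bar;
    A_adj -= b_bar * x_.transpose();
  }
};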

The current API implementation of Stan limits the user to prescribed underlying decompositions like ColPivHouseholderQR and FullPivHouseholderQR as stated above. dco/c++/eigen on the other hand allows the utilization of any of the supported Eigen decompositions.

Very valid! I’d be interested in seeing their integration. We could probs utilize a form of it with our var type.

tbh it kinda looks like they cherry picked a few things, but like I get it ya know that’s whatever. I like papers where we get called out, it’s like we are the big bad guy / final boss. Gives me another reason to wear sunglasses indoors.

It would also be nice to see precision comparisons between the calculations of the derivatives. They should be v v similar for simple things, though Cholesky and other solvers could be wily.

1 Like

Oh, we also have mdivide_left_ldlt

Thanks @syclik and @stevebronder for your interest in this.

I can confirm that this paper is related to the complex number talk from StanCon.

Extending Stan’s Automatic Differentiation capabilities using dco/c++ - https://www.nag.com/market/stancon-nag.pdf.

I used the Eigen/dco branch to generate the results in that presentation.

As I said in my presentation, the Stan Math Library does an amazing job of making AD user-friendly. In my view, the dco tool is complementary in the sense that it can be applied to a wider range of codes.

I would be interested to hear from anyone who has a C++ code for likelihood evaluation and would like to embed that code within Stan’s MCMC sampler.

2 Likes

Hi @philipm, in 2018 I wrote a fully test-passing, complete complex number implementation for Stan, available here:

git clone https://github.com/stan-dev/math.git
cd math
git fetch origin pull/789/head:sometempbranchname
git checkout sometempbranchname

I describe its architecture a bit in this post (by comparing it to something someone else is working on). Here is a test showing it doing a 2nd-order minimization (Hessian * -gradient) of the difference between the real and imaginary parts of the first eigenvalue of a rotation matrix. It forms the basis of another program I use at my workplace for offshore cranes.

I have an interest in automatic differentiation with Eigen, because the expression template types (with ScalarBinaryOp<...>) stored in auto expressions (which Eigen says we shouldn’t use, but I don’t listen) might possibly be walked to generate the ‘bulk’ adjoint graph update function (applied on write to a storage type like Matrix<var,...>) at compile time, rather than needing to be hard coded… I’ve just never attempted it.
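To illustrate the idea (a hedged sketch I haven’t wired into any AD tool): the static type of an Eigen expression already encodes the whole operation tree, so a trait can in principle dispatch on each node at compile time.

#include <Eigen/Dense>
#include <type_traits>

// The static type of (a + b) is roughly
// CwiseBinaryOp<internal::scalar_sum_op<double, double>,
//               const MatrixXd, const MatrixXd>,
// so a trait can pattern-match the root node of the tree.
template <typename T>
struct is_sum_expr : std::false_type {};

template <typename L, typename R>
struct is_sum_expr<Eigen::CwiseBinaryOp<
    Eigen::internal::scalar_sum_op<double, double>, L, R>> : std::true_type {};

int main() {
  Eigen::MatrixXd a(2, 2), b(2, 2);
  auto expr = a + b;  // expression template, not an evaluated Matrix
  static_assert(is_sum_expr<typename std::decay<decltype(expr)>::type>::value,
                "an AD tool could dispatch on this node to emit adjoint code");
}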

As requested above, I’ve sent you an email at info@stce.rwth-aachen.de from my work email address for access to the software.

Do you have any plans for mixed forward and reverse mode calculations, similar to stan’s fvar<var>?

1 Like

Thanks Phil!

If yinz are able to share your code I’d be very interested in seeing the design! Though if it’s not going to be released under a BSD-compatible license, then I’m not sure I can look at it.

Also ty for adding the work to Eigen! Once 3.4 is released we will definitely be looking at the solver stuff

The current version of dco has Tangent-of-Adjoint mode for computing second derivatives. Does that answer your question @ChrisChiasson?

The references to the complex number implementation for the Stan Math library are useful as this is something I have looked for in the past, and not found.

@stevebronder - send an email to info@stce.rwth-aachen.de if you would like to get access to the code and/or get further details of licensing.

1 Like

Yes, thank you. This is used / usable inside Eigen-AD? BTW, is this something where the library is open source and you sell support for it? Or, is the source not publicly available (as this would stop us from using it)?

Yes, Tangent-of-Adjoint mode can be used on the Eigen-AD / dco branch. I have asked Patrick and Uwe what their plans are around releasing the software, but I have not had anything confirmed yet. I had a short email from Uwe (the group leader in Aachen) saying he is currently travelling and he will respond to the requests that have been coming in early next week.

Hi,

I really appreciate your interest in Eigen-AD. Also thanks to @philipm for his responses so far. I hope that I can add some more info:

First off, I agree that sharing the source code is important for a detailed discussion. As Philip said, we are working things out at the moment and hopefully will be able to provide you with the Eigen-AD repository next week.

Now, some responses to @stevebronder’s notes:

You are right about that; the auto return type deduction is the most intrusive change to the Eigen code, and I don’t think it will be backported in its current form. It might be possible to rewrite it using a struct, but that would require explicitly defining all used expression templates. I guess that’s something to work out together with the Eigen core developers.

That pull request adjusted the solver class structure accordingly. It is still missing the important hooks to effectively specialize the solvers. I think that a future pull request should include the most important entry points added in Eigen-AD (for solvers and products).

The Cholesky solvers (LLT, LDLT) are displayed in Figure 9 in the arXiv paper; they showed the same behaviour as the other solvers for run time and memory usage. They weren’t used for the comparison with Stan; only mdivide_left (ColPivHouseholderQR) was.

(1) is correct; higher order code uses the symbolic implementations recursively, where the same memory is always used for the AD datatypes of different order. Only the lowest level copies the primal values to the passive datatype.

I think you misunderstood us here. We were talking about the non-symbolic, i.e. scalar-level, AD implementations. So if you just directly apply an AAD-by-overloading tool to the standard Eigen library, you will most likely fill up your RAM quite fast. That’s why the data labeled as ‘AD’ in the corresponding figure ends at n = 1000.

My benchmarks always seeded all output adjoints with ones and evaluated the first order adjoint model one time. I couldn’t find a similar feature in Stan, only a grad() function, which will obviously evaluate the adjoint model several times. I therefore changed the Stan benchmarks to always compute a single scalar value z (by summing up the original output values), effectively evaluating the same computation.
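In Stan terms, the workaround looks roughly like this (illustrative, with a toy stand-in for the benchmark function):

#include <stan/math.hpp>

int main() {
  using stan::math::var;
  // toy stand-in for a model with a vector output
  Eigen::Matrix<var, Eigen::Dynamic, 1> x(2);
  x << 1, 2;
  var two = 2.0;
  Eigen::Matrix<var, Eigen::Dynamic, 1> y = two * x;
  // summing collapses the outputs to a single scalar, so one reverse sweep
  // effectively seeds every output adjoint with 1
  var z = stan::math::sum(y);
  z.grad();
  // x(0).adj() == 2, x(1).adj() == 2
}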

Yeah, dco/c++ uses its own expressions.

When I looked at your code, I saw that you explicitly instantiate another ColPivHouseholderQR instance in the reverse run, which adds an O(n^3) run time overhead. In the Eigen-AD, or rather dco/c++/eigen, implementation shown, the solver instantiated in the forward run is kept alive for the reverse run. This way, the decomposition is saved instead of only the input matrix.

As mentioned in the paper, we acknowledge that Stan has way more symbolic features than Eigen-AD has at the moment. Regarding AD-O tools, Stan by far provides the most features for linear algebra with Eigen (at least from what we found). It therefore made sense to compare with you guys.

Our test system compares the symbolic adjoint features with the AD tangent versions at the moment, and they mostly match quite well. The SVD solvers have been observed to give slightly less precise tangent results, but yeah, that topic requires some more work.

@ChrisChiasson Using the Eigen expression tree for compile-time differentiation is indeed something we have thought about, but we haven’t spent any time on it yet.

4 Likes

Thanks Pat! Apologies if my note above came off the wrong way at all, I was on an airplane and was just sort of skimming through.

Yeah, the only way I could see this working is if Eigen had some macro like

#if __cplusplus >= 201402L  // bare auto return type deduction needs C++14
#define RETURN_TYPE(x) auto
#else
#define RETURN_TYPE(x) x
#endif
// ... l8r (the alias dodges commas in the macro argument)
template <typename T>
using DenseMat = Eigen::Matrix<T, Eigen::Dynamic, Eigen::Dynamic>;

template <typename T>
RETURN_TYPE(DenseMat<T>) DenseMatFunc(const DenseMat<T>& m);

or something of that ilk

Yes, my apologies. I think my intent with that comment was that I’d personally like to see benchmark comparisons between Stan’s Cholesky and yours (we spent a good bit of time on it haha), not to say that it detracts from the paper itself.

Ah ic, yes, that was my misunderstanding. It might be good to be more explicit or have a footnote. Though I’m not an expert, so if that verbiage is the norm then all is well.

Yeah, it looks like it might be an algorithmic issue with how we do mdivide_left. We should probably flag that and look into how to be more performant.

Yes, ftr I’m very happy about your folks’ work! I was pretty excited to see the benchmarks; feature rich, but we definitely have places to improve.

The solver precision is a tough cookie. For the OpenCL stuff we just checked that our Cholesky implementation was within some tolerance X of our current version of the Cholesky derivative and Eigen’s Cholesky results. But eod that felt unsatisfying. Not related to the paper, but it would be nice to have a better way to generate good/bad matrices, have some rigorous solutions computed, and compare against those for a bunch of the solvers. Feels like there could be insights to glean from that.

Overall very cool though, lmk when/if you can share!

1 Like

That’s right that they haven’t jumped to C++11 yet. They made a fork of Eigen.

Yes, and now the forward mode is thoroughly tested, so we can deploy it. The only problem is that it won’t work for our solvers like ode_integrate, etc., because we haven’t fully templated them for forward mode (at least I don’t think we have templated enough of the operations to do forward mode).

Any idea what they mean by this?

What’s the example? std::vector shouldn’t be any different than anything else.

What’s not generic? Sorry I’m being so lazy and letting you digest this.

You need arena-based memory management to do reverse-mode efficiently. And you need pointer chasing in chain() or some equivalent, because there’s no way to do arbitrary expression types statically (not that everyone coming into the project doesn’t try at some point before they realize the problem). So there needs to be at least one pointer chase. Now what we can do with adjoint-Jacobian products is make it so there’s not one pointer chase per var, but per structure.
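Roughly the “per structure” idea, with made-up class names rather than our actual tape:

#include <Eigen/Dense>

// Illustrative, not Stan's actual internals: one tape node (one virtual call,
// one pointer chase) covers the whole operation y = A * x, instead of one
// node per scalar element of y.
struct vari_base {
  virtual void chain() = 0;
  virtual ~vari_base() {}
};

struct matvec_vari : vari_base {
  Eigen::MatrixXd A_;      // data operand, stored for the reverse pass
  Eigen::VectorXd x_adj_;  // adjoints of the input vector
  Eigen::VectorXd y_adj_;  // adjoints of the output vector

  explicit matvec_vari(const Eigen::MatrixXd& A)
      : A_(A),
        x_adj_(Eigen::VectorXd::Zero(A.cols())),
        y_adj_(Eigen::VectorXd::Zero(A.rows())) {}

  void chain() override {
    // one adjoint-Jacobian product for the whole structure:
    x_adj_ += A_.transpose() * y_adj_;
  }
};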

Yes, and we should, if it’s not already there. The tricky part for us is embedding arbitrary Eigen types in our stack allocator, which only does malloc-style contiguous memory allocation and never calls destructors in the usual case. There are some internal ways of putting things on a stack that will call destructors, and I can help if anyone wants to go down this route. But I was really hoping we could get rid of that extra autodiff stack.
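The hazard, in a toy bump-pointer arena (not our actual stack_alloc, just the shape of the problem; alignment handling omitted):

#include <Eigen/Dense>
#include <cstddef>
#include <new>
#include <vector>

// Toy bump-pointer arena: hands out raw bytes, frees everything at once,
// never runs destructors.
struct arena {
  std::vector<char> buf_;
  std::size_t top_ = 0;
  explicit arena(std::size_t n) : buf_(n) {}
  void* alloc(std::size_t n) {
    void* p = buf_.data() + top_;
    top_ += n;
    return p;
  }
  void reset() { top_ = 0; }  // recovers arena bytes only
};

int main() {
  arena a(1 << 12);
  // Placement-new an Eigen matrix into the arena: the object header lives in
  // the arena, but MatrixXd owns a separate heap buffer...
  auto* m = new (a.alloc(sizeof(Eigen::MatrixXd))) Eigen::MatrixXd(100, 100);
  (void)m;
  a.reset();  // ...and ~MatrixXd never runs, so that heap buffer leaks
}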

:-) As Andy Warhol said,

“Don’t pay any attention to what they write about you. Just measure it in inches.”

I probably should’ve looked this up before trying to tackle std::complex specializations and overloads.

Indeed. We don’t want to be sued by patent trolls with proprietary software. Now I’m not accusing anyone of being a patent troll here—we just need to be careful as a big project. Also, as Steve says, when things hit Eigen open source, we can use them. So thanks for that.

I don’t understand the functionality. What does it mean to set multivariate output’s adjoints to 1? What is the resulting adjoint on the independent variables?

Stan has a pretty full set of functionals, including:

  • stan/math/fwd/mat/functor/gradient.hpp

  • stan/math/rev/mat/functor/gradient.hpp

  • stan/math/fwd/mat/functor/hessian.hpp

  • stan/math/mix/mat/functor/hessian.hpp

  • stan/math/fwd/mat/functor/jacobian.hpp

  • stan/math/rev/mat/functor/jacobian.hpp

  • stan/math/mix/mat/functor/derivative.hpp

  • stan/math/mix/mat/functor/grad_hessian.hpp

  • stan/math/mix/mat/functor/grad_tr_mat_times_hessian.hpp

  • stan/math/mix/mat/functor/gradient_dot_vector.hpp

  • stan/math/mix/mat/functor/hessian_times_vector.hpp

  • stan/math/mix/mat/functor/partial_derivative.hpp

The versions in fwd are less efficient, but are fully differentiable.
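For anyone new to these, a small example of the reverse-mode gradient functional in use (the signature follows stan/math/rev/mat/functor/gradient.hpp as of this era, so exact headers and paths may differ in later versions):

#include <stan/math.hpp>

// functor written against a generic scalar so it works for double, var, fvar
struct my_fun {
  template <typename T>
  T operator()(const Eigen::Matrix<T, Eigen::Dynamic, 1>& x) const {
    return stan::math::dot_self(x);  // f(x) = x'x
  }
};

int main() {
  Eigen::VectorXd x(2);
  x << 1, 2;
  double fx;
  Eigen::VectorXd grad_fx;
  stan::math::gradient(my_fun(), x, fx, grad_fx);
  // fx == 5, grad_fx == 2 * x == (2, 4)
}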

This is a memory overhead/speed tradeoff, and we should probably make it now for this function and others. When I first wrote our autodiff, I tried very hard to set it up so we could implement adjoint-Jacobian products efficiently and lazily. Back when we autodiffed through algorithms like ODE solvers, it was much more important to minimize memory usage.

I’m not sure what this means. Stan calculates analytic partials for each function, but then the entire system runs by autodiff, not symbolic differentiation of an expression (like, say, Theano does).

Absolutely. I would much rather do this than drop in random tests and hope we hit edge cases eventually.

1 Like

I’m currently finishing up the tech guide for Eigen-AD and can then provide a link. Just for clarification:

Eigen-AD is NOT a full-blown AD tool. It is a framework for handling Eigen within AD (by overloading) tools.

The results from the arXiv paper were obtained using the commercial AD tool dco/c++ developed by NAG Ltd. in close collaboration with our group. dco/c++ uses Eigen-AD. The latter is not specifically aimed at dco/c++. Other tools can - and hopefully will - use it.

The Eigen-AD code will be published under the same license as Eigen (MPL2). It does not contain dco/c++.

We appreciate your feedback and are happy to further discuss the design and changes in Eigen-AD. A link to the repository will be posted within a few days.

1 Like

That’ll only be useful if it’s open access.

That’s great news. Thanks for the clarification. I realized anything that got into Eigen would be under their license but I didn’t realize that’d include most of your Eigen-AD improvements.

The Eigen-AD repository is now public and can be reached here.

See README.md for some hints where to look for documentation.

5 Likes