MXNet white paper "Auto-Differentiating Linear Algebra"


#1

This arxived white paper Auto-Differentiating Linear Algebra
is relevant for us. Focus is in the MXNet implementation but there is some useful discussion of general implementation issues. Important parts of the code are in C++ and using BLAS + LAPACK for CPU and CUBLAS+CUSolver for GPU.


#2

mxnet.autograd is so convenience for me, it is very simple


#4

First of all, what makes you think they’re more efficient for the kinds of computations we do in Stan? For example, Stan is much faster than Greta, even though Greta uses TensorFlow for autodiff.

Second, I’m not sure what you mean by “dedicated framework for deep learning”. Are you talking about something like Keras that specifies deep neural networks, or something more flexible like Edward (general static autodiff from TensorFlow) or even more flexible like Pyro (general dyanmic autodiff from PyTorch)?

It’s almost always easier to optimize for a specific application. If you only need to do logistic backprop, the derivatives are all simple arithmetic and it’d be silly to use autodiff to calculate them.

P.S. If you were worried about Edward reporting being faster than Stan, that was all for a very simple logistic regression involving an easily parallelizable matrix multiply. As soon as Stan’s GPU and MPI code lands, we’ll be competitive and probably faster than Edward because we’ve optimized so much of our underlying code for statistics. Of course, they’re not standing still, and have all of our work to build on. So I expect this all to be fun going forward. I think most of the people building these systems are on good terms with one another.


#6

Not really. They also have a ton of overhead running through Python. And they haven’t specialized things the way we have for statistics functions, though it’s my understanding that Matt Hoffman and Rif Saurus’s group at Google are working on that as part of their Bayesflow project. That’s what I meant when I said I expect the projects to leapfrog each other.

They also run at 32-bit precision, so they’re not even computing the same results.

And TensorFlow isn’t flexible enough to compile Stan models. PyTorch would be if we stuck to their relatively limited built-ins.

Like I said above, Greta uses TensorFlow, and Stan is way faster. But that may just be sloppy R-to-TensorFlow coding. Stan’s already faster than Edward on a single CPU, too, at least judging from their reports.


#8

https://www.boost.org/doc/libs/1_68_0_beta1/doc/html/boost_yap/manual.html#boost_yap.manual.examples.autodiff

But it is unclear whether their expression templates will work with g++-4.9 because it relies on Hana which seems to use all of the tricks in the C++14 standard. Still, it is quite likely that RTools will bump its g++ version before we could rewrite Stan Math with expression templates.


#10

Yeah, Eigen uses expression templates and lets the compiler optimize away temporaries and other unnecessary loops. It doesn’t have much to do with auto though and the Eigen devs actually discourage using auto
http://eigen.tuxfamily.org/dox/TopicPitfalls.html
But I too am hopeful that if Stan could use expression templates then the autodiff could be made faster or at least if someone does

target += foo;
target += bar;

that it is no worse than

target += foo + bar;

#12

It’s hard to have a linked discussion when you delete posts.

That only accelerates single expressions, but yes, we could do that. It does it by doing autodiff in memory as part of the expression template reduction. It’s a kind of checkpointing that doesn’t need to go to virtual operations. Very cool, but it’s not clear it’s worth all the effort of implementation as we don’t usually have very complex expressions and it’s a ton of work to do in full generality. But if we had more time, we could probably reingineer our whole base autodiff to do what Adept does with templates. It would certainly speed some operations up substantially.

Adept’s been around since before C++14. It came out about the same time as Stan’s autodiff. It may have been updated more recently to break that backward compatibility.

In Stan, you can build arbitrarily complex functions with analytic gradients. That’s always the way to go if possible, ideally in an adjoint-Jacobian product implementation so the whole Jacobian doesn’t need to be precomputed and stored.

If you have static autodiff (i.e., no parameter-dependent conditionals) then you can make some progress with optimization. A whole lot of the autodiff literature is devoted to this, particularly in terms of sparse Jacobians and Hessians.

The tack we’re going to take is to optimize the expression graphs, becuase we can do that statically.


#13

That’s right, but it’s not fully general, in that not all expression templates play nicely with each other in Eigne. The block() expressions have been particularly problematic. But moving to auto for intermediates for us is a big win. We should also change how we pass a lot of matrix args to be more lazy about evaluation of expression templates. There are some mechanisms for this in Eigen, but it’s going to be a lot of work, and we’re shorthanded as it is given our to-do list.

I’ll have to take a look at this. Feel free to just jump in proactively and recommend things for us to read. You’re way better than the rest of us at ferreting this stuff out.

We don’t use static sizes other than 1 in Stan.

Not easily. Their expression templates are just for matrices. Stan’s would have to be way more general.

One thing you may be interested in is our MPI system that just came out in Stan 2.18 (CmdStan 2.18 is the only interface released so far—it takes PyStan and RStan a while to release after CmdStan these days and we decided not to hold up CmdStan releases waiting for the in-memory interfaces). That has the same kind of nice behavior of checkpointing (doing local reductions), which improves performance presumably by increasing memory locality.

I think you will find the former better than the latter. All target += does is put the argument into a stack which is then fed to sum() when everyting’s done. sum(x, y, ...) is more efficient than sum(x + y, ...).

I’d suggest they know what they’re doing and care is required. But as @bgoodri says, we’ve also been experimenting with some of this.

That can’t work—no way to infer the type of x no matter what the type of a + b is.


#14

I would have, but it (not Adept) requires more of C++14 than what g++-4.9.3 supports (I think). It does seem like the right way to go if one were starting from scratch.


#15

(post withdrawn by author, will be automatically deleted in 24 hours unless flagged)


#16

By concept, you mean Adept-style expression templates? Yes, that’s appealing, but not a priority given everything else we’re working on.

I don’t think there’s any way the compiler could figure out the algebra and memory locality to write efficient matrix derivatives.

The map function in 2.18 puts things on the stack, then evaluates it off them on the fly, but it’s not quite pushing things down to static operations.

I think you’ll find the adjoint-Jacobian framework that @bbbales2 is now implementing is going to make writing efficient matrix derivatives much easier. No temporary operations in the value calculation need go on the stack.