Hey, Bob and I have occasionally been talking about potential compiler- or query-optimizer-style optimizations for the autodiff expression graph: common subexpression elimination, node fusion, and a variety of peephole optimizations, where we recognize a pattern and replace it with something more efficient.
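To make the common subexpression elimination idea concrete, here is a minimal sketch that does it by hash-consing at node-creation time. It is a toy in Python, not Stan math code: `Node`, `Graph`, and `make` are hypothetical names invented for illustration, and nothing here reflects how Stan's autodiff stack is actually laid out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True, eq=False)
class Node:
    op: str                       # e.g. "var", "add", "mul", "exp"
    args: Tuple["Node", ...] = ()
    name: str = ""                # only meaningful for "var" leaves


class Graph:
    """Builds expression nodes, reusing structurally identical ones."""

    def __init__(self):
        self._table = {}

    def make(self, op, *args, name=""):
        # Children are already canonical, so their identities are
        # enough to detect a repeated subexpression.
        key = (op, tuple(id(a) for a in args), name)
        node = self._table.get(key)
        if node is None:
            node = Node(op, tuple(args), name)
            self._table[key] = node
        return node

    def size(self):
        return len(self._table)


g = Graph()
x = g.make("var", name="x")
y = g.make("var", name="y")
a = g.make("exp", g.make("mul", x, y))
b = g.make("exp", g.make("mul", x, y))   # same subexpression, built again
total = g.make("add", a, b)

print(a is b)     # True: the second exp(x * y) reuses the first node
print(g.size())   # 5 nodes (x, y, mul, exp, add) instead of 7
```

The sketch assumes pure operations with no side effects, and it does not canonicalize commutative arguments, so `x * y` and `y * x` would still count as different nodes.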
TensorFlow recently released an experimental library called XLA that does some of this and also JIT-compiles to CPU or GPU. I haven’t looked into it much, but I think it’s solving a problem extremely similar to the Stan math library’s, with many more resources behind it, so it might be interesting to learn what they’re doing and implement some of their ideas.
Here are Bob’s comments on this:
I see two angles there. The JIT stuff and the peephole
optimization for folding compound operations. Certainly
we can do the latter, which will help us optimize more
naively written programs (not very good for bragging about
model speed, but probably good for users). Some of the
JIT advantages we get from C++ static analysis, but most
we won’t get or couldn’t use, like the GPU stuff (at least
until we find someone to plumb in the Eigen GPU stuff
under our matrix ops).
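As a rough illustration of the peephole idea for folding compound operations, the sketch below rewrites a naively written `log(exp(a) + exp(b))` into a single `log_sum_exp(a, b)` node, and `log(1 + x)` into `log1p(x)`. It is a toy in Python under stated assumptions: expressions are plain nested tuples, and `is_op`, `rewrite`, and `evaluate` are hypothetical helpers made up for this example. The two target functions do exist in the Stan math library, but this is not how Stan would implement the rewrite.

```python
import math


def is_op(expr, op):
    """True if expr is an operation node with the given operator name."""
    return isinstance(expr, tuple) and expr[0] == op


def rewrite(expr):
    """Rewrite children first, then try each pattern at this node."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a variable name or a numeric constant
    op, *args = expr
    args = [rewrite(a) for a in args]
    if op == "log" and is_op(args[0], "add"):
        _, lhs, rhs = args[0]
        # log(exp(a) + exp(b)) -> log_sum_exp(a, b)
        if is_op(lhs, "exp") and is_op(rhs, "exp"):
            return ("log_sum_exp", lhs[1], rhs[1])
        # log(1 + x) -> log1p(x)
        if lhs == 1.0:
            return ("log1p", rhs)
    return (op, *args)


def evaluate(expr, env):
    """Evaluate an expression tree given variable values in env."""
    if isinstance(expr, str):
        return env[expr]
    if not isinstance(expr, tuple):
        return float(expr)
    op, *args = expr
    vals = [evaluate(a, env) for a in args]
    if op == "add":
        return vals[0] + vals[1]
    if op == "exp":
        return math.exp(vals[0])
    if op == "log":
        return math.log(vals[0])
    if op == "log1p":
        return math.log1p(vals[0])
    if op == "log_sum_exp":
        m = max(vals)  # subtract the max so exp() cannot overflow
        return m + math.log(sum(math.exp(v - m) for v in vals))
    raise ValueError(f"unknown op: {op}")


naive = ("log", ("add", ("exp", "a"), ("exp", "b")))
fused = rewrite(naive)
print(fused)  # ('log_sum_exp', 'a', 'b')
```

Besides replacing four nodes with one, the fused form is numerically stabler: with `a = b = 1000.0`, evaluating `naive` raises `OverflowError` in `math.exp`, while `fused` evaluates to `1000 + log(2)`.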
If you read the Adept autodiff paper, you can see how
they use template metaprogramming to do something similar,
but one level of abstraction lower (it’s more like it
runs reverse mode statically over a local graph).
There’s a third angle, which is analyzing the entire
graph of operations for things like parallelism or sparsity.
I assume TensorFlow does a lot of the former, and there’s
a very large autodiff literature on the latter. Both can
be intractable problems to solve exactly, but
presumably there are useful heuristics like for other
Do you have any idea how TensorFlow does autodiff? I’d
love to benchmark what they have versus Stan and also
There is a former QMSS student who is interested in doing this and took an actual class on GPU development.
Point him to Discourse or my email; I’d be happy to help him figure out how to get started.
@seantalts @bgoodri Has there been any progress on this?
@rok_cesnovar and I have also now started discussing how to optimize and parallelize autodiff, in particular in combination with the GPU.