Hey, Bob and I have been occasionally talking about potential compiler/query-style optimizations for the autodiff expression graph, things like common subexpression elimination, node fusion, and a variety of peephole optimizations where we can recognize a pattern and replace it with something more efficient.
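To make the first of those concrete, here's a toy sketch of common subexpression elimination via hash-consing, purely illustrative and not how Stan math (or XLA) actually represents expressions: structurally identical subtrees get mapped to a single shared node, so `(x*y) + (x*y)` computes the product once.

```python
def cse(expr, table):
    """Hash-cons an expression tree into a DAG: structurally identical
    subexpressions map to one shared node id in `table`.
    Expressions are nested tuples like ("add", lhs, rhs); leaves are strings."""
    if isinstance(expr, str):                      # leaf variable
        key = expr
    else:
        op, *children = expr
        key = (op, *(cse(c, table) for c in children))
    if key not in table:
        table[key] = len(table)                    # assign a fresh node id
    return table[key]

def count_nodes(expr):
    """Node count if the tree is evaluated naively, with no sharing."""
    if isinstance(expr, str):
        return 1
    return 1 + sum(count_nodes(c) for c in expr[1:])

# (x*y) + (x*y): 7 nodes as a tree, but only 4 unique nodes after CSE,
# since the shared x*y product (and its leaves) are deduplicated.
e = ("add", ("mul", "x", "y"), ("mul", "x", "y"))
table = {}
cse(e, table)
print(count_nodes(e), len(table))   # prints: 7 4
```

A peephole pass would work similarly but pattern-match locally (e.g. rewrite `log(exp(x))` to `x`) instead of deduplicating globally.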
TensorFlow recently released an experimental library called XLA that does some of this kind of thing, plus JIT compilation to CPU or GPU. I haven't looked into it much, but I think it's solving a problem extremely similar to the Stan math library's, with many more resources behind it, so it might be worth learning what they're doing and implementing some of their ideas.
https://www.tensorflow.org/versions/master/experimental/xla/