I don’t even see a specialization in rev, only the top-level implementation. Looking at the code, I can see why it’d be hard to get any speedup: it delegates each output to a dot-product calculation, which is pretty optimal as written. The only saving would be in avoiding some intermediate copies, but those aren’t so bad compared to all the other ad-hoc indexing going on (which is very hard to cache). Beyond avoiding some big matrix-sized copies and allocations, there’s nothing else to be gained. It could perhaps be optimized by pulling all the chain calculations into a single node, but that wouldn’t be a big saving and it’s very complicated.
The most obvious speedup is to avoid this pattern:
Eigen::Matrix<result_t, Eigen::Dynamic, 1> b_sub(idx);

If result_t is var, then you get idx allocations on the autodiff stack, which are quickly replaced by idx copies of zero. What you really want to do is this:
auto b_sub = rep_vector(result_t(0), idx);
Or you could spell out the whole result type instead of using auto; that doesn’t change the behavior.