I don’t even see a specialization in rev, only the top-level implementation. Looking at the code, I can see why it’d be hard to get any speedups: it delegates each output to a dot-product calculation, which is pretty optimal as written. The only savings would come from avoiding some intermediate copies, but those aren’t so bad compared to all the other ad-hoc indexing going on (which is very hard to cache). Beyond avoiding some big matrix-sized copies and allocations, there’s really not much left to gain. It could perhaps be optimized by pulling all the chain calculations into a single node, but that wouldn’t be a big savings and it’s very complicated.

The most obvious speedup is to avoid this pattern:

```
Eigen::Matrix<result_t, Eigen::Dynamic, 1> b_sub(idx);
b_sub.setZero();
```

If `result_t` is `var`, then you get `idx` allocations on the autodiff stack which are immediately overwritten by `idx` copies of zero. What you really want to do is this:

```
auto b_sub = rep_vector(result_t(0), idx);
```

Or you could spell out the whole result type instead of `auto`; that doesn’t change anything.