@bgoodri sent this and I thought I’d share:
See the Eigen docs for Eigen::ScalarBinaryOpTraits< ScalarA, ScalarB, BinaryOp >, but this actually already works in Stan somehow (I think via NumTraits); see this function, which does not have to promote doubles to vars: https://github.com/stan-dev/rstanarm/blob/master/inst/include/csr_matrix_times_vector2.hpp
I’d have thought csr_matrix_times_vector2 would have analytic gradients and need that distinction anyway. Or does it only do analytic gradients for a double matrix times a var vector?
double CSR matrix times var vector was slower than double CSR matrix times double vector with precomputed gradients, so I commented out that specialization in the .hpp file.
But the point is that now we don’t have to overpromote and we don’t absolutely have to have analytic matrix calculus.
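To make the "no overpromotion" point concrete, here is a toy, self-contained sketch of the idea behind Eigen::ScalarBinaryOpTraits: a traits specialization declares what scalar type results from mixing two scalar types, so a double entry can multiply a var entry directly without first wrapping every double in a var. The var struct and return_type template below are illustrative stand-ins, not Stan's or Eigen's actual types.

```cpp
#include <type_traits>

// Toy stand-in for stan::math::var; the real type wraps a pointer onto
// the autodiff stack. Purely illustrative.
struct var {
  double val;
  var(double v = 0.0) : val(v) {}
};

// The idea behind Eigen::ScalarBinaryOpTraits: declare the scalar type
// that results from combining two (possibly different) scalar types, so
// a double-valued matrix can multiply a var-valued vector directly
// instead of promoting every double entry to var first.
template <typename A, typename B>
struct return_type { using type = var; };  // any mix involving var yields var
template <>
struct return_type<double, double> { using type = double; };

// Mixed-scalar multiply: the double operand is never wrapped in a var.
inline var operator*(double a, const var& b) { return var(a * b.val); }

static_assert(std::is_same<return_type<double, var>::type, var>::value,
              "only the result is var-typed, not the double operand");
```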
I don’t even see a specialization in rev, only the top-level implementation. Looking at the code, I can see why it’d be hard to get any speedups: it delegates each output to a dot-product calculation, which is pretty optimal as written. The only saving would be in avoiding some intermediate copies, but those aren’t so bad compared to all the ad-hoc indexing going on (which is very hard to cache). Really, there’s nothing else to be gained beyond avoiding some big matrix-sized copies and allocations. It could perhaps be optimized by pulling all the chain calculations into a single node, but that wouldn’t be much of a savings and it’s very complicated.
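For reference, the row-by-row dot-product structure described above looks roughly like this minimal, self-contained plain-double sketch (field names are illustrative, not Stan's actual csr_matrix_times_vector implementation):

```cpp
#include <vector>

// Minimal CSR (compressed sparse row) matrix: each output entry is an
// independent dot product over that row's stored nonzeros. Illustrative
// layout, not Stan's actual implementation.
struct csr_matrix {
  int rows;
  std::vector<double> vals;  // nonzero values
  std::vector<int> cols;     // column index of each nonzero
  std::vector<int> row_ptr;  // rows + 1 offsets into vals/cols
};

std::vector<double> csr_times_vector(const csr_matrix& A,
                                     const std::vector<double>& b) {
  std::vector<double> out(A.rows, 0.0);
  for (int i = 0; i < A.rows; ++i)
    // The ad-hoc column indexing b[A.cols[k]] is what makes this
    // access pattern hard to cache.
    for (int k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
      out[i] += A.vals[k] * b[A.cols[k]];
  return out;
}
```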
The most obvious speedup is to avoid this pattern:

Eigen::Matrix<result_t, Eigen::Dynamic, 1> b_sub(idx);

If result_t is var, then you get idx allocations on the autodiff stack which are quickly replaced by idx copies of zero. What you really want to do is this:

auto b_sub = rep_vector(result_t(0), idx);

Or you could spell out the whole result type instead of auto, which doesn’t change the behavior.
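To see the allocation difference, here is a toy, self-contained model of the claim above: a stand-in var whose constructors count "autodiff stack allocations" while copies reuse the existing node. The counting mimics the described behavior; stan::math::var's internals differ, and std::vector stands in for the Eigen vector.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in: each constructed var counts as one autodiff-stack
// allocation; copies share the existing node. Illustrative only.
struct var {
  static std::size_t allocations;
  double val;
  var() : val(0.0) { ++allocations; }
  explicit var(double v) : val(v) { ++allocations; }
  var(const var&) = default;             // copy reuses the node
  var& operator=(const var&) = default;
};
std::size_t var::allocations = 0;

// The pattern to avoid: default-construct idx vars, then overwrite
// each with a fresh zero, for 2 * idx allocations total.
std::size_t sized_then_filled(int idx) {
  var::allocations = 0;
  std::vector<var> b_sub(idx);           // idx allocations
  for (auto& v : b_sub) v = var(0.0);    // idx more, all discarded
  return var::allocations;
}

// The rep_vector-style pattern: construct one zero, copy it idx times.
std::size_t rep_vector_style(int idx) {
  var::allocations = 0;
  std::vector<var> b_sub(idx, var(0.0)); // 1 allocation, idx copies
  return var::allocations;
}
```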