@anon75146577 here’s the addition operator you ordered! @Bob_Carpenter – you’ll be interested in the benchmarks at the end.
```cpp
struct AddFunctor {
  template <std::size_t size>
  Eigen::VectorXd operator()(const std::array<bool, size>& needs_adj,
                             const Eigen::VectorXd& x1,
                             const Eigen::VectorXd& x2) {
    check_size_match("AddFunctor::operator()", "x1", x1.size(), "x2", x2.size());
    return x1 + x2;
  }

  template <std::size_t size>
  auto multiply_adjoint_jacobian(const std::array<bool, size>& needs_adj,
                                 const Eigen::VectorXd& adj) {
    return std::make_tuple(adj, adj);
  }
};
```
This compiles against the current stan-dev/math develop branch.
It's equivalent to this prim implementation:
```cpp
auto AddFunctorAutodiffed = [](auto& x1, auto& x2) {
  check_size_match("AddFunctorAutodiffed::operator()", "x1", x1.size(), "x2", x2.size());
  return (x1 + x2).eval();
};
```
And in the spirit of Checking That Things We Write Actually Work, I ran some benchmarks. I compared the prim implementation above to the adj_jac_apply implementation. I also coded up an “inefficient” adj_jac_apply that computes a full Jacobian as a comparison.
I expected the autodiff and the efficient adj_jac_apply versions to both be fast, with adj_jac_apply faster for large vectors (because there are far fewer chain calls). It turns out adj_jac_apply is about 20% slower than the purely prim implementation. I guess that means my processor is better at virtual function calls than I gave it credit for, or that shuffling around the double datatypes in adj_jac_apply is more expensive than I thought. These are the numbers:
The inefficient implementation is, of course, bad. I guess this should serve as a warning: unless this stuff is used craftily, you can still end up slowing your code down:
I'm going to compare prim vs. adj_jac_apply implementations of a more complicated function to get a better handle on this (simplex_constrain, but I'll get to that later). Looks like we're going to need to be careful when using this to make sure the complexity of our adj_jac_apply doesn't exceed that of the regular autodiff! It's sneakily efficient, it seems.
Full test benchmark code is here: https://gist.github.com/bbbales2/a1689764f0fda6df561e858026f4e8d9