When solving an algebraic equation, we find y^* such that f(y^*, \theta) = 0, where \theta are our model parameters. We compute the Jacobian matrix of y^* with respect to \theta using the implicit function theorem:

\frac{\mathrm{d} y^*}{\mathrm{d} \theta} = - [J^y]^{-1} J^\theta,

where J^y is the Jacobian matrix of f with respect to y, etc.
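For readers following along, here is a minimal Eigen sketch of that formula, assuming J^y and J^\theta are already available as plain double matrices (the function and variable names are mine, not from the solver code):

```cpp
#include <Eigen/Dense>

// Sensitivities of the solution y* with respect to theta via the implicit
// function theorem: d y*/d theta = -[J^y]^{-1} J^theta.
// A linear solve avoids forming the inverse explicitly.
Eigen::MatrixXd solution_sensitivities(const Eigen::MatrixXd& Jy,        // N x N
                                       const Eigen::MatrixXd& Jtheta) {  // N x K
  Eigen::MatrixXd S = Jy.fullPivLu().solve(Jtheta);
  return -S;
}
```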
Currently, I compute J^y and J^\theta separately using Jacobian(), but it could be more efficient to compute \tilde J := [J^y, J^\theta] in one go (rough sketch below). The cost should be roughly halved, since it’s the same number of sweeps in reverse-mode autodiff.
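Here is roughly what I have in mind, assuming a system functor f(y, \theta) templated on its argument types and the stan::math::jacobian functional with its (f, x, fx, J) signature; combined_jacobian and f_combined are placeholder names, not existing code:

```cpp
#include <stan/math.hpp>
#include <Eigen/Dense>

// Sketch: compute J_tilde = [J^y, J^theta] in one pass by treating the
// concatenated vector z = (y, theta) as the single input of f, then
// partition the result into the two blocks.
template <typename F>
void combined_jacobian(const F& f,                    // f(y, theta), vector-valued
                       const Eigen::VectorXd& y,      // size N
                       const Eigen::VectorXd& theta,  // size K
                       Eigen::MatrixXd& Jy,           // N x N block (output)
                       Eigen::MatrixXd& Jtheta) {     // N x K block (output)
  int N = y.size();
  int K = theta.size();
  Eigen::VectorXd z(N + K);
  z << y, theta;

  // Wrapper that splits z back into (y, theta) before calling f.
  auto f_combined = [&](const auto& z_var) {
    return f(z_var.head(N).eval(), z_var.tail(K).eval());
  };

  Eigen::VectorXd fz;
  Eigen::MatrixXd J_tilde;  // N x (N + K)
  stan::math::jacobian(f_combined, z, fz, J_tilde);

  Jy = J_tilde.leftCols(N);
  Jtheta = J_tilde.rightCols(K);
}
```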
Any objections to this idea? If not, I’ll submit an issue, fix it, and do the PR.
That’s the way to do it – build up the full Jacobian and then partition it into the y and \theta blocks. It’s not so much about the number of sweeps here but rather the shared computations.
That said, there is the potential for some additional improvements. If you don’t compute everything at once, then you could compute J^{\theta} v in a single reverse-mode sweep (a sketch of one way to do that follows this paragraph). If you compute and invert J^{y} separately, you could then compute J v with an additional matrix-vector product and implement the function using the Jacobian-vector product formalism.
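One way to get that single-sweep product without touching the autodiff internals is to differentiate the scalar v^T f(y^*, \theta): its gradient with respect to \theta is exactly the adjoint product v^T J^\theta, which is how I read J^{\theta} v here. A sketch, assuming the stan::math::gradient and stan::math::dot_product functionals and a functor f templated on its argument types (the function name is mine):

```cpp
#include <stan/math.hpp>
#include <Eigen/Dense>

// Sketch: the adjoint product v^T J^theta in a single reverse-mode sweep,
// computed as the gradient of the scalar g(theta) = v' * f(y_star, theta).
// y_star is passed as plain doubles, so only theta is treated as active.
template <typename F>
Eigen::VectorXd adjoint_jacobian_product(const F& f,
                                         const Eigen::VectorXd& y_star,
                                         const Eigen::VectorXd& theta,
                                         const Eigen::VectorXd& v) {
  auto g = [&](const auto& theta_var) {
    return stan::math::dot_product(v, f(y_star, theta_var));
  };
  double gx;                   // value of g, unused downstream
  Eigen::VectorXd vt_J_theta;  // size K
  stan::math::gradient(g, theta, gx, vt_J_theta);
  return vt_J_theta;
}
```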
One could consider going even further and using reverse mode to compute products of the form J^{y} b for an iterative solution of J^{y} b = J^{\theta} v, but my guess is that the autodiff overhead would render that less efficient than just computing J^{y} at once and explicitly inverting.
There’s potential for piggybacking if Newton’s iteration is used to solve for y^{\ast}, since J^y is required anyway. When that’s the case, you don’t want to always compute J^y and J^{\theta} at the same time, because J^y gets updated much more frequently. So maybe design accordingly based on the scenario, assuming a Newton solver is still on the list.
@yizhang Newton solver is still on the list (I’ve just had a busy semester), and Powell’s method also uses derivatives, so your comment applies there too. In the solver, J^y is computed at each iteration except at the solution, and it’s J^y at the solution that we need for the sensitivities. The computation of J^y and J^\theta at the solution only happens inside the chain() method, so there shouldn’t be any redundant calculations.
@Bob_Carpenter +1 for forward mode.
I’ll test the efficiency gain. When I say halved, I mean for computing the sensitivities, but at this point, most of the cost comes from solving the algebraic equation. I’ll check this too.
@betanalpha Calculating J^y v would work. Here v is the adjoint of y^* with respect to the log density, and is a member of the vari class. We would need a version of the Jacobian() function that lets you specify an initial cotangent vector. Is there a way of doing this without digging too deep into Stan’s autodiff?
On the other hand, I’m not sure computing J^\theta and J^y v is more efficient than computing \tilde J, but it’s worth trying out. What could also be valuable is computing J^y y^{(i)} in one sweep at each iteration of the solver.
I like the idea of solving J^y b = J^\theta v for the sensitivities directly.
Yeah, this is all of the Jacobian-adjoint product infrastructure that @bbbales2 implemented. I thought there was a nice wiki page demonstrating the functionality but I couldn’t find it.
If only J^{y} y^{i} is needed for the Powell updates then you should definitely only compute that!
Relative efficiency can be worked out using the known scalings of reverse mode. If there are N states and K parameters, then \tilde{J}, an N \times (N + K) matrix, can be computed in N reverse-mode sweeps. J^{\theta} v can be computed in one reverse-mode sweep. J^{y} would require N reverse-mode sweeps, but they would be cheaper than the full reverse-mode sweeps for \tilde{J} because the parameters would be treated as constants. Ultimately there’s going to be some transition for large enough K where computing the Jacobian-adjoint product is more efficient than computing the full \tilde{J}.
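To put the same argument in formulas (the notation c(f) and c_y(f) is mine, just to summarize the scalings above): write c(f) for the cost of one evaluation of f with both y and \theta active, which is also roughly the cost of one reverse-mode sweep, and c_y(f) for a sweep with \theta treated as constant. Then, roughly,

\mathrm{cost}(\tilde{J}) \approx N \, c(f), \qquad \mathrm{cost}(J^{\theta} v) \approx c(f), \qquad \mathrm{cost}(J^{y}) \approx N \, c_y(f),

and since both routes still have to solve with J^{y}, computing J^{y} plus one adjoint sweep beats computing the full \tilde{J} roughly when N \, (c(f) - c_y(f)) > c(f), which happens for large enough K because c(f) - c_y(f) grows with the number of parameters.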
Ok. I think the devil will be in the details here. If memory serves, Eigen’s function for Powell’s method accepts J^y as an argument, but not a method to compute J^y y. But I need to break the code open and take a closer look.
EDIT: you don’t need J^y y^i for the update, but rather something of the form [J^y]^{-1} f(y^i), akin to what is used in Newton’s method. So computing the directional derivative at each iteration is not an interesting path.
EDIT 2: A small aside: Eigen does a lot of interesting manipulations on J^y (see the file HybridNonLinear.h in Eigen’s unsupported module). It notably computes its QR decomposition to obtain [J^y]^{-1} f(y^i).
For the INLA optimization problem, I think we would gain a lot from computing J v with the Jacobian-vector product formalism. Indeed, we have J^y in analytical form. J^\theta, on the other hand, is computed in n sweeps, where n is the dimension of y. In practice y is high-dimensional, so this computation is very inefficient. With the right initial cotangent, we can do this in one sweep.
Now, let l be the log density and f the algebraic system. I’ll use the convention that J^x_y is the Jacobian of x with respect to y. From the chain rule and the implicit function theorem:

J^l_\theta = J^l_{y^*} J^{y^*}_\theta = - J^l_{y^*} [J^f_{y^*}]^{-1} J^f_\theta = w \, J^f_\theta, \qquad w := - J^l_{y^*} [J^f_{y^*}]^{-1},

where w is the wanted initial cotangent. I’m not sure how useful this is in general, but I’ll try it out on the specialized solver for the INLA problem.
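Concretely, since J^f_{y^*} (the J^y from earlier) is available analytically, w comes from one linear solve and the remaining product w J^f_\theta from a single reverse-mode sweep. A sketch under the same assumptions as before (generic functor f, stan::math::gradient and stan::math::dot_product; all names are mine):

```cpp
#include <stan/math.hpp>
#include <Eigen/Dense>

// Sketch for the INLA case: J^y is analytical, so the cotangent
// w = -J^l_{y*} [J^y]^{-1} comes from one linear solve, and w J^theta
// follows from a single reverse-mode sweep over f with y* held fixed.
template <typename F>
Eigen::VectorXd log_density_sensitivities(
    const F& f,                         // algebraic system f(y, theta)
    const Eigen::VectorXd& y_star,      // solution of f(y*, theta) = 0
    const Eigen::VectorXd& theta,       // size K
    const Eigen::MatrixXd& Jy,          // analytical J^y = J^f_{y*}, N x N
    const Eigen::VectorXd& grad_l_y) {  // (J^l_{y*})^T, i.e. dl/dy at y*
  // w^T solves [J^y]^T w^T = -(J^l_{y*})^T.
  Eigen::VectorXd w = Jy.transpose().fullPivLu().solve(-grad_l_y);

  // One reverse sweep: the gradient of w' * f(y*, theta) w.r.t. theta
  // is the row vector w J^theta = J^l_theta.
  auto g = [&](const auto& theta_var) {
    return stan::math::dot_product(w, f(y_star, theta_var));
  };
  double gx;
  Eigen::VectorXd dl_dtheta;  // size K
  stan::math::gradient(g, theta, gx, dl_dtheta);
  return dl_dtheta;
}
```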