Hey folks! Super excited that DAEs and the integration with IDAS was added in the recent release by the brilliant @yizhang .
We’ve been putting it through its paces already with a large-but-reasonable DAE system (like ~200 states including initial values). As @wds15 mentioned on this thread a couple years ago, the adjoint method is really required to resolve the speed issues for large systems. Is this a priority for anyone right now? In looking to see if it’s something we could just take on internally, I think its a bit of a bear. Those on this thread are much more the experts. Would it help though if we were able to provide a small “feature bounty” in the form of dollars?
I’m lost on the “including initial values” part. Is the total number of equations 200, or is the number of parameters 200?
Stan parameters only like 5 for now, 200 equations indeed, sorry for the lack of precision there.
In that case I’m not sure how much dent adjoint solver can make. Potentially the speedup can be bought from other angles. Within the DAE solver there are things like matrix & nonlinear solver structure to exploit, and there are possibilities in the Stan modeling level. I understand the model is not to share, has it been run with a small number of iterations or even fixed param?
I guess it seemed to us in the code like forward mode basically has to compute this N x N matrix (so 200 x 200 here) of sensitivities that later get chain-ruled into the gradients of target that Stan cares about … Empirically, if we run it (IDAS-via-Stan) with Zero parameters, 1 Stan iteration is ~15 minutes for this model. But if we run that same model through IDA (as in, no-S-as-in-IDAS), it takes < 1 second, so we were thinking that those IDAS sensitivities are the source of the slowdown. And reverse mode is supposed to speed that up from N^2 → N?
The forward mode actually solves an 1000x1000 system in addition to the 200x200 system, as for each state we need solve another 5 sensitivity ODEs. In the adjoint solver run we’d solve the 200x200 twice (with some other complications that could impact performance). So I don’t think the difference in IDA & IDAS performance is mainly caused by the sensitivity calculation, especially your IDAS run was not using any parameters.
It’s possible that the vanilla control I coded in Stan is not efficient in your problem. One thing your team can do is to replace the IDAS controls in Stan with what’s used in IDA to see if it makes a difference, as without any parameter sensitivity calculation, IDAS is idential to IDA.
Oh wow, that’s so much longer!
Of course, you are free to fork the repo and add a bounty. In terms of the Stan-org, we have never tested this but I think it’s something we can discuss. Can you add a stan-math issue for what you want added to Stan and ask about adding a bounty in the issue? Are you thinking using https://issuehunt.io/? Or something else?
The other option is to pay one of the Stan-developers directly. I’m not sure we have a standard procedure for this, we should though!
@SGB I’ll add this topic of adding features by paying for developer time or bounties for discussion.
Edit: Issue is Funding Stan features · Issue #7 · stan-dev/sgb · GitHub
Thanks for the new funding issue. Cool!
Back in IDAS / DAE land…
In case anyone happens to know the answer, I think the part we’re sort of stuck on in implementing the adjoint sensitivity version of IDAS, is exactly how to implement equation (2.23) from sundials/idas_guide.pdf at 6ddce5d90084d8d1cbb8e12bb5a4402168325efe · LLNL/sundials · GitHub
It does look like this is parallel to the current CVODES implementation for Stan ODE adjoint mode (as opposed to, implementing equation (2.22), though that is also a question)
Specifically our question is that, (2.23) involves terms with both of \lambda and \lambda_T, but I think when we configure IDAS to run, that involves the choice of asking it to solve either (2.24) which gives us \lambda_T OR asking it to solve (2.19) which gives us \lambda, but it is not apparent to us how to translate between the two, at least not without a matrix inverse which is not defined in all cases.
(There is also the matter of how to compute y_p in general, but that’s not a showstopper for us right now)
It seems that term out front is important to get the derivatives of the Stan log density values correct w.r.t. the algebraic equations specifically, as the other tests pass. A version with a matrix inverse (which fails for other tests in general but…) for that first term of (2.23) does get us to within a factor of 2 for the algebraic equations’ tests, but no cigar.
The whole point of adjoint sens is that we can avoid solving for the integrand in 2.19 directly if we obtain \lambda by solving 2.20 (or get 2.24 by solving 2.25 if dg/dt is seeked instead of dG/dt.) We achieve this by solving for y forward in time and \lambda backward in time. \lambda is a vector (one for each state equation) and there’s no matrix inverse involved (except internally in the solver when solving the DAEs).
From 2.23 you’ll see that there’s no need of y_p except at t=t_0, so we won’t be solving y_p.
I’m still curious about the cause of the performance difference between IDA and IDAS when no sensitivity is involved, in that case we are just solving for y in the forward pass so the 15min vs 1 sec observation is worth much investigation.
Have a look at the adjoint ODE design doc:
https://github.com/stan-dev/design-docs/files/6274031/0027-adjoint-ode.pdf
I did try to align CVODES and Stan-math notation as good as possible. Maybe that helps.
here is the PR: