Yep, it is working. The 188507 s with AD became 73762 s with analytic Jacobian. Will see what MPI shows.
OK, numbers are intriguing:
679 m - analytic Jacobian, MPI, 8 cores - 8 patients, NCP
2196 m - AD, MPI, 8 cores - 8 patients, NCP
2764 m - AD, MPI, 8 cores - 8 patients, centered parametrization
Looks good. How many states and how many parameters are you fitting?
34 states 10 parameters
That’s not how IP works for code. You own the copyright to what you write. Now whether you have the right to distribute it depends on whether you have the proper licenses to the pieces you used from elsewhere. Other people’s ability to use your modified code legally requires you to license it.
That will cause fewer log density and gradient calcs per iteration and will lead to higher effective sample size per iteration. Everything needs to be measured in ESS/second, since that’s the measure of the effective sample size you get when you’re done.
Could we add a new ode_integrate function that takes in a second function to provide analytic derivaties?
I’m not so worried about people making mistakes and would like to make this easier since these models are so time-bound.
Yes! @wds15 and I briefly talked about that at the end of last week’s meeting. I haven’t gotten around to writing a good issue for it.
For the case with known initial conditions the only place where analytic jacobian is used is rev/arr/functor/coupled_ode_system::operator() and rev/math/functor/cvodes_ode_dat::jacobian_states. If interested I have code which replaces both methods with the analytic jacobian calculation.
Actually 34 states and 100 parameters including random effects etc.
We would have to provide two additional arguments, I think. One for the Jacobian wrt to states and one wrt to parameters.
This would be very useful indeed as I am seeing 2-3x speedups from this relatively consistently. Before embarking on this we should try forward mode for taking the nested derivatives during ODE integration. Maybe this cuts down the additional burden already.
Another concern with user-supplied Jacobians is their correctness. Checking correctness for every call is not practical… but we could probably run a check for a single time using similar techniques as done for MPI caching of data. An additional complication for the checking is that we can only check specific values rather than equations.
Maybe we should instead think about a way to supply the Jacobian on the C++ level in an easy way? Right now it is a bit tricky to do so. However, making this easy in the C++ level makes this feature available to people like @linas who really want this and one can assume that the given Jacobian is correct. In a next step this can be combined with code-generated Jacobian’s (so using R to generate the symbolic derivatives, for example).
C++ interface would be awesome. My suggestion is to provide a C++ hooks to the calculation of f (system of ODEs), Jy (jacobian on states), and Jtheta (jacobian on parameters). f is currently on AD but there is no reason to use AD when analytic jacobian is supplied. I also saw 3x speedups. On the other hand, just because of a single user, undertaking such job may not be warranted. There is a way to interface C++ already although it is hard to use (at least for me). With FDA issuing guidance on PBPK modeling https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM531207.pdf I expect boost in PBPK modeling in general and having Bayesian component would be awesome.
So, finally I found the time to implement a version of the coupled_ode_system which you can easily feed with the analytic Jacobians. This version cannot be made the default in stan-math since you get slow-downs in case you rely on AD. However, when you provide the Jacobian in analytic form then I am seeing a ~2x speedup for a oral 2-cmt ODE system (3 states, 4 parameters). The code is linked below and is part of an R script which auto-generates for simple algebraic ODEs the analytic Jacobian in full automation.
The timings which I am getting for an example with 30 subjects on a 6-core machine are:
serial, analytic: 167s serial, AD: 343s parallel, analytic: 48s parallel, AD: 99s
The relevant C++ code is in the
integrate_jacobian_analytic directory along with the R utilities which auto-generate the symbolic Jacobians and all required includes.
The automatically generated C++ files can be used with cmdstan as well as rstan (for that to work one needs to align the model name accordingly).
I hope this helps to get your, @linas, large models up and running in much less time.
Thanks a lot. The version I have provides 3x speedups but I did comparisons only with MPI. I also removed AD calculations even when calculating the states (I assume there is no need once analytic Jacobian is provided). I am sure your code will do even better. I am wondering if our efforts could be integrated with default stan - for example a switch when user wants to solve system of ODEs without using AD.
Running things with MPI will speed it up even further (maybe ~20%). Overall I find it really nice to see this 7x speedup which is about 2y of work :).
I agree with you that it would be nice to have the code I posted above in the core stan-math and then switch to it as needed. I don’t quite see how to do that elegantly… tagging @syclik who may have an idea of how this could be done (one way would be a compile-time switch or a contrib directory, but that is all not too elegant). Still… 2-3x speedup is something you would usually want when you are already suffering from ODE solving…
I think it would be fine to add something to Stan math to do this. I’d hope you can just plug-and-play the nested autodiff and analytic Jacobians here—that’s what we’re talking about, right?
As to improving the language, I think we’re going to need to wait for multivariate returns, because in general, when I have a function:
T0 f(T1 x1, ..., TN xN, ...);
for differentiable inputs
xN, what I need for output is:
- derivatives w.r.t.
n-th arg, which are of type
(T0 x Tn)whatever that means
This is simpler for the ODE case where the parameters and state variables are all flattened into arrays. We might be able to come up with something for that which packs things into a single output structure—the problem is that we don’t want to have to do it in two function calls (one for value and one for derivatives).
Probably the answer is yes. Including this would be great… in a bit more detail what I did:
- Refactored the rev stuff for
- This refactor introduces the functions
So when you use the refactored
coupled_ode_system then all you need to provide is a C++ function specializations for those two jacobian functions (the default implementation uses AD).
The crux is that the original
coupled_ode_system is quite a bit faster since it directly implemenents the Jacobian-adjoint vector product (and never stores the full Jacobian at all; it only stores a row of the Jacobian at a time). For this reason we cannot make the refactored
coupled_ode_system the default, since in the AD case the non-refactored version is faster. Whenever you now provide in analytic form the Jacobian, then the new version wins.
We could introduce a compile time switch which turns on the refactored version if requested. Thus, the user would need to request the refactored
coupled_ode_system version and provide specializations for the two jacobian functions to enjoy the speedup.
Does that sound doable? Maybe a good item to discuss at a meeting.
Yes, I came up to the same conclusion. If analytic jacobian is provided then there is no need for AD for the ODEs - AD is still required for everything else. If no analytic jacobian then just use AD throughout. It is more a switch what mode to use for ODEs vs. trully plug and play.
I am right now experimenting as to what is relevant for a new integrate call with Jacobian support. So I took the example above and tuned it a bit more. The new version avoids any additional copies and performs the matrix operations in-place and I discovered the
noalias command from Eigen to speed up matrix-matrix multiplication.
Using this I get (this time my AV scanner is off):
serial, analytic: 73s serial, AD: 252s parallel, analytic: 32s parallel, AD: 78s
Thus, I am getting now an amazing 3.5x speedup which I think is a major step forward (almost 8x with threading). I will update the github shortly with the code. It looks to me that we need to design any new facility along these lines, let’s see. If so, then we would need to introduce in Stan for the first time non-const by reference arguments to a user function if that is possible (and really needed, but I think so).
noalias call from Eigen seems to be important for good performance… have we discussed using this more often? See here: http://eigen.tuxfamily.org/dox/TopicWritingEfficientProductExpression.html )