I had a busy time recently as I was hoping to get a parallel ODE Stan version up and running on a real project quickly to see what it really buys me on a real-world problem (yes, those PKPD problems are just perfect for a “real-world” aka complicated benchmark).
I did take a model I quite care about and sub-sampled 160 units from the data set. The problem was run on 1 to 16 cores on our cluster. While the scaling with CPUs is not anymore linear, the practical difference of 3.3h for one core compared to just 20 minutes running time on 16 cores is huge.
Ok, now its on me start that design wiki page, but I thought sharing this news is worthwhile. Ah, and yes, I checked results for correctness, of course; all good.
BTW, I found out that I have to put together the AD variables in the same input order. By now I let the parallel for loop work asynchronously and simply store the double only result. Once all integrations are done, I do the decoupling operation in a single sweep. An earlier version did the decoupling operation asynchronous (but only one at a time); this did give me not exactly reproducible results which was not intuitive to me. I have updated the prototype branch on stan-math which shows how I do this by now, if you are interested.
This is really great. I like your solution to reproducibility, which I assume is still required for Novartis. That merging back into the expression graph is going to have to be serialized anyway. There’s some overhead in that now you can’t do any work on that until all the results are in, which can limit the pipeline considerably if the merging work takes any time at all vs. the solving, which is unlikely to be the case in ODEs (relatively expensive computation with very few parameters).
A bigger project would be a purely parallel autodiff. Autodiff requires us to propagate derivatives down from every parent node before passing derivatives down from the child nodes. But that leaves an extraordinary latitude for parallelization. I believe this is what TensorFlow is trying to do.
… the best thing about this is in fact that everything is fully automated. That is, the ODE is defined in the Stan model itself as a special comment. My stan_ode_model R function parses the ODE automatically, generates symbolically the Jacobians and writes these to disk. The returned Stan model from that function simply uses the code inclusion functionality and can run with a parallelized ODE solver giving massive speedups (I have it working for bdf and RK45). My code is research code at this stage and I did a few shortcuts like only allowing a single ODE per Stan model, but what is there is sufficient to learn quickly what matters (and one can use it already).
And yes, for Novartis in particular, and in general I think that things should stay exactly reproducible under the usual conditions; parallelism should not change that. What I have done is a simple-minded solution and can for sure be improved by some clever techniques which would make the serial part of building up the AD stack part of the parallel part of the program. I was shooting for the low-hanging fruit and it apparently worked out.
Reproducibility will be impossible for asynchronous algorithms like the approach to EP that Andrew’s been working on. It works on data in parallel and synchronization for reproducibility would really slow it down. But presumably it’ll be OK if there are inference techniques that are “off limits” to those that require exact reproducibility.
If a statistical procedure is not exactly reproducible when fixing data and seeds, then this is really bad news for any application in Pharma and probably many other domains as well.
I have been tuning the non-stiff ODE solver a bit and was able to speedup things by a whopping 2x vs what I had before. Essentially I am avoiding evaluating the ode rhs twice and no copying is ever performed. The attached results do demonstrate the speedup with respect to
a) the vanilla Stan reference which uses AD
b) the new code which can be run with multiple cores, but I only use a single core
Long story short: Case (a) takes 3.5h vs 1.88h with the tuned code on a single CPU. When throwing 16 cores on this, then I get the runtime down to 0.16h (10 minutes) or a 21.5x speedup!
I will apply these techniques to the stiff solver and then we will be ready for a decent ODE refactor.
Sounds great. That’s a two-times speedup for our existing code?
Is case (b) for solving multiple ODEs (such as for multiple patients) or parallelizing a single ODE solver? If it’s the former, we can think about either
a function call that could be parallelized from within the Stan language, or
allowing OpenMP-like parallelization statements in Stan code through to the C++.
The second seems more general. I don’t know how it works, though.
Yes, our existing code can be made 2x faster, I think. Currently the RK45 integrator calls the () operator of the coupled ode system. This () operator then executes twice the ODE RHS functor (one time to get the rhs itself and one time to get the Jacobian). The first call to the ODE RHS can be saved as the ODE RHS gets anyway evaluated when the Jacobian is being calculated. I think this is a very small change to the system and it is worthwhile to include it ASAP. I will prepare an issue and a pull request for this soon.
The speedups in my PDF are for solving multiple ODEs (same ODE, but different parameters such as different patients). We should definitely discuss how we bring this feature to users! From the two options you list, I think
A new integrate_ode_parallel call would be straightforward to add to the Stan language (provided we solve the problem that the ODE RHS Jacobian can be given in analytic form). We would endup providing a few functions to the user in a parallel version which would be highly optimized for openMP execution.
allowing OpenMP statements in Stan itself is probably doable. This would require that we mark all sections in the Stan codebase with “critical” and/or “master” tags. These would ensure that these codeblocks would never be run in parallel and only inside the master thread. This is probably hard to do, but maybe possible - i don’t know. The resulting parallelization would by far not be as efficient as option (1) on a first shot, but more general and once we figure out how to make things thread-safe (AD!) certainly be the better option in the long run.
The first call to the ODE RHS can be saved as the ODE RHS gets anyway evaluated when the Jacobian is being calculated.
This can be applied to a lot of problems. We’ve done it to some degree with the algebra_solver, though I think we could do it “even more”. A lot of jacobians get evaluated. I think there are one or two redundancies I can get rid of if I use a more object oriented approach.
This will speedup non-stiff ODE integration whenever the ODE RHS is costly to evaluate. I am getting a 50% speed increase on an example involving dosing (3.52h vs 2.38h). I think anyone can review this as the changes simply avoid duplicate evaluation of the ODE RHS. Having ODE knowledge is not really needed as far as I can tell (but won’t hurt, of course).
Costly ODE RHS are very often encountered whenever a forcing function is involved. In PK/PD models this happens when we include dosing or have a fixed model for the concentration of a drug in the blood and then solve an ODE to describe what the drug effect is.