I just used the first benchmark example I was able to grab. How close you can get to the theoretical speedups clearly depends on the specific model. Pulling out an ODE model (from here), which I modified to turn the `map_rect` parallelism off (code is attached below), gives me for 200 warmup + 200 iterations a run time of
- serial execution: 216s, 222s
- parallel execution: 162s, 162s
That’s a 34.7% speedup!
The huge argument for me is that users do not need to change a single line of Stan code in order to take advantage of this technique. As you can see, the heavier the computational burden of the gradient evaluation, the closer we end up to the theoretical speedups. So the more you suffer from long model run times, the more you get out of this, provided you can afford to double the resources used.
Sure, I see the point here, of course. And we absolutely need to balance the maintenance burden against the utility. Right now we already have a number of samplers in our codebase, which raises the question: why not one more?
However, one strategy could be to simply rewrite the existing sampler so that it always uses the dependency flow graph from the TBB. Not too many things would change then (I think). Right now the sampler code is divided into the `transition` and `build_tree` functions. As I have written it right now there are almost no changes to the `build_tree` function, and the `transition` method is changed from a while loop to setting up the dependency flow graph.
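To illustrate the kind of change I mean, here is a minimal, hypothetical sketch of the pattern (not the actual sampler code): the pieces of work become `continue_node`s in a `tbb::flow::graph`, their ordering is expressed with `make_edge`, and the TBB scheduler takes care of running independent nodes concurrently. The node names (`expand_forward`, `expand_backward`, `combine`) and their bodies are made up for illustration only.

```cpp
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
  using namespace tbb::flow;
  graph g;

  // Each piece of work becomes a node; the lambda is the work itself.
  // (Placeholder bodies only -- the real work would be tree expansions.)
  continue_node<continue_msg> expand_forward(g, [](const continue_msg&) {
    std::cout << "expand tree forward\n";
    return continue_msg();
  });
  continue_node<continue_msg> expand_backward(g, [](const continue_msg&) {
    std::cout << "expand tree backward\n";
    return continue_msg();
  });
  continue_node<continue_msg> combine(g, [](const continue_msg&) {
    std::cout << "combine sub-trees and sample\n";
    return continue_msg();
  });

  // Dependencies instead of a while loop: combine fires once both
  // expansions are done; the two expansions may run on different threads.
  make_edge(expand_forward, combine);
  make_edge(expand_backward, combine);

  expand_forward.try_put(continue_msg());
  expand_backward.try_put(continue_msg());
  g.wait_for_all();
  return 0;
}
```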
With my current prototype I was testing whether one can match the serial execution performance by wiring up the flow graph in a serial way. Serial execution with the Poisson example then is
- with the vanilla 2.20 code: 83s, 84s, 84s.
- with the flow graph wired up serially: 88s, 87s
So you see that the performance of the serial vanilla version is within reach, I think. Thus, it is potentially an option to have just one sampler code written which is refactored to use a flow graph.
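For reference, what I mean by serial wiring is simply a chain: each node's only predecessor is the previous node, so the graph degenerates into a sequence and nothing can overlap. Again a hypothetical sketch, not the prototype code:

```cpp
#include <tbb/flow_graph.h>
#include <iostream>

int main() {
  using namespace tbb::flow;
  graph g;

  // Same flow-graph machinery, but wired as a chain: each node depends
  // only on the previous one, which forces sequential execution.
  continue_node<continue_msg> step1(g, [](const continue_msg&) {
    std::cout << "doubling step 1\n";
    return continue_msg();
  });
  continue_node<continue_msg> step2(g, [](const continue_msg&) {
    std::cout << "doubling step 2\n";
    return continue_msg();
  });
  continue_node<continue_msg> step3(g, [](const continue_msg&) {
    std::cout << "doubling step 3\n";
    return continue_msg();
  });

  make_edge(step1, step2);  // step2 waits for step1
  make_edge(step2, step3);  // step3 waits for step2

  step1.try_put(continue_msg());
  g.wait_for_all();
  return 0;
}
```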
The code is not yet up on GitHub. It’s a bit too rough at the moment. I will think about a refactor of the code and upload that.
What would be more helpful at the moment is to agree about the compromises we want to make.
Options are:
- don’t explore further as it’s not worth the trouble
- keep fully separate implementations, so that the serial sampler is not burdened with additional code complexity or any performance sacrifices
- work out a single sampler which uses a flow graph as its way of working (you really don't need to bother with threading if we use the TBB; the dependency flow graph fully abstracts away the threading bit).
This is not about personal things, to be clear. In my opinion a 33% speedup on ODE problems (or any other heavy problem) is fantastic given that it costs the Stan modeler zero time; all that is needed is a large cluster with lots of resources. This is my opinion - I would hope others - and in particular the sampler gurus here - come to similar conclusions.
Now, you have probably noticed that contributing to stan-math has lately felt like running against a wall for me. Thus, I am exploring how the waters are in stan; but that should absolutely not factor into how we go about this.
I think we should first settle whether we want to continue down this path at all, and then consider what it takes to decide between a separate sampler and a unified sampler. At least this is my view.
Modified Stan code of the warfarin ODE example without `map_rect`: warfarin_ode.stan (6.8 KB)
EDIT: added one more run per setting for the ODE case