Within-chain parallelization idea (maybe crazy)

The recent discussion about setting expectations for reduce_sum got me thinking. Is there any chance that a more consistent speedup is available by computing the forward and backward parts of the HMC trajectory in parallel, so that whenever the tree expansion proceeds in the direction where the trajectory is so-far shorter, the work has already been done? On the negative side, this approach would throw out a non-trivial amount of integration in one direction or the other at every iteration. Nevertheless, this approach might yield worthwhile speedup, provided that the information necessary to evaluate the stopping criterion can be efficiently copied/shared between processes.
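To make the idea concrete, here is a toy sketch (not Stan's actual implementation; `integrate` is a hypothetical stand-in for a run of leapfrog steps) of speculatively launching both trajectory directions at each tree doubling, then keeping only the direction the sampler actually picks:

```python
import concurrent.futures as cf
import random

def integrate(state, direction, n_steps):
    # Stand-in for n_steps leapfrog steps from `state` in `direction`
    # (+1 forward, -1 backward). A real version would return the new
    # phase-space endpoint plus what the stopping criterion needs.
    return state + direction * n_steps

def speculative_doubling(left, right, depth, rng, pool):
    """One NUTS-style doubling: launch BOTH directions speculatively,
    then keep only the one the sampler chose; the other is discarded
    (this is the wasted integration mentioned above)."""
    n_steps = 2 ** depth
    fwd = pool.submit(integrate, right, +1, n_steps)
    bwd = pool.submit(integrate, left, -1, n_steps)
    if rng.random() < 0.5:        # sampler picks forward
        right = fwd.result()      # the forward work is already done
        bwd.cancel()              # cancel() is best-effort; running
    else:                         # backward work is simply thrown away
        left = bwd.result()
        fwd.cancel()
    return left, right

rng = random.Random(0)
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    left, right = 0.0, 0.0
    for depth in range(4):
        left, right = speculative_doubling(left, right, depth, rng, pool)
    print(left, right)
```

With two workers, the chosen direction's result is ready (or in flight) by the time it is needed, at the cost of always computing the other direction too.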

Provided that the stopping criterion can be evaluated efficiently, it seems like there would be lots of cases where this approach yields useful speedup even though reduce_sum cannot (e.g. if the gradient evals are fast but the treedepths are large). And it seems like it could be implemented entirely on the backend, with no changes required to a user’s Stan code. And with more cores available it could be combined with reduce_sum for additional speedup.

Just a thought. Has this already been discussed and ruled out? Is it worth thinking about further?


@betanalpha @nhuurre


Tagging @wds15 who did this a while ago! To my knowledge it was something like a 15% speedup?


FWIW there are scenarios where I would love to get my hands on 15% speed-up if I can do it without modification to my Stan code.


The conclusion was that one would get about a 35% speedup when doubling the required number of CPUs. The user would not need to change anything in the Stan model, and the optimization would apply to any Stan model.

While I personally thought this would be a good thing, the final decision was not to implement it, given the increased code complexity and the rather small speed gain relative to the resource cost of doubling the number of CPU cores.

Maybe this is worth reassessing given that we are getting more and more CPU cores even on laptops? To me the key argument for this optimization is that it applies to all models and nothing needs to be changed (other than throwing in more cores).

Edit: there is a fully working implementation of this (dirty)


I strongly suspect that this efficiency gain per additional resource is on par with or exceeds quite a few applications of reduce_sum even for the first doubling of resources. But importantly, the relevant question isn’t whether 35% gain is competitive with reduce_sum for the first doubling of resources, but rather whether it’s competitive with reduce_sum for the last doubling. If I have 8 cores per chain to play with, and reduce_sum gives me some gain at 4 cores per chain, then the question is how often I would expect reduce_sum to be competitive with a 35% gain associated with increasing from 4 to 8 cores per chain. In my own work, I have literally never seen a 35% gain from 4 to 8 with reduce_sum, let alone from 8 to 16 or beyond.
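The "last doubling" framing can be made quantitative with a toy Amdahl's-law model (all numbers illustrative; `p` is an assumed parallelizable fraction of the gradient evaluation, not a measured one):

```python
def amdahl(p, n):
    # Amdahl's-law speedup for parallel fraction p on n cores.
    return 1.0 / ((1.0 - p) + p / n)

# Marginal gain from doubling 4 -> 8 cores per chain, for several
# assumed parallelizable fractions of the work:
for p in (0.7, 0.8, 0.9, 0.95):
    gain = amdahl(p, 8) / amdahl(p, 4) - 1.0
    print(f"p={p:.2f}: 4->8 cores gives {gain:.0%} extra speedup")
```

Under this model the 4-to-8 doubling beats a flat 35% only when more than roughly 80% of the work parallelizes cleanly, which is consistent with the experience described above.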

I could try to bring the prototype up to date with the current 2.28 and then you try a few models? Maybe having more convincing power-user examples makes it worthwhile to reconsider this and go for a community vote? Back then we did not hold a vote on this because that process did not yet exist, if I recall correctly.

… but help from others in making progress would be really welcome (@nhuurre ?)…


In my own work, I have literally never seen a 35% gain from 4 to 8 with reduce_sum, let alone from 8 to 16 or beyond.

This is actually to be expected. Likelihood parallelization only starts to pay off when (1) the likelihood is a product over a huge number of data points and (2) the Hamiltonian trajectory is rather short. “Huge” deserves emphasis because evaluating the individual data points is often really cheap, so the constant cost of parallelizing anything dominates in most of the models people will want to run on laptops. Amdahl’s law kicks in hard for these models.
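A toy cost model makes the overhead point concrete (all costs are made-up illustrative numbers, not measurements of reduce_sum):

```python
def chunked_time(n_terms, cost_per_term, n_chunks, overhead_per_chunk):
    # Wall time when a likelihood sum is split into n_chunks parallel
    # chunks, each paying a fixed scheduling/copying overhead.
    # Chunks run concurrently, so wall time is roughly one chunk's time.
    return (n_terms / n_chunks) * cost_per_term + overhead_per_chunk

# Cheap terms (1e-8 s each), fixed 1e-4 s overhead per parallel chunk:
serial = 1_000 * 1e-8                        # 10 microseconds total
parallel = chunked_time(1_000, 1e-8, 4, 1e-4)
print(serial, parallel)                      # parallel is SLOWER here

# A genuinely huge, expensive likelihood is a different story:
serial_big = 1_000_000 * 1e-6                # 1 second total
parallel_big = chunked_time(1_000_000, 1e-6, 4, 1e-4)
print(serial_big / parallel_big)             # close to the ideal 4x
```

With cheap terms the fixed per-chunk overhead swamps the work being parallelized, so four cores produce a slowdown; only the huge, expensive likelihood approaches the ideal speedup.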

At the moment, there is no generic HMC-compatible solution that achieves decent speedup through parallelization for arbitrary models.