The recent discussion about setting expectations for `reduce_sum` got me thinking. Is there any chance that a more consistent speedup is available by computing the forward and backward parts of the HMC trajectory in parallel, so that whenever the tree expansion proceeds in the direction where the trajectory is so far shorter, the work has already been done? On the negative side, this approach would throw out a non-trivial amount of integration in one direction or the other at every iteration. Nevertheless, it might yield a worthwhile speedup, provided that the information necessary to evaluate the stopping criterion can be efficiently copied/shared between processes.

Provided that the stopping criterion can be evaluated efficiently, it seems like there would be lots of cases where this approach yields a useful speedup even where `reduce_sum` cannot (i.e., when the gradient evaluations are fast but the tree depths are large). And it seems like it could be implemented entirely on the backend, with no changes required to a user's Stan code. And with more cores available, it could be combined with `reduce_sum` for additional speedup.
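To make the idea concrete, here's a toy sketch (not Stan's actual NUTS implementation; all names like `leapfrog` and `speculative_doubling` are made up for illustration, and the real stopping criterion, multinomial sampling, and divergence checks are omitted). At each doubling, both ends of the trajectory are extended speculatively on two workers before the direction coin flip is resolved, and the result for the unchosen direction is simply discarded:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def leapfrog(q, p, eps, n_steps):
    """Toy leapfrog integrator for H = (q^2 + p^2) / 2."""
    for _ in range(n_steps):
        p -= 0.5 * eps * q
        q += eps * p
        p -= 0.5 * eps * q
    return q, p

def speculative_doubling(q0, p0, eps, max_depth, rng):
    """At each doubling, speculatively extend BOTH ends of the
    trajectory in parallel, then keep only the end selected by the
    direction coin flip. The other subtree's integration is wasted."""
    fwd = (q0, p0)    # state at the forward end of the trajectory
    bwd = (q0, -p0)   # state at the backward end (momentum flipped)
    with ThreadPoolExecutor(max_workers=2) as pool:
        for depth in range(max_depth):
            n = 2 ** depth  # subtree size doubles each iteration
            f_fut = pool.submit(leapfrog, *fwd, eps, n)
            b_fut = pool.submit(leapfrog, *bwd, eps, n)
            if rng.random() < 0.5:     # direction chosen after work began
                fwd = f_fut.result()   # keep forward; backward work discarded
            else:
                bwd = b_fut.result()   # keep backward; forward work discarded
    return fwd, bwd
```

In expectation this wastes roughly half the gradient evaluations, but the wall-clock time per doubling is that of a single subtree rather than the sum over chosen subtrees, which is where the speedup would come from when tree depths are large.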

Just a thought. Has this already been discussed and ruled out? Is it worth thinking about further?