Parallel dynamic HMC merits

Thanks, Sebastian, for the simulation, which confirms our intuitions about the potential speed ups. I would, however, like to point out that the summaries make it clear that the distribution of speed ups is quite skewed and grows more skewed for larger tree depths (which isn’t surprising given the logarithmic expansion). Consequently, when speculating about practical speed ups, one has to consider how this skewed distribution convolves with the distribution of tree depths in a given fit.
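
As a rough sketch of that last point (the notation here is just illustrative): if $\pi(d)$ is the fraction of iterations in a fit that reach tree depth $d$, $t(d)$ is the serial cost of a depth-$d$ trajectory, and $s(d)$ is the per-trajectory speed up at depth $d$, then the realized speed up for the whole fit is roughly

$$
\frac{\sum_d \pi(d)\, t(d)}{\sum_d \pi(d)\, t(d) / s(d)},
$$

which will be modest unless the fit spends most of its computation at the large tree depths where $s(d)$ is large.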

In any case, a decision here is complicated by the tradeoffs involved.

Pros: The speed up is universally positive and can’t slow down the chains (presuming no overhead in the TBB execution). Consequently there is a performance benefit for those users who have excess CPUs available. That caveat makes quantifying the overall benefit tricky: in my experience there are many more users with few CPUs than with many, so this benefit would be limited to a relatively small subset of the community.

Cons: The overall speed ups are small. It requires modifying the sampler code (a not unreasonable task, but not free). There is a maintenance burden for code capable of switching between parallel and serial execution, and a maintenance burden for integrating the TBB into Stan core (this would not be a con if the TBB were already being used in Stan core, but given that it’s not yet integrated it adds overhead to the contribution).

I think my biggest worry, however, is that the inefficiency of the speculative computation would cannibalize TBB resources from other parallelized code. Even if we presume that the TBB is able to allocate resources to contending parallelized functions without any overhead, the wasted computation of this proposal would consume resources that other functions would not be able to access.

Weighing these pros and cons is further complicated by having to speculate about, and prioritize within, the makeup of the overall user community. I think it would benefit all of us if we attempted to consider users unlike ourselves when making these decisions.

For context, there are multiple sampler features (like saving the entire Hamiltonian trajectory at each iteration) that offer smallish but uniform speed ups which we do not currently implement because of memory or maintenance burdens (saving trajectories quickly blows out memory and requires completely rewriting the sample analysis code).

Putting this all together, my opinion as current algorithm lead is that the potential speed ups here are not sufficient to warrant prioritizing a feature like this at the moment. This may change in the future as the TBB is more fully integrated into Stan and we have a better idea of how well it can manage multiple sources of parallelization. Personally I think a much more constructive use of the TBB in Stan core right now is to run multiple chains in their own threads, working out how to share a common data source (so the data doesn’t have to be duplicated for each chain, a huge potential speed up for big models) and improving the adaptation to take advantage of sharing samples early on in warmup.
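
To make that last idea concrete, here is a minimal sketch of what running each chain in its own TBB task over shared, read-only data could look like. The `SharedData`, `ChainOutput`, and `run_chain` names are hypothetical placeholders, not the actual Stan services API:

```cpp
// Minimal sketch, not the Stan API: SharedData, ChainOutput, and
// run_chain are hypothetical stand-ins for the real services layer.
#include <tbb/parallel_for.h>
#include <vector>

struct SharedData { /* read-only model data, constructed once */ };
struct ChainOutput { /* draws and diagnostics for one chain */ };

// Stand-in for a single-chain sampler driver.
ChainOutput run_chain(const SharedData& data, unsigned int seed) {
  return ChainOutput{};
}

std::vector<ChainOutput> run_chains(const SharedData& data, int n_chains) {
  std::vector<ChainOutput> out(n_chains);
  // Each chain runs in its own TBB task; every task reads the same
  // const SharedData, so the data lives in memory exactly once
  // instead of being duplicated per chain or per process.
  tbb::parallel_for(0, n_chains, [&](int chain) {
    out[chain] = run_chain(data, 1000u + static_cast<unsigned int>(chain));
  });
  return out;
}
```

The same task structure would also give the chains a natural place to exchange information (e.g. early warmup draws for shared adaptation), since they all live in one process.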
