Parallel dynamic HMC merits

Hi there!

I have been thinking about parallelizing our dynamic HMC sampler. The obvious strategy to do that is to

  1. pre-sample all the forward and backward turns up to the maximal treedepth
  2. then expand the forward and backward trajectories in full independence
  3. while all checks still happen in the same order as before

This scheme works best if we have exactly alternating turns, of course. In practice things are not that ideal, so after discussing with @betanalpha we thought it best to first assess the merits of this approach using a simulation. The simulation code I wrote is attached if you want the details. The average speedup, defined as serial runtime divided by parallel runtime, is then by treedepth

    treedepth  mean  low_5%  median  high_95%
            1  1     1       1       1
            2  1.25  1       1.5     1.5
            3  1.33  1       1.4     1.75
            4  1.36  1       1.36    1.88
            5  1.37  1       1.35    1.94
            6  1.38  1.02    1.34    1.91
            7  1.38  1.02    1.32    1.90
            8  1.38  1.02    1.34    1.90
            9  1.39  1.02    1.33    1.91
           10  1.39  1.03    1.33    1.91

So on average we get, for a treedepth >3, about a ~35% speedup. That’s a bit below what I was hoping for, honestly, but this maximal speedup should apply to just about any non-trivial Stan model and the user would not need to change a single line of Stan code. In terms of efficiency this is not too great, as the resources used are doubled to get the walltime down from 1 unit to 1/1.35 = 0.74 units.
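The actual simulation code is in the attached R file; as a rough illustration, here is my reconstruction of the underlying model in Python (all function names are hypothetical, not from the attached code). Each doubling adds a subtree of 2^k leapfrog steps in a random direction; serially all steps run one after another, while speculatively the forward and backward totals run concurrently:

```python
import random

def one_speedup(treedepth, rng):
    """Simulate one dynamic-HMC trajectory build of the given treedepth.

    Doubling k adds a subtree of 2**k leapfrog steps, going forward or
    backward with equal probability. Serially all steps run one after
    another; in the speculative scheme the forward and backward
    trajectories expand concurrently, so walltime is the larger total.
    """
    fwd = bck = 0
    for k in range(treedepth):
        if rng.random() < 0.5:
            fwd += 2 ** k
        else:
            bck += 2 ** k
    serial = fwd + bck            # 2**treedepth - 1 leapfrog steps
    parallel = max(fwd, bck)      # both sides run at the same time
    return serial / parallel

def mean_speedup(treedepth, reps=20000, seed=0):
    # Monte Carlo average of the speedup at a fixed treedepth.
    rng = random.Random(seed)
    return sum(one_speedup(treedepth, rng) for _ in range(reps)) / reps
```

Under this toy model the speedup is bounded by 2 (the slower side always carries at least half the steps), and the means come out close to the table above (roughly 1.25 at treedepth 2 and 1.36 at treedepth 4).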

The prototype I have coded up already shows promising results. Using the hierarchical Poisson model I used for the TBB benchmarks (here with 500 groups and 10 observations per group and a Poisson log-likelihood) I am getting a speedup of

## runtime in seconds
mean(c(83, 80)) / mean(c(66, 68)) = 1.22

The treedepth of the sampler is 6, so of the theoretical 38% speedup we get 22% in practice. Right now the prototype lacks exact reproducibility since the random number generator is accessed in a non-deterministic way (this can be fixed with some caching), but overall the results are very close (though it’s odd to see slightly different stepsizes, which makes the comparison not ideal):

## serial summary output
Warmup took (48) seconds, 48 seconds total
Sampling took (35) seconds, 35 seconds total

                     Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__             -1.2e+04  3.7e+00  2.3e+01  -1.2e+04  -1.2e+04  -1.2e+04     39      1.1  1.0e+00
accept_stat__     9.3e-01  2.5e-03  8.4e-02   7.4e-01   9.6e-01   1.0e+00   1131       32  1.0e+00
stepsize__        5.0e-02  1.7e-16  1.7e-16   5.0e-02   5.0e-02   5.0e-02    1.0    0.029  1.0e+00
treedepth__       6.0e+00  2.0e-16  4.4e-15   6.0e+00   6.0e+00   6.0e+00    500       14  1.0e+00
n_leapfrog__      6.3e+01  6.4e-02  2.0e+00   6.3e+01   6.3e+01   6.3e+01   1004       29  1.0e+00
divergent__       0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00    500       14      nan
energy__          1.2e+04  3.6e+00  2.8e+01   1.2e+04   1.2e+04   1.2e+04     58      1.7  1.0e+00
log_lambda        1.6e+00  7.7e-03  4.4e-02   1.5e+00   1.6e+00   1.7e+00     32     0.93  1.0e+00
tau               1.2e+00  7.8e-03  4.0e-02   1.1e+00   1.2e+00   1.2e+00     26     0.75  1.0e+00
eta[1]           -1.4e+00  1.1e-02  2.6e-01  -1.8e+00  -1.3e+00  -9.3e-01    606       17  1.0e+00
eta[2]            6.1e-01  7.5e-03  1.0e-01   4.4e-01   6.1e-01   7.8e-01    177      5.0  1.0e+00

## parallel summary output (note that cmdstan reports total CPU time, not walltime)
Warmup took (72) seconds, 1.2 minutes total
Sampling took (56) seconds, 56 seconds total

                     Mean     MCSE   StdDev        5%       50%       95%  N_Eff  N_Eff/s    R_hat
lp__             -1.2e+04  3.4e+00  2.3e+01  -1.2e+04  -1.2e+04  -1.2e+04     44     0.78  1.0e+00
accept_stat__     9.3e-01  2.3e-03  8.2e-02   7.6e-01   9.7e-01   1.0e+00   1234       22  1.0e+00
stepsize__        6.7e-02  1.9e-16  1.9e-16   6.7e-02   6.7e-02   6.7e-02    1.0    0.018  1.0e+00
treedepth__       5.0e+00  8.3e-16  1.9e-14   5.0e+00   5.0e+00   5.0e+00    500      8.9  1.0e+00
n_leapfrog__      6.3e+01  1.1e-14  2.5e-13   6.3e+01   6.3e+01   6.3e+01    500      8.9  1.0e+00
divergent__       0.0e+00  0.0e+00  0.0e+00   0.0e+00   0.0e+00   0.0e+00    500      8.9      nan
energy__          1.2e+04  3.6e+00  2.8e+01   1.2e+04   1.2e+04   1.2e+04     61      1.1  1.0e+00
log_lambda        1.6e+00  5.8e-03  4.2e-02   1.6e+00   1.6e+00   1.7e+00     54     0.96  1.0e+00
tau               1.2e+00  6.7e-03  3.7e-02   1.1e+00   1.2e+00   1.2e+00     31     0.55  1.0e+00
eta[1]           -1.4e+00  8.9e-03  2.5e-01  -1.8e+00  -1.3e+00  -9.8e-01    760       14  1.0e+00
eta[2]            6.1e-01  6.2e-03  9.2e-02   4.6e-01   6.1e-01   7.6e-01    219      3.9  1.0e+00

The prototype currently runs on the basis of the released 2.20 dynamic HMC (so without the newest changes to the sampler) - but these results look promising to me. The complexity of the parallelization is manageable given we use the TBB, which lets me nicely abstract away the parallelization through a dependency flow graph, see here. Thus, with some (hopefully) light refactoring of the existing code base we should be able to add this feature in a way which minimizes maintenance… and this stuff runs without the need to refactor the existing AD; it only requires the uptake of the Intel TBB so that we can use a dependency flow graph which is automatically parallelized by the TBB.

Below is also a normalized density histogram of the speedup distribution at each treedepth.


parallel-dynamic_hmc-maxspeedup.R (1.5 KB)


This looks neat, but I’m not convinced. I would like to be able to run my models 20% faster, yes. The inefficiency bothers me some, but the big thing is I am against adding threaded code to our already complicated samplers.

It’s a cool idea and it clearly works, but how pissed would you be if this doesn’t go in :D? I’d like to discuss this at the meeting. Is the code up on Github somewhere?

I just used the first benchmark example I was able to grab. How close you can get to the theoretical speedups clearly depends on the specific model. Pulling out an ODE model (from here), which I modified to turn the map_rect parallelism off (code is attached below), gives me for 200 warmup + 200 sampling iterations a run time of

  • serial execution: 216s, 222s
  • parallel execution: 162s, 162s

That’s a 34.7% speedup!

The huge argument for me is that users do not need to change a single line of Stan code in order to take advantage of this technique. As you see, the heavier the computational burden of getting the gradient, the closer we end up to the theoretical speedups. So the more you suffer from your model run times, the more you get out of this - if you can afford to double the resources used.

Sure, I see the point here, of course. And we absolutely need to balance the maintenance burden vs the utility. Right now we already have a number of samplers in our codebase, which raises the question: why not one more?

However, it could be a strategy to simply rewrite the existing sampler so that it always uses the dependency flow graph from the TBB. There are not too many things which would then change (I think). Right now the sampler code is divided into the transition and build_tree functions. As I have written it right now there are almost no changes to the build_tree function, and the transition method is changed from a while loop to setting up the dependency flow graph.
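The actual prototype wires this up with the TBB flow graph in C++; as a toy, language-agnostic illustration of the control flow (all names here are hypothetical, not taken from the prototype), here is a Python sketch in which the pre-sampled doublings are launched speculatively on a two-worker pool while the termination checks consume the results in the original serial order:

```python
from concurrent.futures import ThreadPoolExecutor

def expand_subtree(direction, depth):
    # Stand-in for the leapfrog integration of one doubling; in the real
    # sampler this is where the expensive gradient evaluations happen.
    # (Successive doublings in the same direction actually depend on each
    # other's endpoints; that chaining is elided in this toy.)
    return {"dir": direction, "steps": 2 ** depth}

def trajectory_terminates(trajectory):
    # Dummy termination criterion for the sketch: stop after 4 doublings.
    # The real sampler runs the usual U-turn/divergence checks here.
    return len(trajectory) >= 4

def speculative_transition(directions):
    # `directions` is the pre-sampled sequence of doubling directions
    # (step 1 of the scheme), e.g. ["fwd", "bck", "fwd", ...].
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Step 2: launch all expansions speculatively; forward and
        # backward work can proceed independently of each other.
        futures = [pool.submit(expand_subtree, d, k)
                   for k, d in enumerate(directions)]
        trajectory = []
        # Step 3: consume results in the original serial order, so the
        # checks see exactly the sequence the serial sampler would see.
        for fut in futures:
            trajectory.append(fut.result())
            if trajectory_terminates(trajectory):
                break  # later speculative subtrees are wasted work
    return trajectory
```

Any subtrees still in flight after the break are the "doubled resources" cost of the speculation; in the TBB version the same roles are played by graph nodes and edges instead of futures.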

With my current prototype I tried out whether one can match the serial execution performance by wiring up the flow graph in a serial way. Serial execution with the Poisson example then is

  • with the vanilla 2.20 code: 83s, 84s, 84s.
  • with the flow graph wired up serially: 88s, 87s

So you see that the performance of the serial vanilla version is within reach, I think. Thus, it is potentially an option to have just one sampler codebase, refactored to use a flow graph.

The code is not yet up on GitHub. It’s a bit too rough at the moment. I will think about a refactor of the code and upload that.

What would be more helpful at the moment is to agree about the compromises we want to make.

Options are:

  • don’t explore further as it’s not worth the trouble
  • fully separate implementations to not burden serial performance in terms of code complexity and any performance sacrifices
  • work out a single sampler which uses a flow graph as its way of working (you really don’t need to bother with threading if we use the TBB; the dependency flow graph fully abstracts away the threading bit).

This is not about personal things, to be clear. In my opinion a 33% speedup on ODE problems (or any other heavy problem) is fantastic, given that zero time is needed from the Stan modeler to get this speedup; all that is needed is a large cluster with lots of resources. This is my opinion - I would hope others - and in particular the sampler gurus here - come to similar conclusions.

Now, you have probably noticed that contributing to stan-math has lately felt like running against a wall for me. Thus, I am exploring how the waters are in stan; but that is absolutely nothing to consider in how we go about this.

I think we should first settle whether we will continue with this idea or not, and then consider what is needed to decide between a separate sampler and a unified sampler. At least this is my view.

modified Stan code of warfarin ODE example without map_rect: warfarin_ode.stan (6.8 KB)

EDIT: added one more run per setting for the ODE case


The usual—time to build it, review it, doc it, and maintain it.

We only really have two samplers that anyone uses—diagonal and dense NUTS with adaptation. And the code’s largely shared. I’d be happy getting rid of all the other samplers.

The obstacle’s going to be the requirements it puts on other bits of code and whether the speedup’s worth the effort. I just don’t know how many people have your computing resources. I typically run models on my notebook with four cores. I might run on a ten-core machine soon, but then I’d still only have two cores per Stan process.

@andrewgelman may be the most sympathetic, as he wants to get speedups any way he can. But I don’t know what he’ll think about this emphasis on getting a single chain to run fast, as most of us don’t have the computational power to throw multiple threads at multiple chains.

I run 4 chains on my laptop. But if there are ways to run big problems on a cluster using lots of processors, that would be great. If there were a way to do this on an Amazon server and I could pay a few cents with my credit card to run a model faster, I might well do that.

As Bob says, I wouldn’t want to run just 1 chain. But I assume that if the computations could be set up on 100 processors for 1 chain, that either they could be set up on 400 processors for 4 chains, or else multiple chains could be run sequentially.

For one, we can attempt to refactor the code to accommodate both schemes easily - that would be the best thing if that’s possible.

I would strongly argue that anyone can use this technique. Suppose you are doing your model development right now using 4 chains on your 4-core laptop. Now assume you have a model that takes annoyingly long to fit, but you want to iterate on the model development. In that case, you can sacrifice 2 chains during development and instead use 2 cores per chain. To me that would be an acceptable compromise - and any Stan user can do that.

@andrewgelman What I am proposing here will allow you to run 10 models in the same amount of time it takes to run 7.5 models (or 20 vs 15) right now. The cost of doing so is using 2 cores per chain instead. So the bill from Amazon would double, but you get more models fitted in less time. And at the same time you would not have to change a single line of Stan code.

This super-user-friendliness here is - to me - the killer argument. Not many people use map_rect, because it is hard to use in complicated Stan models. Doubling the number of cores used is easily doable (for example, by sacrificing 2 chains out of the 4 we run).

This world is usually short of good human resources, not CPUs - that’s what I think.



That sounds good. Especially if I have 8 cores on my laptop and then can run 4 chains using 2 processors each. You’re right that it would be a plus to not have to alter Stan code.

I’m curious what Jonah, Ben G., and Paul think about this, because I guess it could be implemented in rstanarm and brms?

How would this model interact with lower-level parallelization? I.e., if we use the TBB in the math library, how do we not oversaturate the thread pool? Or is that less of an issue with the TBB?


The TBB will be able to handle multiple levels of parallelism - but I would guess that if you have a good way to parallelise the log-lik evaluation that scales well and efficiently, then you are better off disabling the speculative HMC… although if you have a large number of CPUs available, then you can still benefit from this.

The benefit of this approach is really that it is applicable to any Stan model without a change to the model itself - the other huge benefit is that the proposed scheme can use our existing AD system once we go ahead with the TBB. There is no need to refactor the reverse mode AD system; it just works out of the box.

Any Stan interface would only need to export one more sampler option. It’s like choosing between a dense and a diagonal mass matrix. Thus, this will be usable by a huge user base almost instantly.

I don’t have a strong opinion, but I do have some speculation about the future. My main concern is that I think we’d like to move towards parallelizing most Stan models in other ways, like

  1. the work that folks like Ryan and Matthijs are doing on the new compiler - automatically figuring out which parts of the program are independent (which gives us multiple instruction multiple data independent chunks, which are hard to do a lot with), and
  2. allowing users to more easily or even automatically farm out the processing of large chunks of data (single instruction multiple data style - much easier to parallelize and think about). Think an easy map call or a Stan program that knows about vectorized functions and how to automatically map them over multiple cores.

Neither of those will work on literally all models, but they should work on most models, and the improvement should typically be larger than 20%. That said, that work is more speculative and further out in the future, so even if it eventually supersedes a parallel sampler there is some argument for both. I say supersedes here because I suspect that SIMD parallelism in particular will be a much more efficient use of extra cores (because of increasingly wide CPUs with vectorization etc.).


Thanks, Sebastian, for the simulation, which confirms our intuitions about the potential speedups. I would, however, like to point out that the summaries make it clear that the distribution of speedups is quite skewed and grows more skewed for larger treedepths (which isn’t surprising given the logarithmic expansion). Consequently, when speculating about practical speedups one has to consider how this skewed distribution convolves with the distribution of treedepths in a given fit.

In any case, a decision here is complicated by the tradeoffs involved.

Pros: The speedup is universally positive and can’t slow down the chains (presuming no overhead in the TBB execution). Consequently there is a performance benefit for those users who have excess CPUs available. This caveat makes quantifying the overall benefit tricky – in my experience there are more users with fewer CPUs than with more, so this benefit would be limited to a relatively small subset of the community.

Cons: Overall speedups are small. Requires modifying the sampler code (a not unreasonable task, but not free). Maintenance burden of code capable of switching between parallel and serial execution. Maintenance burden of integrating the TBB into Stan core (this would not be a con if the TBB were already being used in Stan core, but given that it’s not yet integrated this would be additional overhead to the contribution).

I think my biggest worry, however, is that the inefficiency of the speculative computation would cannibalize TBB resources from other parallelized code. Even if we presume that the TBB is able to allocate resources to contending parallelized functions without any overhead, the wasted computation of this proposal would consume resources that other functions would not be able to access.

Weighing these pros and cons is furthermore complicated by speculation and prioritization about the makeup of the overall user community. I think it would benefit all of us if we attempt to consider users unlike us when making these decisions.

For context, there are multiple sampler features (like saving the entire Hamiltonian trajectory at each iteration) that offer smallish but uniform speedups which we do not currently implement because of memory or maintenance burdens (saving trajectories quickly blows out memory, and it would require completely rewriting the sample analysis code).

Putting this all together, my opinion as current algorithm lead is that the potential speedups here are not sufficient to warrant prioritizing a feature like this at the moment. This may change in the future once the TBB is more fully integrated into Stan and we have a better idea of how well it can manage multiple sources of parallelization. Personally I think a much more constructive use of the TBB in Stan core right now is to run multiple chains in their own threads, working out how to share a common data source (so the data doesn’t have to be duplicated for each chain – a huge potential speedup for big models) and improving the adaptation to take advantage of sharing samples early on in warmup.


I’m just surprised that you find 8 cores rare - all my colleagues have 8-core workstations, and even my $110 phone has 8 cores… Or are 4 cores a thing with the super-thin laptops all the cool kids are using? I have no real data, and maybe I am atypical… I’m really just curious where that is coming from…


Intel, which supplies CPUs for roughly two thirds of the PC market, doesn’t really offer a wide range of 8-core CPUs, except for the i9, the i7 Extreme series, and a few of the $300+ i7-9700s that came out a few months ago.

So you are probably talking about virtual cores, which isn’t really 8-core. AMD has the Ryzen 7 with 8+ cores, but those CPUs are also pretty fresh and I doubt they are that common.

Embedded ARM CPUs are a different story: they don’t have the baggage of x86, are RISC CPUs, and it’s cheaper to make them X-core, but they aren’t really that common in the PC market. I am crossing my fingers they at least make it into the server market, though it’s a long shot.


I am not sure I get the punch line here. Is it “maybe later”?

To me the mission statement of Stan is to scale up to the toughest problems and solve these in the best possible way. We also want to make everyone else happy, of course. So the bulk of our users will not need high performance, but a smaller circle of users will - and it is moreover a huge argument for the Stan platform to handle small and big problems alike. From that perspective we definitely want speculative HMC, as it is an option which will universally speed up the runtime while not requiring any modifications to your Stan model.

Let me get to your points in order as to what I think:

The TBB will handle the nested parallelism just fine. However, the TBB is not able to guess which parallel program regions are more efficient than others. So the speculative HMC stuff needs to be manually turned on or off. The TBB has concepts like task arenas so that you can limit the resources per chain if you want. Another scheme would be to dynamically turn this speculative feature on once some chains finish early. This gives the freed resources to the remaining ones. Very cool!

Even if we assume that users only have 4 cores - they are free to run just 2 chains, but with speculative HMC. That’s a totally reasonable choice when you are doing model development and want to make progress quickly. You may disagree with that, but I am sure many would see this as a very practical way of progressing. The final model run would still be with 4 chains, of course.

In addition, CPU vendors are by now increasing their performance through packing more cores on the die. Intel is lagging behind but will catch up, and AMD is already giving us CPUs with many, many cores.

Finally, Stan is a high-performance platform - so a smallish user base requiring these types of features is absolutely within our scope.

I don’t think this point about skewness matters:

  • In any given Stan model run the treedepth is well defined (from what I have seen), so there is not much variation.
  • Even if there is some variation, the theoretical speedups are stably >30% already at a treedepth of 3.

So we are always getting >30% speedup on average, and that is what counts to my understanding.

As I laid out we can probably have a single code base; at least we can attempt that.

We are going to get the TBB as it looks right now. Most of the work (Makefile-wise) is already done by myself. So that argument does not really hold.

I heard that point and started to look into it. My first impression is that it needs quite a bit of “surgery” on the codebase, since the code is obviously not set up to run chains in parallel; maybe I need to stare a bit more at the interfaces, but that’s my first take. The speculative HMC is a more local operation within the sampler and is thus easier to get in.

Let’s see what people at the Stan meeting have to say about this today. Probably the least I will do is refactor my current prototype once and push that branch to the stan repo along with an issue. That way we can return to this at any time (or just implement it now).



Given the focus of this thread, I thought I’d share the state of computing in my close research community, and maybe in a big part of applied genetic/genomic research.

Given that genetic research is entering the doors of “big data”, the majority of us have a 24/48/56-core machine available. The majority of the tools we use parallelize; they would otherwise be unusable/impractical. I feel the biggest barrier to applying Bayesian inference to genetic problems is now execution time, not cores (my laptop already has 6 cores and 12 threads, for example).

Moreover, I feel that the attention often goes to first-party users (me developing a model for a one-off-ish analysis, and willing to wait longer than I would if I used third-party software), but with scalability a great opportunity is to convert many models into tools for thousands of third-party users who are quite impatient and not necessarily familiar with statistics or computing.

If parallelisation becomes easy to implement in the models (at the moment it is incredibly challenging due to the manual data packaging), this trend might flip. Many potential third-party users that have cores available would base their analyses on Bayesian statistics. Whoever builds the models does not necessarily have to be the end user.


I am finishing a project where our client said exactly what you are saying, @stemangiola: that they would have many potential users if there were any way of speeding all this up some more. Whether it’s CPU parallelism or GPU parallelism, they don’t really care. They have resources for the hardware.

In the end we did a custom gradient that was moved to OpenCL, and we went from 55 ms per gradient to 0.6 ms per gradient, which means the model execution goes from 30 hours to 20 minutes for their currently largest dataset, which is expanding all the time (it’s data from clinical trials). That is now going to be distributed to their other users as an R package on Github that will essentially be a very specific cmdstanR interface. That is way too long a path to speedups we could get fairly close to with a general approach (with the TBB, OpenCL or whatever).

Don’t get me wrong, I am fine re-doing this whenever anyone wants (I do love seeing speedups of ~100 :) ), since it gets me funds that allow me to have more time to work on Stan Math development in the coming months. But we should aim for a general approach that will help all users. We just need to get the TBB thing rolling for CPU parallelism. We also have some ideas how OpenCL can help with all larger problems, not just specific functions and GLMs.

Sorry to get a bit off-topic.

To get back on topic:
I agree that SIMD parallelism will be better in the end, and as I already said I am offering my assistance wherever I can to get us to that promised land. Whether we want to include Sebastian’s idea presented here in the meantime is above my pay grade or knowledge of HMC :) I do find it awesome.


The Stan meeting is in 20 minutes (10AM EST September 12, 2019). I guess I’m late on this, but if someone watching this wants to discuss this more you are welcome to join: Stan General meeting, Sept 12, 2019 11am EDT

+1. Maybe the world where this makes sense meets the following criteria:

  1. This is off by default and only turned on with a flag
  2. The code paths for single-threaded and parallel are the same, thanks to the wonders of the TBB
  3. The code is not significantly more complicated afterwards. Maybe something like less than ~20% code bloat?

I think this is ultimately up to @betanalpha, just giving my two cents on how I might think about things like this.


The issue to keep record of this is here:

The code (without any refactor) of the prototype is here:

That prototype is not coded prettily at all - but given all the counterarguments I will probably leave it there for the time being and not spend more time on it unless things turn around.

@bbbales2 if you see a way to handle the new sampler with this scheme, I would be curious.
