Has any work/thought been given to parallelizing the sampler itself? For instance, assuming I shuffle gradient evaluation out of Stan to somewhere else (with e.g. MPI) then the sampler itself presents some overhead (10s to 100s of ms), especially for larger numbers of parameters (1M+). Is there a good reason the sampler should be single threaded?

The HMC sampler is inherently sequential: you need to know the previous gradient before you can decide where to evaluate the next gradient. Only a little bit of parallelism is possible; the most recent discussion is in

Thanks, I was wondering more in terms of parallelizing the symplectic integrator over unconstrained parameters or perhaps more general issues such as memory allocations if there was room for speed up on large models. For instance, I see a lot of allocation going on during the build_tree part, and would wonder if preallocating could help.

Symplectic integrators are also sequential up to the gradient evaluation, but that just reduces to parallelizing the target density function.

There's a decent amount of allocation in the dynamic trajectory evaluation, but the actual memory footprint at any given time is quite small. The cost of those memory allocations is unlikely to be substantial relative to the gradient evaluations themselves, but the sampler is actually pretty amenable to pre-allocation by replacing Eigen::VectorXd with a wrapper class that overloads allocation, similar to what's done in the autodiff library.

We were just discussing this in another thread starting with this comment from @Red-Portal.

Memory pressure becomes a much bigger deal when parallelizing, as typically only a relatively tiny L1 cache is on-core while the L2, L3 caches and RAM are shared. There's also the problem of memory locality with a bunch of ad-hoc allocations. Allocating everything together in contiguous memory could be a big saving. This is something I really messed up in coding Stan arrays as C++ `std::vector`, as we have this problem all over the place.
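To make the locality point concrete, here's a toy sketch (the `flat_matrix` name is hypothetical, not anything in Stan): a nested `std::vector` puts each row in its own heap allocation, so rows can land far apart in memory, while a flat layout with computed indexing keeps the whole array contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Nested std::vector: each row is a separate heap allocation, so rows
// can scatter across the heap and fight for cache lines.
using nested = std::vector<std::vector<double>>;

// Flat alternative: one contiguous allocation with computed indexing,
// so row-major traversal touches memory sequentially.
struct flat_matrix {
  std::vector<double> data;
  std::size_t rows, cols;
  flat_matrix(std::size_t r, std::size_t c)
      : data(r * c, 0.0), rows(r), cols(c) {}
  double& operator()(std::size_t i, std::size_t j) {
    return data[i * cols + j];
  }
};
```

The flat version also makes the total footprint a single allocation, which matters once many chains contend for shared L2/L3.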

There's some relevant discussion in a systems paper published a while ago:

The paper shows that Stan models become L3-cache bound as the number of chains increases. The paper also proposes prefetching-based fixes to this problem, but I'm not sure if these actually made it into Stan?

Thanks, @Red-Portal. I hadn't seen this paper, but I'll take a look. I'm not surprised we're L3-cache bound.

One of the things that's improved over time is that we can now use shared data with multi-threading rather than copying data in multiple processes. I'm not entirely sure how much that's filtered through to implementations in `cmdstan`. Nevertheless, it's not going to solve the random access problem of large heterogeneous data sets or even reasonable access across unsynchronized threads.

Google TFP has solved this problem by synchronizing the evaluation of the log density across multiple cores/GPU calls. It requires heroic efforts on the algorithm side, but it's going to be worth it for speeding things up. For example, see Matt and crew's paper An Adaptive MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo or their earlier large group paper led by Junpeng Lao, tfp.mcmc: Modern Markov Chain Monte Carlo Tools Built for Modern Hardware.

I think when we start talking about 1M+ parameters like @maedoc is saying above there are no small footprints :)

It's def possible to implement a small local allocator for NUTS. The most bare-bones version would just pass a local `stack_alloc` to each `transition()` call, and every time we want to make a `ps_point` or Eigen vector inside of recursive NUTS we get our memory from that local `stack_alloc` (and reset the `stack_alloc` after each transition). We would need to make some changes to `ps_point` so that instead of holding vectors/matrices they hold `Eigen::Map` types, and it would need a constructor that takes in a `stack_alloc`. It would take a little elbow grease but I'd list it as very possible.

Since we are tossing ideas around, I should point out that being sequential can be orthogonal to parallelism. For numerical integration there is parallel-in-time (e.g. Parallel-in-Time (PinT) Algorithms · GitHub), with parareal being the first implemented. One way to think of parallel-in-time is to consider the time integration as solving one large system, with the solutions lining up in the unknown vector and the integration scheme in the lower-triangular coefficient matrix, and then solving the entire system with multigrid methods. For NUTS, what I haven't thought through is how to bake the proposal/rejection step into such a scheme.
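For a feel of the idea, here is a toy parareal iteration on the scalar ODE y' = a*y (not NUTS, just the numerical-integration kernel; all names are illustrative). The expensive fine solves on each subinterval are mutually independent and are the part one would parallelize; a cheap coarse sweep couples the intervals sequentially. After at most N iterations the scheme reproduces the serial fine solution.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Explicit Euler: the only integrator used here, at two resolutions.
double euler(double y, double a, double dt, int steps) {
  for (int i = 0; i < steps; ++i) y += dt * a * y;
  return y;
}

// Parareal on [0, T] split into N subintervals:
//   U_{n+1}^{k+1} = G(U_n^{k+1}) + F(U_n^k) - G(U_n^k)
// with F = 100 fine Euler steps, G = 1 coarse Euler step.
std::vector<double> parareal(double y0, double a, double T, int N, int iters) {
  double H = T / N;
  std::vector<double> U(N + 1, y0);
  for (int n = 0; n < N; ++n) U[n + 1] = euler(U[n], a, H, 1);  // coarse init
  for (int k = 0; k < iters; ++k) {
    std::vector<double> F(N), G_old(N);
    for (int n = 0; n < N; ++n) {       // fine solves: embarrassingly parallel
      F[n] = euler(U[n], a, H / 100, 100);
      G_old[n] = euler(U[n], a, H, 1);
    }
    for (int n = 0; n < N; ++n) {       // sequential coarse correction sweep
      double G_new = euler(U[n], a, H, 1);
      U[n + 1] = G_new + F[n] - G_old[n];
    }
  }
  return U;
}
```

The open question raised above is real: this machinery assumes a fixed integration span, whereas NUTS decides trajectory length and acceptance on the fly.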

Sure, but one also has to consider the *relative* footprint compared to other computations. Allocating memory for vectors with a million elements will certainly start to become more expensive, but will that cost be substantial compared to the gradient evaluation?

Yes, sequential is not mutually exclusive with parallel, but it does mean that parallelization is possible only within the computation at each iteration.

For Hamiltonian Monte Carlo we're not using general numerical integrators but rather symplectic integrators, almost all of which are built from sequential updates that don't afford much opportunity for parallelization beyond the gradient evaluation.

Other than the fact that MCMC overall is embarrassingly parallel: run N chains and get N times the expected ESS. The serial bottleneck is getting that first effective draw, in the sense that with N processors we can get N draws in the time it takes a single processor to get 1 draw. This is why I'm so keen to parallelize and speed up our warmup stages.

Unfortunately Markov chain Monte Carlo does not work that way.

In order to pool Markov chain Monte Carlo estimators from multiple Markov chains, each of those estimators needs to be well-defined. For finite-length Markov chains this means that each chain needs to be run long enough for the initial bias to have decayed and a central limit theorem to have kicked in. The decay of the bias scales as N^{-1} while the error of estimators when a central limit theorem holds scales as N^{-1/2}, so that even under ideal conditions each Markov chain needs to be run long enough for N^{1/2} \gg 1. In practice the coefficients matter and it's better to think about the effective sample size for a given estimand instead of just the raw N, but the general considerations are the same. Perhaps more importantly, in practice we also have to consider the adaptation burden.

A *very* aggressive strategy would be to run until the effective sample sizes for the relevant estimands were all greater than 10. If we need only effective sample sizes of O(100) to achieve reasonably small estimator error, then there's not much to be gained with more than 10 or so Markov chains.

Because of the sequential nature of Markov chain Monte Carlo it's always more efficient to extend an initial Markov chain than it is to start a new one from scratch, which limits the "embarrassing" parallelization opportunities.

I think we have to be careful with this based on the application. The simple fact of the matter is that MCMC is embarrassingly parallelizable by running multiple chains. The expected ESS is N times as large running N chains for M steps as the expected ESS for running 1 chain for M steps.

However, @betanalpha's point holds that you need to run each chain long enough to diagnose that you've converged. Or you have to move over into the regime that @charlesm93 is exploring using a nested notion of R-hat, where the goal is to massively parallelize really short chains.

But even then, you have to get around the transient bias and initialization bias, so there's not a huge gain to be had over ESS = 10 in all chains.

Let me unpack a little bit what is explored in the nested-\hat R (or \mathfrak n \hat R) paper, and how this relates to parallelization schemes for MCMC. Just to nuance @Bob_Carpenter's statement a bit: the chains are not "really short"; rather, they have a really short sampling phase with a warmup that is as short as possible but not shorter.

The first point my co-authors and I make is that it takes more than convergence to diagnose convergence with \hat R. You also need a large effective sample size *per chain*. So even if you wanted to run 10 chains with an ESS of 10 per chain, you probably wouldn't get \hat R to converge to 1. The problem is worse if you want to run 100 chains each with an ESS of 1. \mathfrak n \hat R compares groups of chains, replacing the requirement for a large ESS per chain with one for a large ESS per group of chains (and equivalently a large number of chains). Beware that you need certain initialization conditions for this strategy to be effective, as described in the preprint.
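For reference, here is the classic (non-nested, non-split) \hat R that \mathfrak n \hat R generalizes, as a self-contained sketch: B compares chain means, W averages within-chain variances, and the ratio only approaches 1 when each chain's mean is itself well estimated, which is exactly the per-chain ESS requirement described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Classic potential scale reduction \hat R for M chains of equal length n:
//   B = n * variance of the chain means (between-chain),
//   W = mean of the within-chain variances,
//   \hat R = sqrt(((n-1)/n * W + B/n) / W).
double rhat(const std::vector<std::vector<double>>& chains) {
  std::size_t M = chains.size(), n = chains[0].size();
  std::vector<double> means(M, 0.0);
  double grand = 0.0;
  for (std::size_t m = 0; m < M; ++m) {
    for (double x : chains[m]) means[m] += x;
    means[m] /= n;
    grand += means[m] / M;
  }
  double B = 0.0, W = 0.0;
  for (std::size_t m = 0; m < M; ++m) {
    B += (means[m] - grand) * (means[m] - grand) * n / (M - 1);
    double s2 = 0.0;
    for (double x : chains[m]) s2 += (x - means[m]) * (x - means[m]);
    W += s2 / (n - 1) / M;
  }
  double var_plus = (n - 1.0) / n * W + B / n;
  return std::sqrt(var_plus / W);
}
```

With an ESS of 1 per chain, the chain means B depends on are each estimated from effectively a single draw, which is why the classic diagnostic breaks down in the many-short-chains regime and grouping becomes necessary.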

I have a result which gives you exactly what \mathfrak n \hat R measures before stationarity, with guidance on how to pick a threshold, i.e. what the difference is between 1.1, 1.05, 1.01, 1.001, etc., and I'm working on the optimal splitting of chains into groups. The latter is pushing me to look for more formal ways to define overdispersed initializations and is relevant to the study of regular \hat R. But yeah, this is ongoing work…

Now, even if you run all the chains, you still need to warm them all up, as has been explained in this thread. So you will not get the order-of-magnitude speedup you would expect from massive parallelization. If you can parallelize gradient evaluation, that's probably how you want to use your cores. Furthermore, in the extreme case where each chain only has one sampling iteration, it would only make sense to run a number of chains equal to the target ESS. In my experience, this ranges from 100 to 500. So you're far from utilizing the thousands of cores available on a GPU.

Now we may reason about what it means to *warm up* the chain. HMC's warmup, as implemented in Stan, has two complementary goals: (i) to make the initial bias negligible and achieve approximate stationarity, and (ii) to tune the sampler to reduce the Markov chain's autocorrelation once stationarity is achieved. For the latter, Stan adapts the step size and tries to build a mass matrix based on early estimates of the posterior variance.

If each chain comprises only one sampling iteration, the goal of the warmup reduces to (i). Naturally, tuning the sampler well helps you converge faster to stationarity. But you might expect that constructing the first unbiased posterior sample takes less time than estimating the posterior variance when tuning the mass matrix. In this sense the many-short-chains regime presents opportunities to shorten the warmup. Some of these ideas are explored in the papers by Matt and crew that @Bob_Carpenter alluded to.

At the same time, cross-chain warmup schemes, which share tuning information between chains, do benefit from running many chains, meaning you get faster adaptation per iteration (with some pitfalls I'm happy to discuss).

Something I would like to formalize is the ways in which running many chains with only a short sampling phase may reduce the warmup burden. There are ideas on the subject, though in my view this largely remains an open question and, what's more, a difficult one to answer. One would need to define what the target tuning or chain autocorrelation is, and examine across a range of models whether this target is already achieved once the chains converge.

To sum up, there are straightforward ways to use chain parallelization, but actually making it interesting is less straightforward. What is more, all this is contingent on having access to hardware which supports massive parallelization, though having hundreds of cores is not necessarily out of reach, even on CPU.

Thanks for the clarification, @charlesm93. The language around all of this is garbled in all of our doc. In particular, the term "chain" is misleading, because what we call a chain in Stan actually consists of two phases, only one of which forms a Markov chain, namely

- warmup iterations, which *are not* draws from a single Markov chain, followed by
- sampling iterations, which *are* drawn from a single Markov chain.

As @charlesm93 says, it's the sampling phase that is short here. I don't know how to talk about that concisely and agree that calling the whole chains "really short" is misleading.

The warmup is a nuisance and presents the serialization bottleneck for sampling. As @charlesm93 and others have noted, if you can massively parallelize, you only need ESS = 1 from each chain. Then most of the time is spent on "burn-in" (sorry, @andrewgelman, but I need to distinguish the adaptation part of warmup from the part traditionally called "burn-in"). I think you may actually be able to get away with less than ESS = 1 per chain, but that's a different story.

I originally proposed developing something like Pathfinder to overcome the serialization bottleneck posed by the warmup iterations, which are intrinsically sequential in Stan (and to be clear, my initial approach to Pathfinder didn't work; it took @Lu.Zhang's and @avehtari's insights to get it to work). So I'd like to test the multi-path version of Pathfinder with hundreds of single-path executions followed by some short-chain MCMC. As we show in the paper, it's much faster than using even Phase I of Stan's warmup, but we never evaluated how it compared to running sampling for much longer (though we did evaluate against reference posteriors).

Just for terminology, maybe it would make sense for us to label the first set of iterations as "Adaptation" rather than "Warmup" and then the distinction will be more precise.

I agree with your general point!

If M is large enough that a Markov chain Monte Carlo central limit theorem for the relevant expectands holds, then yes, an ensemble estimator from multiple Markov chains has an equivalent effective sample size of N times the effective sample size of the estimators from each individual Markov chain. But that condition is not trivial.

In other words, generating multiple Markov chains is embarrassingly parallel, but constructing meaningful Markov chain Monte Carlo estimators from multiple Markov chains is a different consideration altogether.

Markov chains don't need to be long enough just for diagnostics.

Firstly, Markov chains need to be long enough for a Markov chain Monte Carlo central limit theorem to kick in. This is the minimal Markov chain length and it *cannot be avoided* without appealing to other estimation methods (namely coupling techniques that pull a single exact sample from each Markov chain to allow for regular Monte Carlo estimation).

Secondly, Markov chains need to be long enough for convergence diagnostics to stabilize and give reasonable sensitivity (this is typically longer than what is needed for the central limit theorem to first kick in, but without theoretical guidance we have no way of determining that minimal time without diagnostics).

Moreover, assuming that a central limit theorem holds, they also need to be long enough to accurately estimate the autocorrelations in each Markov chain, and hence to estimate the effective sample size for each Markov chain Monte Carlo estimator. The more correlated the expectand values across the states in the Markov chain, the longer the Markov chain realizations we need.

Let me try to reiterate a few points.

There is a difference between Markov chains and Markov chain Monte Carlo. Markov chain Monte Carlo constructs Markov chain Monte Carlo estimators of expectand expectation values from Markov chains, but those estimators approximate exact stationary expectation values only in special cases.

The typical approach to Markov chain Monte Carlo relies on the Markov chain Monte Carlo central limit theorem. For sufficiently nice Markov transition distributions, sufficiently integrable expectands, and long enough Markov chains, the Markov chain Monte Carlo central limit theorem probabilistically quantifies estimator behavior. In particular it shows that in these circumstances estimators are unbiased and their precision is determined by the asymptotic variance, which is a stationary property of the Markov transition related to the effective sample size.

The effective sample size is related to the asymptotic variance (expectand effective sample size = expectand variance / expectand asymptotic variance). If a Markov transition distribution doesn't obstruct a Markov chain Monte Carlo central limit theorem then asymptotic variances, and hence effective sample sizes, will always exist, but they will quantify estimator behavior *only for sufficiently long Markov chains where the central limit theorem actually manifests*.
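The relation in parentheses can be made concrete for an AR(1) chain, where the autocorrelations rho_k = phi^k sum in closed form, so ESS/N = (1 - phi) / (1 + phi). A tiny sketch (function name purely illustrative):

```cpp
#include <cassert>
#include <cmath>

// ESS from the quoted relation, specialized to an AR(1) chain:
//   ess = N * var / asymptotic_var,
//   asymptotic_var = var * (1 + 2 * sum_{k>=1} rho_k),  rho_k = phi^k.
// The geometric sum gives ess = N * (1 - phi) / (1 + phi).
double ess_ar1(double N, double phi) {
  double sum_rho = phi / (1.0 - phi);         // sum of the geometric series
  double asymp_factor = 1.0 + 2.0 * sum_rho;  // asymptotic_var / var
  return N / asymp_factor;
}
```

So phi = 0.5 already costs a factor of 3 in effective draws, and as phi approaches 1 the required chain length to reach even a modest ESS blows up, which is the point being made here.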

Typically, by the time a Markov chain is long enough for a central limit theorem to kick in it will already contain a reasonably large effective sample size (nothing is universal here, but around 10 is not a bad approximation for most cases). Consequently effective sample sizes smaller than 1 are never really relevant.

The Markov chain Monte Carlo central limit theorem is even more important when talking about constructing ensemble estimators from multiple Markov chains. One doesn't pool the states within the Markov chains first and then construct an estimator, but rather constructs estimators and then pools those estimators together. When all of the estimators from each of the `M` individual Markov chains satisfy a central limit theorem, the pooled estimator's behavior will also be quantified by an appropriate normal density function, only with a variance smaller by a factor of `M`. Hence the conditional parallelization discussed above.

There is a way to pool states before constructing estimators, but this requires an entirely different Markov chain workflow. The "coupling" methods that have become popular in the statistics literature over the past few years introduce particular stopping rules that, under certain conditions, ensure that the final state of a realized Markov chain is an exact sample from the stationary distribution. As I mentioned above, these exact samples from many Markov chains can then be used to construct regular Monte Carlo estimators of stationary expectation values.

There are similar considerations for how short these Markov chains will be before the stopping rule is triggered, and (preliminary) theoretical analysis (for simple problems) shows that one has to do essentially the same work as above.

Note that all of these considerations influence adaptation as well. Because adaptation is based on Markov chain Monte Carlo estimators any adaptation phase has to be long enough for these estimators to be sufficiently well-behaved. Moreover because poorly configured Markov transition distributions induce stronger autocorrelations they require longer Markov chains for the estimators to behave themselves (even when pooling across Markov chains per the above considerations), which is why warmup will always be more onerous than the main sampling phase.

To what extent is it feasible/advisable to reverse the asymptotics such that (1) each chain is sampled a fixed number (N) of times, and (2) new chains are spawned as old chains end their sampling if a stopping criterion is not reached? E.g., if M is the number of chains then M > 100 and potentially M \to \infty. The idea being that if one has 100 cores then one spawns 100 chains and, as each chain completes (say N = 25 iterations), a new chain is spawned if some stopping criterion is not met or if enough chains have not completed. Now, suppose we anticipate ~1000 short chains; then with 100 cores that is ~10 chains per core (each with 25 samples) = ~250 samples per core. If the cores are differently capable then the cores whose chains finish first spawn the new chains, thereby soaking up the compute capacity completely.

I don't mean to suggest writing code and implementing anything, but more to propose the above thought experiment as a possible alternative computational model for the distant future. E.g., we have new GPUs coming that have up to 144 streaming multiprocessors with 128 cores per streaming multiprocessor. An 8-GPU server should be able to sample as many as 144 * 8 simultaneous chains very easily. From that perspective M can be huge in the distant future (let's say in 10 years' time).
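The scheduling half of the proposal can be sketched as a shared work quota drained by a fixed pool of workers, so faster cores naturally absorb more chains (all names hypothetical; the stopping criterion here is just a fixed chain count, which sidesteps the hard statistical question of when stopping is safe).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Toy scheduler for the "spawn short chains until a quota is met" scheme.
// Each worker atomically claims a chain, "runs" it for a fixed number of
// iterations, and repeats until the global quota of chains is exhausted.
struct chain_farm {
  std::atomic<int> chains_started{0};
  std::atomic<long> samples_drawn{0};
  int quota, len;
  chain_farm(int quota, int len) : quota(quota), len(len) {}

  void worker() {
    // fetch_add hands out a unique chain index; indices >= quota are
    // overshoot, which we undo before exiting.
    while (chains_started.fetch_add(1) < quota)
      samples_drawn += len;  // stand-in for running one short chain of len draws
    chains_started.fetch_sub(1);
  }

  void run(int cores) {
    std::vector<std::thread> pool;
    for (int c = 0; c < cores; ++c) pool.emplace_back([this] { worker(); });
    for (auto& t : pool) t.join();
  }
};
```

Exactly `quota` chains get run regardless of how unevenly the workers progress, which is the load-balancing property described above; the genuinely hard part, a statistically sound stopping rule in place of the fixed quota, is what the reply below is about.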

Guarantees for behavior in these settings are notoriously difficult to work out, and then even more challenging to implement robustly. The key issue is that one would have to work out the general, *non-equilibrium* behavior of the stopping criteria, which will depend intimately not only on the sampler and its target distribution but also on the precise initialization and any ongoing adaptation. False negatives are not so problematic (at worst a small decrease in computational efficiency), but false positives (stopping a Markov chain early) are deadly for the behavior of any corresponding estimators.

Yeah, that makes sense. I am (perhaps naively) much more optimistic about the implementation than the conceptual aspects. I use TensorFlow extensively, and in it almost all the heavy lifting is handled by the distribution strategy. I would imagine that in the coming years, what we want to do can be done by leveraging some modern ML framework to make the code platform- and hardware-independent. The conceptual point of when to stop, as you point out, is much harder to work out.