Evaluating parallelization performance

I’d like to consider what we’re evaluating when we consider parallelizing evaluation of the the log density or MCMC algorithms. I was writing a comment for Stan issue #2818, but thought the discussion should be generalize. That issue and associated branch illustrate a 35% speedup of a single chain by using two cores to evaluate the Hamiltonian forward and backward in time simultaneously, but the percent speedup isn’t so important compared to when we need it.

There are two primitive evaluations.

  • Task 1: speed to convergence

  • Task 2: n_eff / sec after convergence

For task 1, parallelizing a single chain dominates. For task 2, running multiple chains dominates because MCMC is embarassingly parallel. To summarize,

  • After convergence, we’re better off running multiple chains than parallelizing a single chain. Before convergence, we’re better off parallelizing log density evals and the MCMC algorithm .

What I think we do in practice is a combination of tasks 1 and 2.

  • Task 3: speed to given n_eff

We might target n_eff = 1 for debugging and rough model exploration. We’ll target n_eff = 100 or 1000 for inference, though our editors and users may demand more.

The smaller the n_eff target in Task 3, the larger the relative performance gain we’ll get for parallelizing log density and HMC algorithms.

What I’d like to think about going forward is how to

  • massively parallelize adaptation, and
  • how to monitor adaptation for convergence so we don’t do more of it than we have to.

During adaptation of static Euclidean HMC, our goal is to

  • find the typical set, and then
  • explore the typical set thoroughly enough to
    • estimate covariance (for the metric), and
    • adapt step size to match acceptance target.
7 Likes

I absolutely agree on that. In all the discussions this sticks out to me as a very straightforward thing to do and it will hugely improve things. Once we have threaded warmup with multiple chains it will pay off to start multiple chains in parallel for warmup which is right now a painfully slow process since the information is not shared between he chains.

The only obstacle I see in doing that is the stan services which are not build to handle multiple chains - I will look a bit further into the matter and… let’s see… maybe there is an elegant way.

(I won’t have lots of time this month though to do it)

Now, speculative HMC could still be useful to be turned during some phases of the warmup - like finding the typical set, for example. Thus, the speculative HMC should still be on our horizon, I think; because dynamical switching it on or off as needed can improve in certain phases of what we have to do - like finding the typical set or whenever some chains finish early the freed computational resource can be given to the other chains.

EDIT: This reminds of one thing: We should really switch the reported time in stan services from CPU time used to wallclock time used. It’s neff / wall clock time what we are interested in.

3 Likes

Similar thing hit me in a different context: if we can control the behavior of some math function according to where the sampler is, there are some performance gain can be made. Like in ODE models sometimes even if I know the problem is non-stiff I have to run it in stiff solver because the sampler stepped into regions the solution becomes stiff.

Within-chain parallelization can be done in various flavors, and currently we are not designing it in a organized way(which is the point of the thread, I guess). There are things easier/better to be done in distributed level, and things better in thread level. Even in a same level we can have multiple tiers. Unless we can reach a design decision there are going to be many iterations in future designs.

2 Likes

“Convergence” is a term that, while useful, has some limitations when it comes to the more formal side of the theory.

Intuitively a Markov chain will initial “converge” into the neighborhoods of high probability within the the typical set before settling in and exploring the span of that typical set. These initial out-of-the-typical-set samples states are not really useful for informing MCMC estimators and hence can be discarded. This is the original motivation for the warmup/burnin process (in Stan warmup is dominated by adaptation which uses the early exploration of the typical set to tune the sampler configuration).

There is another “convergence”, however, that happens once the Markov chain has found the typical set. This is the convergence of the central limit theorem. See even if an MCMC central limit theorem holds in a given application it is an asymptotic property; for finite iterations there is an \mathcal{O}(1 / \text{ESS}) bias between MCMC estimators and the exact expectation values. For sufficiently large effective sample size this bias is negligible compared to the MCMC standard error and is hence typically ignored, but if the effective sample size isn’t large enough then the bias can be really important.

Because of this estimators defined as averages over lots of short chains will inherent the bias and not converge to the exact expectation value as expected from a central limit theorem. In other words there is a certain minimum effective sample size per chain that is required before chain estimators can be beneficially pooled.

To be safe we’d want at least \mathcal{O}(10) effective samples per chain before merging their estimates together. This significantly limits the potential speed up from running parallel chains. For the typical use cases where we need hundreds of effective samples to get sufficiently precise estimators hundreds of parallel chains won’t offer much practical benefit.

2 Likes

Thanks, Michael.

If it’s bias and not just error, what’s the bias? Is it toward the starting point in some way? Is this written up somewhere I can try to read it in more detail?

Does this hold for the way Stan does adaptation? In Stan, the starting point is already the result of a long MCMC chain run during the last of the phase II (“slow”) warmup blocks. I thought for adaptation to converge, we need good mixing through the posterior. Isn’t that going to be at least {\mathcal O}(10) draws, or are those in some way biased, too?

One of the things I wanted to do was focus evaluation away from marginal effective sample size per second because we only need a limited sample size when we’re done. Let’s say we need an effective sample size of \mathcal{O}(10^2) or \mathcal{O}(10^3). That limits chain-level parallelism to \mathcal{O}(10^1) or \mathcal{O}(10^2) if we need \mathcal{O}(10^1) draws per chain to remove bias. So you can’t just throw 10^3 processors at this problem because we don’t need 10^4 effective draws.

2 Likes

When I am suggesting to double the number of cores I am being heavily criticized…and now we discuss 100 cores???

I don’t quite get that.

Anyway, I would prefer to start with a scheme which is deterministically always giving the same. My hunch is that we should synchronize adaptation info whenever the current windows close. This way each warmup chain can explore things fairly independently for quite some time.

Synchronizing every iteration is not feasible, obviously. One could use buffers per chain to get around some of the problems… but that’s all not so nice.

What I have in mind to synchronize at the end each adaptation window could probably be coded up fairly easily with the Intel TBB dependency graph thing. So I imagine it to be relatively straightforward, let’s see.

More cores is the only way we’re going to get speedups going forward! I don’t think anyone’s opposed to that.

That’s a great place to start just to see if there’s an advantage from sharing across chains. I’m particularly curious about what happens as the number of chains grows from one to hundreds. Particularly in terms of converging better or faster (in number of iterations, not wall time).

It could be an interesting, but somewhat dangerous, approach to weed out bad warmup chains which run slowly. So we Start with more warmup chains than we want to use and stop the slow ones in early stages. But let’s start simple.

Do we have models where we know “optimal” stepsize & inverse metric?

Are the stepsizes / inv metric following some distribution? Say we do warmup 10k times (10k chains), what comes out.

Models to test with “iid stuff”, “highly correlated” and “funnel like”.

Also, how bad is just to take median / mean for stepsize + inv metric, if it is really nD distribution.

Good thoughts. Let’s get the tbb in and then we start with the low hanging fruits… there are some to grab!

This is all related to @andrewgelman’s desire to run chains based on time. So we run four chains for 2 minutes each and take what we get. I believe he and @yuling think stacking can sort out the bias, but I haven’t seen any evaluations of that.

Multivariate normal. The inverse metric’s just the covariance and it doesn’t vary. Optimality for stepsize depends on the goal as the answer’s different for different parameters, like even theta and theta^2 can vary in optimal choice of stepsize.

With a lot of draws, we can estimate posterior covariance pretty well for models we can fit. What should happen is that when we run a bunch of chains, the estimated covariance and stepsize should converge in all chains to the same value.

You mean look at all the chains and average their stepsizes and covariance metrics? If all the chains are the same size, I think this’ll give the same result as pooling all the draws.

Hi, see below.

Great post(s)! Totally agree. I’m wondering (and maybe you are too) about the mechanism of this post-warmup bias present in each chain - might it cancel out with many chains, for example?

But doing chain shooting during warmup should be fine to do. Nothing to worry about or am I wrong?

Basically, although you have to be careful in exactly what you are comparing. It’s a bias in the expectation value estimates due to the N-step distribution being biased towards a delta distribution at the initial point.

It’s talked about in the derivations that go from geometric ergodicity to central limit theorems a bit, so it should be in Meyn and Tweedie and Douc et al (https://link.springer.com/book/10.1007/978-3-319-97704-1). It’s also discussed in https://arxiv.org/abs/1708.03625.

If the warmup phase goes well then it should take care of the necessary samples, but then you still have the warmup overhead for each chain.

Exactly.

Parallelizing chains, when done correctly as discussed above, gives you linear speed up of effective sample size with respect to each core. The doubling scheme you proposed gave much worse than linear speed ups. That said, 100 cores is unlikely to be of much use given the total effective sample sizes needed in practice.

Right, this is what I was suggested on the GitHub thread.

Noooooooooooooooooooooooooooooooooo! Poor adaptation almost always means pathological interactions between the target distribution and the sampler. In that circumstance the scaling arguments become irrelevant. Specifically the effective sample size is then no longer related to estimator error bounds so effective sample size scalings don’t matter.

The step size and metric are not probabilistic objects – they are degrees of freedom in the sampler. One can work out optimization criteria for these, but they yield deterministic optima. When running adaptation to try to achieve those optima we get probabilistic estimators of the optima, due to the probabilistic history of each Markov chain, but that is a function of the estimation and not the ideal values.

The optimal step size certainly depends on the optimality criterion chosen, but all optimality criteria that have so far been introduced are global and do not consider particular the estimation of particular functions.

No, that’s why we call it a bias and not a variance-- biases don’t average out over multiple realizations. Pooling chains is much more subtle than is typically presented in the literature – indeed much of the confusion/people talking past each other in the “one long chain or many small chains” debates of the 1990s was due to the opposite sides not carefully considering these important behaviors!

Stacking is mostly useful and justified when all parallel chains themselves are approximately convergent (their stationary distribution). That is a niche situation where the posterior has some metastable regions and the transition between regions are difficult, and the geometric ergodicity may not hold in the first place. Hence parallelization with random initialization speeds up both Task 1 and 2.

Back to the “one long chain or many small chains” debate in the 90s, some people also suggested to discard bad warmups, or restart in a single run. The resulting inference is biased. Take the funnel as an example, any chain with an initialization \tau \approx 0 will move very slowly under most samplers, while \tau \approx 0 is indeed part of the typical set. In view of this, we may still keep the “bad” chains but reweight them. Stacking is a somewhat adhoc but potentially useful approach for this task.

Another more extreme but motivating situation is Riemann sum/grid approximation: which can be viewed as a very large number of chains, and each chain only contain 1 sample that is drawn from the initialization (typically a uniform grid). These chains has not mixed, and they are not convergent to their stationary distribution, and CLT does not kick in within chain. Nevertheless, a Riemann sum take all these non-converegne into account and reweigh each chain differently. Here we are asking for a third level of convergence-- besides the ergodicity and CLT as Michael point out-- what will happen if the number of chains also goes very large and we reweigh them properly (either by importance weights in Riemann sum, or by chain-stacking).

I was imagining that if, as you say, basically each chain is biased towards its starting point, and then you chose a bunch of starting points all equidistant in some kind of hypersphere around the typical set, then they could balance out. First brush I thought that might scale poorly with dimension, but you don’t really need to cover all of that volume, you just need each chain balanced against a chain on the other side of the typical set. But if we knew where the boundaries of the typical set were… Anyway, just brainstorming here.

Thanks for the references. I don’t think I’m tall enough to go on those rides!

Yes. This is why I’m arguing that we want to evaluate something like time to ESS = 1000 rather than ESS/second post-warmup.

@andrewgelman who was arguing that we could run multiple chains for a given amount of time each then use stacking to combine them. @betanalpha is saying that this leads to irreducible bias and I think you’re argeeing here.

It not only scales badly with dimension, but it doesn’t take the posterior geometry into account. What we need is balance within the typical set, not balance of starting points. And as you point out,

It might seem like sequential Monte Carlo (SMC) would help, but the problem there is that SMC’s resampling will favor the slow moving chains in the funnel, because those are the ones stuck in the high density, high curvature neck.

Overall I agree, but I do think that ESS/second provides useful local information about the equilibrium behavior of a sampler, which is a component in the time to 1000 effective samples calculation. In particular it will determine just how much can be spread out over multiple chains.

This is related to the fact that SMC solves a fundamentally different problem than MCMC. MCMC tries to generate a new sample from the same distribution as a previous sample, while SMC tries to generate a new sample from a different distribution as the previous sample. The latter is much more difficult and will suffer from not only the MCMC pathologies but also new pathologies such as particle collapse.

I don’t think there’s any single magic measure of efficiency. I like ESS/second because I understand it and it converges to something as simulations go longer. For some applications, time to ESS=1000 makes sense, but for many applications (especially those with clean geometries as we often see in routine applied statistics), ESS=1000 is overkill.