MPI framework for parallelized warmups

Because of this independence, maybe the actual adaptation strategy should be the first thing to design.

I definitely agree with this. I think once we figure out how to let users choose to run multiple chains in a Stan program and do adaptation with that, we will have a better sense of the scope for doing it in parallel.

i.e. with 4 chains and 16 cores, would we want to do something like run 4 chains in adaptation/warmup with 4 cores available to each, but then after warmup replicate those 4 chains 4 times and run 16 chains at once? I don't know if we need to answer that here and now, but there are patterns to think about at both the adaptation-strategy and parallelism levels where breaking them up would be simpler.
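
Rough sketch of that first pattern, purely hypothetical (std::thread over chains plus a capped TBB arena per chain for any within-chain work):

```cpp
// Hypothetical sketch only: 4 warmup chains on a 16-core machine, each chain
// capped to a 4-thread TBB arena for its own within-chain parallelism.
#include <tbb/task_arena.h>
#include <thread>
#include <vector>

void warmup_chain(int chain_id) {
  (void)chain_id;
  // placeholder: this chain's warmup; anything it schedules with TBB
  // (e.g. tbb::parallel_for) stays inside its 4-thread arena
}

int main() {
  const int num_chains = 4;
  const int cores_per_chain = 4;
  std::vector<std::thread> workers;
  for (int c = 0; c < num_chains; ++c)
    workers.emplace_back([c, cores_per_chain] {
      tbb::task_arena arena(cores_per_chain);  // cap this chain's concurrency
      arena.execute([c] { warmup_chain(c); });
    });
  for (auto& w : workers)
    w.join();
}
```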

Changing adaptation is more subtle than just halving the length of the chains because of the structure of warmup: an initial window (for finding the typical set without any adaptation), a terminal window (for finalizing the step size), and the multiplicatively expanding adaptation windows in between (for aggregating variance information to inform updates to the inverse metric components, with the step size re-adapted after each metric update).
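
For reference, the default schedule works roughly like this sketch (assuming the usual defaults: a 75-iteration initial buffer, a 50-iteration terminal buffer, and a base window of 25 that doubles, with the final window stretched to the end of warmup):

```cpp
// Sketch of the (approximate) default warmup schedule: initial buffer,
// doubling adaptation windows, terminal buffer.
#include <iostream>
#include <vector>

std::vector<int> window_ends(int num_warmup = 1000, int init_buffer = 75,
                             int term_buffer = 50, int base_window = 25) {
  std::vector<int> ends;
  int start = init_buffer;
  int width = base_window;
  while (start + width < num_warmup - term_buffer) {
    // if the next doubled window would overrun the terminal buffer,
    // stretch the current window to fill the remaining warmup instead
    if (start + 3 * width >= num_warmup - term_buffer)
      width = num_warmup - term_buffer - start;
    ends.push_back(start + width);
    start += width;
    width *= 2;
  }
  return ends;
}

int main() {
  for (int end : window_ends())
    std::cout << end << " ";  // prints: 100 150 250 450 950
  std::cout << "\n";
}
```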

Sure, just be aware that the approach that evolves with your prototype might not be chosen, so there's the risk of redundant work. Developing that prototype will definitely help inform the design you want, but like any exploratory development it brings with it the risk of not ultimately being used.

The exact service routes will have to change a bit. First they will have to be able to run multiple chains, and then break warmup up into chunks to allow inter-chain communication.

Which to prioritize is totally up to the person developing the design.

Personally I think that we want to prioritize chain parallelization first because it is beneficial in and of itself.

  1. Service routes that run multiple chains and create multiple streams of output.
  2. Parallelizing those routes, allowing all of the interfaces consistent parallelization without having to implement it themselves.
  3. Progressive warmup to allow for inter-chain communication.
  4. Adaptive warmup through that inter-chain communication.

The issue isn't ease of coding or when the tool was designed; it's that threading has a hard limit on parallelization based on the number of cores on a single machine.

Three things happen in phases during adaptation: (I) finding the typical set [what is usually done through burn-in even if there's no adaptation], (II) mass matrix estimation, and (III) step size estimation.

@bbbales2 has an R package out that does adaptation adaptively in chunks with restarting.

I agree that in terms of statistical efficiency, we don't need to worry about communication. But for the actual implementation, it tends to be the dominating factor.

Mine, too. But I think the answer is the same as we have for any algorithmic improvement, along with whatever consideration there is for portability and pain of coding.

I don't even know how to build a prototype without at least a rough design doc. But I realize different programmers approach problems differently.

I don't think speed of implementation should be our primary criterion in selecting an implementation. On the other hand, someone's going to need to build something to make the feature real. And multi-threading would be better than nothing.

I'm not sure what you mean by "route" here. The signatures of the service functions only have to accommodate multiple-chain config and output. Or are you imagining arguments configuring the parallelism beyond the number of chains and the back-end config of TBB or MPI?
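
Concretely, something like this is all I have in mind (a hypothetical sketch with placeholder types, not an actual stan::services declaration): the per-chain pieces of a single-chain signature just become per-chain vectors.

```cpp
// Hypothetical sketch only, not the real stan::services API.
#include <cstddef>
#include <vector>

struct var_context {};  // stand-in for stan::io::var_context
struct writer {};       // stand-in for stan::callbacks::writer

template <class Model>
int sample_multi_chain(Model& model, std::size_t num_chains,
                       const std::vector<var_context>& inits,      // one per chain
                       int num_warmup, int num_samples,
                       std::vector<writer>& sample_writers,        // one per chain
                       std::vector<writer>& diagnostic_writers) {  // one per chain
  for (std::size_t c = 0; c < num_chains; ++c) {
    // call the existing single-chain machinery with inits[c],
    // sample_writers[c], and diagnostic_writers[c]; this loop is also the
    // natural place to later swap in a parallel scheduler
  }
  return 0;
}
```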

I would also really like to consider continuous adaptation. That is, something that doesn't run in discrete blocks in phases I, II, and III as we're doing now, but integrates them into one bigger, smoother adaptation.

Only if there is a lot of communication. Parallelizing the existing adaptation routine, where updates are made only at the end of a window, would require very little communication relative to the amount of computation done on each thread/process/etc. (is there a general name for the abstract unit that processes its own set of instructions?).
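
As a rough sketch of how little has to move (assuming one MPI process per chain; the accumulator names here are hypothetical), the end-of-window exchange is just a few reductions over length-dimension arrays:

```cpp
// Rough sketch, one MPI process per chain: at the end of an adaptation window
// pool each chain's per-dimension draw sums so every chain can update its
// inverse metric from the combined variance estimate.
#include <mpi.h>
#include <vector>

void pool_window_statistics(std::vector<double>& sum_x,     // per-dimension sums
                            std::vector<double>& sum_x_sq,  // per-dimension sums of squares
                            long& num_draws) {
  // Only O(dimension) doubles cross the network once per window, versus the
  // thousands of gradient evaluations each process computes in between.
  MPI_Allreduce(MPI_IN_PLACE, sum_x.data(), static_cast<int>(sum_x.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, sum_x_sq.data(), static_cast<int>(sum_x_sq.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, &num_draws, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
}
```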

Right now the service functions just run chunks of iterations without enough flexibility to be able to communicate. Either those have to be refactored to allow communication, with new functions wrapping everything to be exposed to the user, or completely rewritten to integrate the parallelization.

Ultimately we need a dispatch function that is able to run parallel chains with intermittent communication, and at some level that means pushing the existing service functionality into single-chain-segment functions that the dispatch function can call.
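
Something shaped roughly like this, where the names and types are placeholders rather than existing Stan functions:

```cpp
// Hypothetical sketch of the dispatch pattern: advance every chain by one
// warmup segment, exchange adaptation information, repeat.
#include <vector>

struct ChainState {
  // placeholder for sampler state plus adaptation accumulators
};

// placeholder for a single-chain-segment function carved out of the existing
// service machinery: advances one chain by `len` iterations
void run_segment(ChainState& chain, int len) { (void)chain; (void)len; }

// placeholder for the cross-chain step: pool window statistics, update the
// step size and metric, reset the accumulators
void update_adaptation(std::vector<ChainState>& chains) { (void)chains; }

void dispatch_warmup(std::vector<ChainState>& chains,
                     const std::vector<int>& segment_lengths) {
  for (int len : segment_lengths) {
    for (auto& chain : chains)  // each segment is the obvious unit to parallelize
      run_segment(chain, len);
    update_adaptation(chains);  // the only point that needs communication
  }
}
```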

Not quite.

Finding the typical set is not the entirety of non-adaptive "burn-in". To equilibrate, a Markov chain has to run long enough to find and then explore the stationary typical set enough to wash out any influence of the initialization, and hence any nontrivial bias in the MCMC estimators, which takes a little while.

Phase I of the current adaptation just tries to give the chain time to find the stationary typical set without worrying about equilibrating into it, so it's only a subset of "burn-in".

II and III are not separate phases but rather components of a single phase that tries to finish the equilibration while adapting the configuration of the dynamic HMC transition. Because this phase begins while the MCMC estimators are still biased (not to mention highly variable), the metric adaptation proceeds in windows (collect a running estimator within each window, then update at the end of the window) to stabilize the estimation. The step size adaptation proceeds continuously because it's less vulnerable to those initial errors: in particular, because the adapt statistic is collected over the entire trajectory, it contains more information, which leads to less variability once we've found the typical set.

As I noted above, only the metric adaptation is discrete. The estimation of quantities like target variances is error-prone early on, which makes continuous adaptation really, really hard. The estimators have to be very carefully regularized to avoid pushing the sampler into a bad configuration early on that then prevents sufficiently effective exploration to collect enough information to improve the estimators. Moving to the current windowed adaptation significantly improved the stability of the adaptation way back in the day.
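
Concretely, the windowed update already shrinks each window's sample variances toward a small baseline before they touch the inverse metric, roughly like this (a sketch from memory, not a verbatim excerpt of the source):

```cpp
// Sketch (from memory) of the windowed variance regularization: shrink the
// window's sample variances so a short, noisy window can't push the metric
// into an extreme configuration.
#include <cstddef>
#include <vector>

std::vector<double> regularized_variances(const std::vector<double>& sample_var,
                                          double n /* draws in the window */) {
  std::vector<double> var(sample_var.size());
  double weight = n / (n + 5.0);  // trust the data more as the window grows
  for (std::size_t i = 0; i < var.size(); ++i)
    var[i] = weight * sample_var[i] + 1e-3 * (1.0 - weight);  // small positive floor
  return var;
}
```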

By all means experiment with more continuous strategies; just remember that you're fighting against an equilibrating Markov chain and not some stationary process, so most online adaptation schemes end up pretty fragile.

Thanks for the clarification, @betanalpha.