Cross-chain warmup adaptation using MPI

I’ve been working with @bbbales2 and @avehtari to materialize the campfire algorithm using MPI. The implementation is experimental, and we’d like to hear from you how the algorithm perform for your models.

Setup

The implementation touches all three Stan components(cmdstan, stan, and stan_math). One can retrieve them all using

git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git

MPI is required to compile and run. @bbbales2 points out that installing mpi in Ubuntu was:

sudo apt install libopenmpi-dev

And the include path was:

/usr/include/mpi

Compile

Add

MPI_ADAPTED_WARMUP = 1
CXXFLAGS += -isystem /path/to/mpi/headers

in your make/local. Make sure STAN_MPI is not used here. Note that you may also need

TBB_CXX_TYPE= # cpp compiler

as required by TBB, though it’s irrelevant to this discussion.

In cmdstan, compile the included radon model by

make examples/radon/radon

Run

Depends the MPI library(MPICH or OpenMPI, we haven’t tested it on others)

mpiexec -n 4 -l ./radon sample data file=radon.data.R #MPICH
mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R #OpenMPI

runs the model with 4 processes. By default this would generate 4 chains that communicate with each other during warmup. As shown in the original proposal, the chains would exchange information in order to calculate rhat and ESS, and they are used to determine the efficacy of warmup so that sampling can be started.

Output

stdout would be like

[2] Iteration:    1 / 2000 [  0%]  (Warmup)
[1] Iteration:    1 / 2000 [  0%]  (Warmup)
[3] Iteration:    1 / 2000 [  0%]  (Warmup)
[0] Iteration:    1 / 2000 [  0%]  (Warmup)
[1] Iteration:  100 / 2000 [  5%]  (Warmup)
[0] Iteration:  100 / 2000 [  5%]  (Warmup)
[3] Iteration:  100 / 2000 [  5%]  (Warmup)
[2] Iteration:  100 / 2000 [  5%]  (Warmup)
[0] iteration: 100 window: 1 / 1 Rhat: 1.02 ESS: 12.39
[0] Iteration:  200 / 2000 [ 10%]  (Warmup)
[2] Iteration:  200 / 2000 [ 10%]  (Warmup)
[1] Iteration:  200 / 2000 [ 10%]  (Warmup)
[3] Iteration:  200 / 2000 [ 10%]  (Warmup)
[0] iteration: 200 window: 1 / 2 Rhat: 1.01 ESS: 24.14
[0] iteration: 200 window: 2 / 2 Rhat: 1.02 ESS: 39.90
[3] Iteration:  300 / 2000 [ 15%]  (Warmup)
[1] Iteration:  300 / 2000 [ 15%]  (Warmup)

Each line is prefixed with MPI proc ID. Every cross_chain_window iterations(see below) convergence test output shows the iteration, window, Rhat, ESS. Test is passed when Rhat<cross_chain_rhat and ESS>cross_chain_ess(see below).

The above run generates 4 output files. With output file=aaa.csv, the MPI output would be mpi.0.aaa.csvmpi.3.aaa.csv, similar to 4 separated regular sampling outputs.

cmdstan arguments

./radon help-all | grep "cross_chain.*=" -A 4 

should show four additional arguments to cmdstan under adapt category

        num_cross_chains=<unsigned int>
          Number of chains in cross-chain warmup iterations
          Valid values: num_cross_chains < number of MPI processes
          Defaults to number of MPI processes

--
        cross_chain_window=<unsigned int>
          Window size for cross-chain warmup
          Valid values: All
          Defaults to 100

--
        cross_chain_rhat=<double>
          Target Rhat for cross-chain warmup
          Valid values: 0.8 < cross_chain_rhat
          Defaults to 1.05

--
        cross_chain_ess=<unsigned int>
          Target ESS for cross-chain warmup
          Valid values: All
          Defaults to 50

In particular

  • Every num_cross_chains iters the convergence test is performed.
  • The number of chains defaults to the nb. of MPI procs. One can specifies it manually by using num_cross_chains but it must be less than or equal to nb. of MPI procs.
  • When warmup is deemed sufficient, each chain’s stepsize would be overwritten with chain average stepsize, and the mass matrix would be re-calculated using draws from all the chains.

Caveats

  • rhat and ESS in the tests are based on lp__ only.
  • Currently ESS is not calculated as in Stan but only based on single chain draws. Cross-chain version is on the way.
  • Currently only implemented for adapt diag metrcs. Dense metrics on the way.

We’d like to hear your comments and feedback!

9 Likes

This is really cool! Looking at the components, how would you want to break up the PRs? i.e. how many PRs and what is grouped together within each of math/stan/cmdstan? It might be nice to abstract the messaging part in the Stan repo so that it would generally work with sequential or with tbb

Honestly I’m not planning that far at this point. Roughly the implementation is broken into

  • math: MPI infrastructure that is independent of map_rect's.
  • stan: nuts samplers inherit a second adapter that deals with cross-chain warmup.
  • cmdstan: arguments.

When I thought about this last time I figured that it would be very interesting to start without any parallelization thing in the way. A few Q:

  • How much do I gain in efficiency by combining different chains? There should be an efficiency gain, but I wanted to know how much there is to gain in order to gauge it it is worth the trouble.
  • Coding up a first implementation without threads / MPI makes it a lot easier to make progress.
  • It would be very good if the adaptive ESS based (as I understand) warmup can be run without any MPI dependency. Requiring people to setup MPI just for the adaptive warmup is not nice, I think.

Skimming over the code it appears to me that MPI things are heavily poured into the code which does make it hard for me to read and I fear that making a thread implementation will start from scratch basically. If you could give it a thought on how to make things modular at some point (so that a thread version is easy to code up) that would be great.

Is your plan to refactor the stan-math MPI sub-system?

1 Like

The biggest expected gain is to have adaptive warmup length. Going from the default 1000+1000 iterations to adaptive+1000 means that there is always less than 50% reduction in the number of iterations, but increased confidence that the warmup is long enough also for hard problems. This can bring much bigger time saving as there is then less often need to re-run with longer adaptation, e.g. we can go from 1000+1000+2000+1000 to adpatation+1000 and get more than 50% reduction in the number of iterations.

The other possible gain is if the chains all mix slowly the combined variance estimate is bigger than within chain variance estimates and then the mass matrix will be adapted more quickly. This can lead to smaller treedepths in the beginning of warmup so that the reduction of total number of leapfrog steps in warmup and actual sampling can be over 50% (although likely not often).

The campfire algorithm includes also automatic selection of mass matrix to be diagonal vs low rank vs dense, which can bring order of magnitude speedup (reduced number of leapfrog steps per iteration) if there are strong posterior correlations.

It sounds as if we definitely want an adaptive warmup without cross-chain communication as an option for users. Thus, no need for MPI or threading here leading to much simpler code and thus a faster adoption in Stan. MPI and threading with parallel chains will require a decent overhall of the stan algorithms library given that you then sample multiple chains and need to output these. This has never been done before.

I could imagine that adaptive warmup of one chain + starting with this warmup result multiple chains can be implemented quickly using our current framework and can land in Stan rather soon.

From your description many problems this adaptive thing is already the main gain. Is that right?

How much is the cross-chain thing giving us in terms of efficiency for various problems? Do we know that?

I’m afraid I’m not able to parse correctly, so I’m guessing what you ask. Yes, campfire is useful and it uses several chains. We have preliminary results, and Yi made an implementation (he working on mpi anyway) so that we can make further experiments more easily (with posteriordb) and to let other people to test this with their favorite examples. Campfire is not going to be merged before much more experiments.

I know that you are super-worried about the technical details of the parallel implementation, but right now I’m just happy that we have some implementation so that we can learn more about the gains and requirements (e.g. amount of between chain communication).

Cool!

1 Like

Sorry if unclear: I am saying that you have two separate things:

  1. adaptive warmup which works on a single chain
  2. cross-chain warmup techniques

Point number 1 is independent of point number 2. The adaptive warmup can be integrated a lot easier in the current services framework of stan as compared to 2. This is why I would suggest to implement an adaptive warmup approach without the necessity for cross-chain communication. This will likely land a lot faster in Stan than doing cross-chain directly.

Moreover, I understood that the adaptive warmup thing will give us already most of the benefits. Cross-chain things - as I understand you - are only useful for more complicated models. This would be one more argument to get the low hanging fruit of adaptive warmup earlier rather than later.

Well, I do not agree with the decision for MPI nor do I understand it - but it’s your choice. However, I would actually strongly recommend you guys to ditch MPI and threading for the moment being. Then you do not need to compromise on code complexity nor on making any compromises for threading or MPI (like limiting the communication to avoid MPI bottlenecks). I would first implement this using a single process only. This is the easiest to program and follow along. Basically you should evaluate the question “How do I get X number of independent samples using a single core in the shortest possible time?” Should we run one adaptive chain? Should we run multiple adaptive ones with cross-chain (which would be run in turn to start with)? Once you know the advantages of cross-chain, you should implement it with some parallelism - ideally with MPI and threading being supported from the start.

This would have been my approach. When I tried cross-chain I did not see obvious mega-speeups, but I could have easily messed it up.

It’s great that you guys drive this forward. I am just sharing my thoughts as to what would lead to a faster inclusion in Stan from my understanding.

2 Likes

Sorry if unclear, but these are not separate. New low-rank is a separate thing, but the adaptive number of iterations is cross-chain.

I remember referring this as prototyping during weekly meeting, intended to collect feeback to evaluate campfire, in particular its efficacy adequacy on warmup. With this in mind I’m blessed with not to give any thought to make it modular.

Does campfire work on a single chain? That’s not how I understand @bbbales2’s work.
EDIT: didn’t notice @avehtari 's comment. 1&2 are both needed in campfire.

It seems we mean different things by cross-chain. For the campfire, the cross-chain is the same same as between-chain in Rhat and ESS, which is not related to parallel implementation. Also please, be specific what you mean by speedup: clock time, ESS/s, less leapfrogs per iteration or less iterations?

There is no sense to include it before we have tested it.

2 Likes

Hmm… so what does campfire do if I start just one chain? Does that make any sense?

I would evaluate all this by keeping the complexity down as much as I can. That would mean for me: Run 1 chain with standard warmup and compute the ESS / wall clock time on a single core for a fixed number of iterations during the sampling phase. Then I compare this to the new thing which I run for a fixed number X of iterations during the sampling phase. As you want to gain from using more chains, I would do this with a pattern of running 1 chain with X number of iterations during sampling, then 2 chains with X/2 number iterations during sampling… and so on. The quantity to compare one another would be ESS per wall clock on a single core. Throwing in multiple cores into all of this makes the comparison just harder as there are more moving parts.

So, the MPI thing you created is overkill from my perspective. You can still use it for the above benchmarks by forcing the MPI processes to share just a single core. I recall that under Linux the respective process restrictions can be enforced with “cgroups”, but I never worked with that (it is possible though).

If you are not worried about efficacy (which I read as small wall time - or large ESS / wall clock time), then why did you go down the route of MPI? Throwing MPI into it makes it look to me as if this is your intended final thing. Now I am a little bit confused. What is are your trying to achieve with all of this? I mean, are we not trying to make warmup a lot faster / more efficient? The current warmup is very robust on my problems - it’s just very, very conservative and as a result of that slow. So what are you looking for?

I definitely do not intend to derailing on MPI or any other between core parallelism - to me this evaluation should be done without multiple cores to keep complexity down. To put it differently: Why do you guys think that we need multi-core capabilities to evaluate the proposed algorithm?

1 Like

As much sense as anytime running just one chain. You have much less information available for diagnostics. So we don’t recommend it and testing with one chain has very low priority.

Our ESS computation uses Rhat which is best when running more than 1 chain. Wall clock time is more complex than iterations and n_leapfrog.

No, we are not yet comparing anything per wall clock time because that is one of the most complex thing possible in computers. We want to keep the complexity down as much as we can, and thus we compare

  • change in number of target evaluations (n_leapfrog) in warmup
  • change in number of iterations in warmup
  • change in number of target evaluations in actual sampling
  • change in number of iterations in actual sampling
  • change in ESS / number of target evaluations in actual sampling
  • change in ESS / iterations in actual sampling

These do not depend on technical implementation issues which may have very big effect on wall clock time. After we understand the behavior for the above quantities, we can fix algorithm details and then start to optimize for wall clock time.

Can we start another thread for MPI vs. others and discuss here about adaptive warmup? For testing whether the proposed warmup is useful it doesn’t matter how the different chains are run.

I don’t care wall clock time at this point. It’s complete distraction at this point.

I feel I’m repeating myself, but

  • change in number of target evaluations (n_leapfrog) in warmup
  • change in number of iterations in warmup
  • change in number of target evaluations in actual sampling
  • change in number of iterations in actual sampling
  • change in ESS / number of target evaluations in actual sampling
  • change in ESS / iterations in actual sampling

It’s great if you are happy with the current warmup, and it will be an option in the future, too. If you are happy with the current warmup, there is not much interesting here for you.

Sorry I’ve have not been clear enough: I think we don’t need multi-core capabilities and I don’t care at this point what the technical implementation is as far as we can test it. I assume the multi-whatever implementation (someone else can decide later what that implementation is) makes it easier to chains to communicate and single core computation could be even more complex, but I don’t care in the context of these tests. We have now something which can be used for testing an algorithm without concerning wall clock time or library dependencies and we can worry about those later.

I was hoping that in this thread we would discuss more about 1) the algorithm, like how to minimize the computation and communication in repeated Rhat and ESS computations, and 2) results from people running experiments with posteriors they know are difficult.

1 Like

Well - the current warmup is slow. I basically endup running very short warmups most of the time. So making this faster is great.

The first post was about “we look for feedback” - that was too unspecific for me. Your list of what you are looking for is making this a lot more concrete and it really helps. Are there R snippets flying around which extract from a CmdStanR fit what you are looking for?

Excellent

Good question. I’ll prepare them but before that @stevebronder, @rok_cesnovar or @jonah, how do I run

mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R

Because if I now do

modelc1 = cmdstan_model("bernoulli_logit_glm_rhs.stan", quiet = FALSE)
fit1 = modelc1$sample(data = data)`

it compiles it correctly but doesn’t run it with mpi.

You could try running the R session with mpiexec. Then in the R script you fire off the MPI… that could work.

cmdstanr currently does not deal with MPI. It might come sooner rather than later https://github.com/stan-dev/cmdstanr/issues/117

1 Like

Just to add on to what Aki said, only multiple chain runs really make sense here. Gotta have multiple chains for Rhat and ESS to be reliable.

Here’s an R code for analysing post-warmup results

mpimodel = cmdstan_model("normal.stan", quiet = FALSE)
datapath = process_data(list(D=2))
system(paste("mpiexec -n 4 --tag-output ./normal sample data file=", datapath, sep=""))
stanfit = rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
leapfrogs = rstan::get_num_leapfrog_per_iteration(stanfit)
mean(leapfrogs)
rstan::monitor(stanfit)

Getting results for warmup (we’d like to get number of iterations and n_leapfrog per iteration) is more difficult. If I add option save_warmup=1 the csv file is not correct

I rstan::read_stan_csv(c("mpi.0.output.csv", "mpi.1.output.csv",  :
  the number of iterations after warmup found (200) does not match iter/warmup/thin from CSV comments (1000,1000,1000,1000)

The number of warmup iterations should be set to the actual adaptation result (ping @yizhang)

cmdstan “read_stan_csv” checks if output iters with predetermined numbering. Let me see if can hack a temp solution.