Cross-chain warmup adaptation using MPI

You could try running the R session with mpiexec. Then in the R script you fire off the MPI… that could work.

cmdstanr currently does not deal with MPI. It might come sooner rather than later; see "MPI execution", Issue #117 in stan-dev/cmdstanr on GitHub.

Just to add on to what Aki said, only multiple chain runs really make sense here. Gotta have multiple chains for Rhat and ESS to be reliable.

Here’s some R code for analysing post-warmup results:

mpimodel = cmdstan_model("normal.stan", quiet = FALSE)
datapath = process_data(list(D=2))
system(paste("mpiexec -n 4 --tag-output ./normal sample data file=", datapath, sep=""))
stanfit = rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
leapfrogs = rstan::get_num_leapfrog_per_iteration(stanfit)
mean(leapfrogs)
rstan::monitor(stanfit)

Getting results for warmup (we’d like to get the number of iterations and n_leapfrog per iteration) is more difficult. If I add the option save_warmup=1, the CSV file is not correct:

Error in rstan::read_stan_csv(c("mpi.0.output.csv", "mpi.1.output.csv",  :
  the number of iterations after warmup found (200) does not match iter/warmup/thin from CSV comments (1000,1000,1000,1000)

The number of warmup iterations should be set to the actual adaptation result (ping @yizhang)

rstan’s read_stan_csv checks whether the output iterations match the predetermined numbering in the CSV comments. Let me see if I can hack a temporary solution.

Just pushed a hacky solution: using awk to replace num_warmup in the CSV with the actual one calculated on the fly. Works on my Mac and Ubuntu; @avehtari, would you give it a try?

Edit: pushed a much nicer solution by @rok_cesnovar

Thanks. That’s helpful.

BTW, in case I stepped on anyone’s toes, sorry for that.

It’s much clearer now to me what the rationale is and most importantly why.

I’d love to play with this; any guidance on how to set it up with cmdstanr?

@yizhang Do I need to do something more than pull the latest commits from branch mpi_warmup_framework and recompile? When I did that, the error changed:

stanfit <- rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
Error in all_int_eq(warmup) : not all are integers

Clone the experimental cmdstan branch:

git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git

Follow the other instructions in the first post about MPI, compilation, and running. If running the radon example from the command line works, then

In R

CMDSTANMPIPATH = "~/.cmdstanr/cmdstanmpi"
set_cmdstan_path(CMDSTANMPIPATH)

and then follow the instructions in the post Cross-chain warmup adaptation using MPI - #19 by avehtari

Right now there is still a problem with easy access to warmup iteration info. After that has been solved I’ll make a new post (or edit older ones) to have all the instructions in one place.

Did you get both the latest cmdstan and stan? You can check this by looking at the new CSV output. The new files should have a new argument, max_num_warmup:

...
#     num_samples = 1000 (Default)
#     max_num_warmup = 1000 (Default)
...

and they print the actual num_warmup when warmup terminates:

...
# num_warmup = 750
# Adaptation terminated
# Step size = 0.0828001
...
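
To pull that value out of each chain's file without opening it by hand, something like the following might do. It is only a sketch: it assumes the comment line has exactly the form shown above (`# num_warmup = 750`), and the function name `actual_num_warmup` is made up for illustration, not part of cmdstan.

```shell
# Print the actual warmup length recorded in one CmdStan CSV file.
# Assumption: the experimental branch writes a comment line of the
# form "# num_warmup = 750", as in the excerpt above.
actual_num_warmup() {
  # Match "# num_warmup" directly after the "#", so the argument line
  # "#     max_num_warmup = 1000 (Default)" is skipped.
  # Fields of the matching line: "#", "num_warmup", "=", "<value>".
  awk '/^#[[:space:]]*num_warmup[[:space:]]*=/ { print $4; exit }' "$1"
}
```

With the four files from the earlier R example, `for f in mpi.*.output.csv; do actual_num_warmup "$f"; done` should print one number per chain. If a file comes from an older build that still writes `num_warmup = 1000 (Default)` in the header, the function would return that configured value instead, so it is worth checking for max_num_warmup first.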

@bbbales2 you mentioned earlier that you had to rebuild cmdstan, is this one of those situations?

t31300-lr010 ~/.cmdstanr/cmdstanmpi % git pull     
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 34 (delta 16), reused 31 (delta 13), pack-reused 0
Unpacking objects: 100% (34/34), done.
From https://github.com/stan-dev/cmdstan
   2d22968..b16d18d  mpi_warmup_framework -> origin/mpi_warmup_framework
   26f0e77..c133f17  develop    -> origin/develop
 * [new branch]      feature/809-stanc-args -> origin/feature/809-stanc-args
Fetching submodule stan
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 28 (delta 17), reused 25 (delta 14), pack-reused 0
Unpacking objects: 100% (28/28), done.
From https://github.com/stan-dev/stan
   a461719..72fe227  develop    -> origin/develop
   57949e3..ad706bc  mpi_warmup_framework -> origin/mpi_warmup_framework
...
# method = sample (Default)
#   sample
#     num_samples = 1000 (Default)
#     max_num_warmup = 1000 (Default)
...
# Adaptation terminated
# Step size = 0.463858
# Diagonal elements of inverse mass matrix:
....

I did

make clean-all
make build -j 4

Somehow num_warmup didn’t get printed. But I just double-checked and it works on both local mac and linux server. Let me dig a bit more.

It works if I clone again. This is now the second time that make clean-all seems to fail.
I’ll make R code example for handling warmup iterations tomorrow.

Did the latest commits change the stdout? Here’s what I get for the radon example (I expected the campfire messages at the end of each window but don’t see any):

[1,0]<stdout>:method = sample (Default)
[1,0]<stdout>:  sample
[1,0]<stdout>:    num_samples = 1000 (Default)
[1,0]<stdout>:    num_warmup = 1000 (Default)
[1,0]<stdout>:    save_warmup = 0 (Default)
[1,0]<stdout>:    thin = 1 (Default)
[1,0]<stdout>:    adapt
[1,0]<stdout>:      engaged = 1 (Default)
[1,0]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,0]<stdout>:      delta = 0.80000000000000004 (Default)
[1,0]<stdout>:      kappa = 0.75 (Default)
[1,0]<stdout>:      t0 = 10 (Default)
[1,0]<stdout>:      init_buffer = 75 (Default)
[1,0]<stdout>:      term_buffer = 50 (Default)
[1,0]<stdout>:      window = 25 (Default)
[1,0]<stdout>:      num_cross_chains = 1 (Default)
[1,0]<stdout>:      cross_chain_window = 100 (Default)
[1,0]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,0]<stdout>:      cross_chain_ess = 50 (Default)
[1,0]<stdout>:    algorithm = hmc (Default)
[1,0]<stdout>:      hmc
[1,0]<stdout>:        engine = nuts (Default)
[1,0]<stdout>:          nuts
[1,0]<stdout>:            max_depth = 10 (Default)
[1,0]<stdout>:        metric = diag_e (Default)
[1,0]<stdout>:        metric_file =  (Default)
[1,0]<stdout>:        stepsize = 1 (Default)
[1,0]<stdout>:        stepsize_jitter = 0 (Default)
[1,0]<stdout>:id = 0 (Default)
[1,0]<stdout>:data
[1,0]<stdout>:  file = radon.data.R
[1,0]<stdout>:init = 2 (Default)
[1,0]<stdout>:random
[1,0]<stdout>:  seed = -1 (Default)
[1,0]<stdout>:output
[1,0]<stdout>:  file = output.csv (Default)
[1,0]<stdout>:  diagnostic_file =  (Default)
[1,0]<stdout>:  refresh = 100 (Default)
[1,0]<stdout>:
[1,1]<stdout>:method = sample (Default)
[1,1]<stdout>:  sample
[1,1]<stdout>:    num_samples = 1000 (Default)
[1,1]<stdout>:    num_warmup = 1000 (Default)
[1,1]<stdout>:    save_warmup = 0 (Default)
[1,1]<stdout>:    thin = 1 (Default)
[1,1]<stdout>:    adapt
[1,1]<stdout>:      engaged = 1 (Default)
[1,1]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,1]<stdout>:      delta = 0.80000000000000004 (Default)
[1,1]<stdout>:      kappa = 0.75 (Default)
[1,1]<stdout>:      t0 = 10 (Default)
[1,1]<stdout>:      init_buffer = 75 (Default)
[1,1]<stdout>:      term_buffer = 50 (Default)
[1,1]<stdout>:      window = 25 (Default)
[1,1]<stdout>:      num_cross_chains = 1 (Default)
[1,1]<stdout>:      cross_chain_window = 100 (Default)
[1,1]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,1]<stdout>:      cross_chain_ess = 50 (Default)
[1,1]<stdout>:    algorithm = hmc (Default)
[1,1]<stdout>:      hmc
[1,1]<stdout>:        engine = nuts (Default)
[1,1]<stdout>:          nuts
[1,1]<stdout>:            max_depth = 10 (Default)
[1,1]<stdout>:        metric = diag_e (Default)
[1,1]<stdout>:        metric_file =  (Default)
[1,1]<stdout>:        stepsize = 1 (Default)
[1,1]<stdout>:        stepsize_jitter = 0 (Default)
[1,1]<stdout>:id = 0 (Default)
[1,1]<stdout>:data
[1,1]<stdout>:  file = radon.data.R
[1,1]<stdout>:init = 2 (Default)
[1,1]<stdout>:random
[1,1]<stdout>:  seed = -1 (Default)
[1,1]<stdout>:output
[1,1]<stdout>:  file = output.csv (Default)
[1,1]<stdout>:  diagnostic_file =  (Default)
[1,1]<stdout>:  refresh = 100 (Default)
[1,1]<stdout>:
[1,2]<stdout>:method = sample (Default)
[1,2]<stdout>:  sample
[1,2]<stdout>:    num_samples = 1000 (Default)
[1,2]<stdout>:    num_warmup = 1000 (Default)
[1,2]<stdout>:    save_warmup = 0 (Default)
[1,2]<stdout>:    thin = 1 (Default)
[1,2]<stdout>:    adapt
[1,2]<stdout>:      engaged = 1 (Default)
[1,2]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,2]<stdout>:      delta = 0.80000000000000004 (Default)
[1,2]<stdout>:      kappa = 0.75 (Default)
[1,2]<stdout>:      t0 = 10 (Default)
[1,2]<stdout>:      init_buffer = 75 (Default)
[1,2]<stdout>:      term_buffer = 50 (Default)
[1,2]<stdout>:      window = 25 (Default)
[1,2]<stdout>:      num_cross_chains = 1 (Default)
[1,2]<stdout>:      cross_chain_window = 100 (Default)
[1,2]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,2]<stdout>:      cross_chain_ess = 50 (Default)
[1,2]<stdout>:    algorithm = hmc (Default)
[1,2]<stdout>:      hmc
[1,2]<stdout>:        engine = nuts (Default)
[1,2]<stdout>:          nuts
[1,2]<stdout>:            max_depth = 10 (Default)
[1,2]<stdout>:        metric = diag_e (Default)
[1,2]<stdout>:        metric_file =  (Default)
[1,2]<stdout>:        stepsize = 1 (Default)
[1,2]<stdout>:        stepsize_jitter = 0 (Default)
[1,2]<stdout>:id = 0 (Default)
[1,2]<stdout>:data
[1,2]<stdout>:  file = radon.data.R
[1,2]<stdout>:init = 2 (Default)
[1,2]<stdout>:random
[1,2]<stdout>:  seed = -1 (Default)
[1,2]<stdout>:output
[1,2]<stdout>:  file = output.csv (Default)
[1,2]<stdout>:  diagnostic_file =  (Default)
[1,2]<stdout>:  refresh = 100 (Default)
[1,2]<stdout>:
[1,3]<stdout>:method = sample (Default)
[1,3]<stdout>:  sample
[1,3]<stdout>:    num_samples = 1000 (Default)
[1,3]<stdout>:    num_warmup = 1000 (Default)
[1,3]<stdout>:    save_warmup = 0 (Default)
[1,3]<stdout>:    thin = 1 (Default)
[1,3]<stdout>:    adapt
[1,3]<stdout>:      engaged = 1 (Default)
[1,3]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,3]<stdout>:      delta = 0.80000000000000004 (Default)
[1,3]<stdout>:      kappa = 0.75 (Default)
[1,3]<stdout>:      t0 = 10 (Default)
[1,3]<stdout>:      init_buffer = 75 (Default)
[1,3]<stdout>:      term_buffer = 50 (Default)
[1,3]<stdout>:      window = 25 (Default)
[1,3]<stdout>:      num_cross_chains = 1 (Default)
[1,3]<stdout>:      cross_chain_window = 100 (Default)
[1,3]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,3]<stdout>:      cross_chain_ess = 50 (Default)
[1,3]<stdout>:    algorithm = hmc (Default)
[1,3]<stdout>:      hmc
[1,3]<stdout>:        engine = nuts (Default)
[1,3]<stdout>:          nuts
[1,3]<stdout>:            max_depth = 10 (Default)
[1,3]<stdout>:        metric = diag_e (Default)
[1,3]<stdout>:        metric_file =  (Default)
[1,3]<stdout>:        stepsize = 1 (Default)
[1,3]<stdout>:        stepsize_jitter = 0 (Default)
[1,3]<stdout>:id = 0 (Default)
[1,3]<stdout>:data
[1,3]<stdout>:  file = radon.data.R
[1,3]<stdout>:init = 2 (Default)
[1,3]<stdout>:random
[1,3]<stdout>:  seed = -1 (Default)
[1,3]<stdout>:output
[1,3]<stdout>:  file = output.csv (Default)
[1,3]<stdout>:  diagnostic_file =  (Default)
[1,3]<stdout>:  refresh = 100 (Default)
[1,3]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>:Gradient evaluation took 0.00114 seconds
[1,0]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 11.4 seconds.
[1,0]<stdout>:Adjust your expectations accordingly!
[1,0]<stdout>:
[1,0]<stdout>:
[1,2]<stdout>:
[1,2]<stdout>:Gradient evaluation took 0.000936 seconds
[1,2]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 9.36 seconds.
[1,2]<stdout>:Adjust your expectations accordingly!
[1,2]<stdout>:
[1,2]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:Gradient evaluation took 0.001084 seconds
[1,1]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 10.84 seconds.
[1,1]<stdout>:Adjust your expectations accordingly!
[1,1]<stdout>:
[1,1]<stdout>:
[1,3]<stdout>:
[1,3]<stdout>:Gradient evaluation took 0.000926 seconds
[1,3]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 9.26 seconds.
[1,3]<stdout>:Adjust your expectations accordingly!
[1,3]<stdout>:
[1,3]<stdout>:
[1,0]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,2]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,1]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,3]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,2]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,1]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,2]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,1]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,2]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,1]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,0]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,2]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,3]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,1]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,2]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,0]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,1]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,3]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,2]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,0]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,1]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,3]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,2]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,0]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,1]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,3]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,2]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,0]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,1]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,3]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,2]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,0]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,1]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,3]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,2]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,2]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,0]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,1]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,1]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,3]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,2]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,0]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,1]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,3]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,2]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,0]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,3]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,1]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,2]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,0]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,0]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,1]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,3]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,3]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,2]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,0]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,1]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,3]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,2]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,0]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,1]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,3]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,2]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,0]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,1]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,3]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,2]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,0]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,1]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,3]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,2]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,0]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,1]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,3]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,0]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,2]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,1]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,3]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,0]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,2]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,2]<stdout>:
[1,2]<stdout>: Elapsed Time: 20.1273 seconds (Warm-up)
[1,2]<stdout>:               9.63335 seconds (Sampling)
[1,2]<stdout>:               29.7606 seconds (Total)
[1,2]<stdout>:
[1,1]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,3]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,1]<stdout>:
[1,1]<stdout>: Elapsed Time: 20.5619 seconds (Warm-up)
[1,1]<stdout>:               9.40207 seconds (Sampling)
[1,1]<stdout>:               29.9639 seconds (Total)
[1,1]<stdout>:
[1,0]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,3]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,0]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,3]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,0]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,0]<stdout>:
[1,0]<stdout>: Elapsed Time: 23.0489 seconds (Warm-up)
[1,0]<stdout>:               7.9792 seconds (Sampling)
[1,0]<stdout>:               31.0281 seconds (Total)
[1,0]<stdout>:
[1,3]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,3]<stdout>:
[1,3]<stdout>: Elapsed Time: 23.4612 seconds (Warm-up)
[1,3]<stdout>:               7.70702 seconds (Sampling)
[1,3]<stdout>:               31.1683 seconds (Total)
[1,3]<stdout>:

Likely there’s some problem with your compilation. It seems you are running a sequential version. The latest commits should give you max_num_warmup = 1000 (Default) instead of num_warmup = 1000 (Default), and you’ll see cross-chain window adaptation in stdout. If there’s MPI_ADAPTED_WARMUP=1 in your make/local, check in the compilation stdout that the compiler is mpicxx rather than g++ or clang++.
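
A quick way to tell the two cases apart from a saved run log is to look for the renamed argument. This is a rough sketch; the function name `uses_cross_chain_warmup` and the log file name in the comment are invented for illustration.

```shell
# Succeed if a captured run log (read from stdin) lists the
# max_num_warmup argument, which per the post above only the
# cross-chain warmup build prints. A sequential build prints plain
# "num_warmup = ..." instead, which this pattern does not match.
uses_cross_chain_warmup() {
  grep -q 'max_num_warmup'
}

# Possible usage (run.log is a hypothetical file name):
#   mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R | tee run.log
#   uses_cross_chain_warmup < run.log && echo "cross-chain build" || echo "sequential build"
```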

Hm. I tried going back to the beginning of my install script, and now I don’t get any output on the stdout when running the model and the radon processes are pinning my cpu at max. Below is my install script (ubuntu 19.10); see anything obviously wrong?

#install openmpi
sudo apt install libopenmpi-dev #installs headers at `/usr/lib/x86_64-linux-gnu/openmpi/include`

#clone the campfire branch
git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git cmdstan_campfire

#navigate into the repo
cd cmdstan_campfire

#add make/local
cat <<EOT >> make/local
LDLIBS+=-lpthread
STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=gcc
MPI_ADAPTED_WARMUP = 1
CXXFLAGS += -isystem /usr/lib/x86_64-linux-gnu/openmpi/include
EOT

#clean and build cmdstan stuff
make clean-all
make build -j $(nproc)

#make & run the radon example
make examples/radon/radon
cd examples/radon
mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R

And in case it helps, here’s the output from make examples/radon/radon:

--- Compiling, linking C++ code ---
mpicxx -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -DSTAN_LANG_MPI -DMPI_ADAPTED_WARMUP -std=c++1y -D_REENTRANT -Wno-sign-compare    -Wno-delete-non-virtual-dtor -I stan/lib/stan_math/lib/tbb_2019_U8/include -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include    -DBOOST_DISABLE_ASSERTS   -DSTAN_MPI   -c  -x c++ -o examples/radon/radon.o examples/radon/radon.hpp
mpicxx -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -DSTAN_LANG_MPI -DMPI_ADAPTED_WARMUP -std=c++1y -D_REENTRANT -Wno-sign-compare    -Wno-delete-non-virtual-dtor -I stan/lib/stan_math/lib/tbb_2019_U8/include -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include    -DBOOST_DISABLE_ASSERTS   -DSTAN_MPI        -Wl,-L,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/boost_1.69.0/stage/lib" -Wl,-rpath,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/boost_1.69.0/stage/lib" -Wl,-L,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/tbb"  examples/radon/radon.o src/cmdstan/main.o -lpthread         stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_idas.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_kinsol.a stan/lib/stan_math/lib/boost_1.69.0/stage/lib/libboost_serialization.so stan/lib/stan_math/lib/boost_1.69.0/stage/lib/libboost_mpi.so stan/lib/stan_math/stan/math/prim/arr/functor/mpi_cluster_inst.o stan/lib/stan_math/lib/tbb/libtbb.so.2 -o examples/radon/radon

You cannot use STAN_MPI. Sorry I wasn’t clear. MPI_ADAPTED_WARMUP = 1 would suffice.
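
For anyone else hitting this, a small check of make/local along those lines might look like the sketch below. The function name `check_make_local` is hypothetical, and it only tests the two settings discussed here, nothing else about the build.

```shell
# Check a make/local file against the advice above: STAN_MPI must not
# be set, and MPI_ADAPTED_WARMUP must be set to 1.
# Prints a hint for each problem and returns non-zero if any is found.
check_make_local() {
  cml_file="$1"
  cml_status=0
  if grep -Eq '^[[:space:]]*STAN_MPI[[:space:]]*=' "$cml_file"; then
    echo "remove STAN_MPI from $cml_file"
    cml_status=1
  fi
  if ! grep -Eq '^[[:space:]]*MPI_ADAPTED_WARMUP[[:space:]]*=[[:space:]]*1' "$cml_file"; then
    echo "add MPI_ADAPTED_WARMUP = 1 to $cml_file"
    cml_status=1
  fi
  return "$cml_status"
}
```

Running `check_make_local make/local` from the cmdstan directory before `make build` should catch the configuration in the install script above.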

There’s a longer version of this answer but let’s not diverge.

Thanks! Works great now. I notice that when I use 4 chains, it takes 2 windows (200 iterations) to warm up on the radon example, but when I use 6, it only takes 1 window (100 iterations). Presumably this is expected given more info from more chains?

Also, I’m on a 6-core hyperthreading CPU, so I thought I could go up to 12 chains, but when I try for 7 or greater I get an error like "There are not enough slots available in the system to satisfy the 12 slots that were requested by the application." Is it expected that the number of chains is limited by the physical core count and not the logical core count?

Apparently yes, this is the default behaviour for MPI. To enable using logical cores rather than physical, the --use-hwthread-cpus argument is needed:

mpiexec -n $(nproc) --use-hwthread-cpus --tag-output ./radon sample data file=radon.data.R

But this ends up making things even slower. So I guess the default restriction to physical cores is there for a reason.
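
If the goal is simply one chain per physical core, the count can be derived instead of hard-coded. The sketch below assumes Linux with `lscpu` from util-linux available; the helper name `count_physical_cores` is made up for illustration.

```shell
# Count physical cores from the output of `lscpu -p=CORE,SOCKET`,
# read from stdin. Each logical CPU yields one "core,socket" line, so
# hyperthreads of the same core show up as duplicate lines.
count_physical_cores() {
  # Drop the "#" header lines, collapse duplicates, count what is left.
  grep -v '^#' | sort -u | wc -l | tr -d ' '
}

# Possible usage:
#   mpiexec -n "$(lscpu -p=CORE,SOCKET | count_physical_cores)" \
#     --tag-output ./radon sample data file=radon.data.R
```

On a 6-core machine with hyperthreading this should give 6 rather than the 12 reported by nproc.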