Cross-chain warmup adaptation using MPI

yizhang · February 4, 2020, 9:08am

Just pushed a hacky solution: using awk to replace num_warmup in the csv with the actual one calculated on the fly. Works on my Mac and Ubuntu; @avehtari, would you give it a try?

Edit: pushed a much nicer solution by @rok_cesnovar

wds15 · February 4, 2020, 3:48pm

Thanks. That’s helpful.

BTW, in case I stepped on people’s toes, sorry for that.

It’s much clearer now to me what the rationale is and most importantly why.

mike-lawrence · February 5, 2020, 3:22pm

I’d love to play with this; any guidance on how to set it up with cmdstanr?

avehtari · February 5, 2020, 3:41pm

@yizhang Do I need to do something more than pull the latest commits from branch mpi_warmup_framework and recompile? When I did that, the error changed:

stanfit <- rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
Error in all_int_eq(warmup) : not all are integers

avehtari · February 5, 2020, 3:47pm

Clone the experimental cmdstan branch:

git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git

Follow the other instructions in the first post about MPI, compilation, and running. If running the radon example from the command line works, then

In R

CMDSTANMPIPATH = "~/.cmdstanr/cmdstanmpi"
set_cmdstan_path(CMDSTANMPIPATH)

and then follow the instructions in post Cross-chain warmup adaptation using MPI - #19 by avehtari

Right now there is still a problem with easy access to warmup iteration info, and after that has been solved I’ll make a new post (or edit older ones) to have all instructions in one place.

yizhang · February 5, 2020, 3:55pm

Did you get both the latest cmdstan & stan? You can check this by looking at the new csv output. The new files should have the new argument max_num_warmup

...
#     num_samples = 1000 (Default)
#     max_num_warmup = 1000 (Default)
...

and print the actual num_warmup when warmup terminates

...
# num_warmup = 750
# Adaptation terminated
# Step size = 0.0828001
...

@bbbales2 you mentioned earlier that you had to rebuild cmdstan, is this one of those situations?
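
A minimal sketch of reading that value back from the shell, assuming the CSV comment lines look exactly like the excerpts above (`# num_warmup = 750` written once warmup terminates, and `#     max_num_warmup = 1000 (Default)` in the header). The function name `actual_num_warmup` is hypothetical and not part of cmdstan.

```shell
#!/bin/sh
# Hypothetical helper: print the actual number of warmup iterations recorded
# in a CSV file, assuming a comment line of the form "# num_warmup = 750".
# The pattern does not match "max_num_warmup", so the header argument is skipped.
# Caveat: a file from an older build has "num_warmup = 1000 (Default)" in the
# header, and this function would then report that configured value instead.
actual_num_warmup() {
  awk '/^#[ \t]*num_warmup[ \t]*=/ { print $4; exit }' "$1"
}

# Usage (file name taken from the thread):
#   actual_num_warmup mpi.0.output.csv
```

If the function prints nothing, the file has no num_warmup line at all, which matches the symptom reported below where only "Adaptation terminated" shows up.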

avehtari · February 5, 2020, 4:11pm

t31300-lr010 ~/.cmdstanr/cmdstanmpi % git pull     
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 34 (delta 16), reused 31 (delta 13), pack-reused 0
Unpacking objects: 100% (34/34), done.
From https://github.com/stan-dev/cmdstan
   2d22968..b16d18d  mpi_warmup_framework -> origin/mpi_warmup_framework
   26f0e77..c133f17  develop    -> origin/develop
 * [new branch]      feature/809-stanc-args -> origin/feature/809-stanc-args
Fetching submodule stan
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 28 (delta 17), reused 25 (delta 14), pack-reused 0
Unpacking objects: 100% (28/28), done.
From https://github.com/stan-dev/stan
   a461719..72fe227  develop    -> origin/develop
   57949e3..ad706bc  mpi_warmup_framework -> origin/mpi_warmup_framework

...
# method = sample (Default)
#   sample
#     num_samples = 1000 (Default)
#     max_num_warmup = 1000 (Default)
...
# Adaptation terminated
# Step size = 0.463858
# Diagonal elements of inverse mass matrix:
....

I did

make clean-all
make build -j 4

yizhang · February 5, 2020, 4:58pm

Somehow num_warmup didn’t get printed. But I just double-checked and it works on both local mac and linux server. Let me dig a bit more.

avehtari · February 5, 2020, 5:48pm

It works if I clone again. This is now the second time that make clean-all seems to fail.
I’ll make an R code example for handling warmup iterations tomorrow.

mike-lawrence · February 5, 2020, 8:29pm

Did the latest commits change the stdout? Here’s what I get for the radon example (I expected the campfire messages at the end of each window but don’t see any):

[1,0]<stdout>:method = sample (Default)
[1,0]<stdout>:  sample
[1,0]<stdout>:    num_samples = 1000 (Default)
[1,0]<stdout>:    num_warmup = 1000 (Default)
[1,0]<stdout>:    save_warmup = 0 (Default)
[1,0]<stdout>:    thin = 1 (Default)
[1,0]<stdout>:    adapt
[1,0]<stdout>:      engaged = 1 (Default)
[1,0]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,0]<stdout>:      delta = 0.80000000000000004 (Default)
[1,0]<stdout>:      kappa = 0.75 (Default)
[1,0]<stdout>:      t0 = 10 (Default)
[1,0]<stdout>:      init_buffer = 75 (Default)
[1,0]<stdout>:      term_buffer = 50 (Default)
[1,0]<stdout>:      window = 25 (Default)
[1,0]<stdout>:      num_cross_chains = 1 (Default)
[1,0]<stdout>:      cross_chain_window = 100 (Default)
[1,0]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,0]<stdout>:      cross_chain_ess = 50 (Default)
[1,0]<stdout>:    algorithm = hmc (Default)
[1,0]<stdout>:      hmc
[1,0]<stdout>:        engine = nuts (Default)
[1,0]<stdout>:          nuts
[1,0]<stdout>:            max_depth = 10 (Default)
[1,0]<stdout>:        metric = diag_e (Default)
[1,0]<stdout>:        metric_file =  (Default)
[1,0]<stdout>:        stepsize = 1 (Default)
[1,0]<stdout>:        stepsize_jitter = 0 (Default)
[1,0]<stdout>:id = 0 (Default)
[1,0]<stdout>:data
[1,0]<stdout>:  file = radon.data.R
[1,0]<stdout>:init = 2 (Default)
[1,0]<stdout>:random
[1,0]<stdout>:  seed = -1 (Default)
[1,0]<stdout>:output
[1,0]<stdout>:  file = output.csv (Default)
[1,0]<stdout>:  diagnostic_file =  (Default)
[1,0]<stdout>:  refresh = 100 (Default)
[1,0]<stdout>:
[1,1]<stdout>:method = sample (Default)
[1,1]<stdout>:  sample
[1,1]<stdout>:    num_samples = 1000 (Default)
[1,1]<stdout>:    num_warmup = 1000 (Default)
[1,1]<stdout>:    save_warmup = 0 (Default)
[1,1]<stdout>:    thin = 1 (Default)
[1,1]<stdout>:    adapt
[1,1]<stdout>:      engaged = 1 (Default)
[1,1]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,1]<stdout>:      delta = 0.80000000000000004 (Default)
[1,1]<stdout>:      kappa = 0.75 (Default)
[1,1]<stdout>:      t0 = 10 (Default)
[1,1]<stdout>:      init_buffer = 75 (Default)
[1,1]<stdout>:      term_buffer = 50 (Default)
[1,1]<stdout>:      window = 25 (Default)
[1,1]<stdout>:      num_cross_chains = 1 (Default)
[1,1]<stdout>:      cross_chain_window = 100 (Default)
[1,1]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,1]<stdout>:      cross_chain_ess = 50 (Default)
[1,1]<stdout>:    algorithm = hmc (Default)
[1,1]<stdout>:      hmc
[1,1]<stdout>:        engine = nuts (Default)
[1,1]<stdout>:          nuts
[1,1]<stdout>:            max_depth = 10 (Default)
[1,1]<stdout>:        metric = diag_e (Default)
[1,1]<stdout>:        metric_file =  (Default)
[1,1]<stdout>:        stepsize = 1 (Default)
[1,1]<stdout>:        stepsize_jitter = 0 (Default)
[1,1]<stdout>:id = 0 (Default)
[1,1]<stdout>:data
[1,1]<stdout>:  file = radon.data.R
[1,1]<stdout>:init = 2 (Default)
[1,1]<stdout>:random
[1,1]<stdout>:  seed = -1 (Default)
[1,1]<stdout>:output
[1,1]<stdout>:  file = output.csv (Default)
[1,1]<stdout>:  diagnostic_file =  (Default)
[1,1]<stdout>:  refresh = 100 (Default)
[1,1]<stdout>:
[1,2]<stdout>:method = sample (Default)
[1,2]<stdout>:  sample
[1,2]<stdout>:    num_samples = 1000 (Default)
[1,2]<stdout>:    num_warmup = 1000 (Default)
[1,2]<stdout>:    save_warmup = 0 (Default)
[1,2]<stdout>:    thin = 1 (Default)
[1,2]<stdout>:    adapt
[1,2]<stdout>:      engaged = 1 (Default)
[1,2]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,2]<stdout>:      delta = 0.80000000000000004 (Default)
[1,2]<stdout>:      kappa = 0.75 (Default)
[1,2]<stdout>:      t0 = 10 (Default)
[1,2]<stdout>:      init_buffer = 75 (Default)
[1,2]<stdout>:      term_buffer = 50 (Default)
[1,2]<stdout>:      window = 25 (Default)
[1,2]<stdout>:      num_cross_chains = 1 (Default)
[1,2]<stdout>:      cross_chain_window = 100 (Default)
[1,2]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,2]<stdout>:      cross_chain_ess = 50 (Default)
[1,2]<stdout>:    algorithm = hmc (Default)
[1,2]<stdout>:      hmc
[1,2]<stdout>:        engine = nuts (Default)
[1,2]<stdout>:          nuts
[1,2]<stdout>:            max_depth = 10 (Default)
[1,2]<stdout>:        metric = diag_e (Default)
[1,2]<stdout>:        metric_file =  (Default)
[1,2]<stdout>:        stepsize = 1 (Default)
[1,2]<stdout>:        stepsize_jitter = 0 (Default)
[1,2]<stdout>:id = 0 (Default)
[1,2]<stdout>:data
[1,2]<stdout>:  file = radon.data.R
[1,2]<stdout>:init = 2 (Default)
[1,2]<stdout>:random
[1,2]<stdout>:  seed = -1 (Default)
[1,2]<stdout>:output
[1,2]<stdout>:  file = output.csv (Default)
[1,2]<stdout>:  diagnostic_file =  (Default)
[1,2]<stdout>:  refresh = 100 (Default)
[1,2]<stdout>:
[1,3]<stdout>:method = sample (Default)
[1,3]<stdout>:  sample
[1,3]<stdout>:    num_samples = 1000 (Default)
[1,3]<stdout>:    num_warmup = 1000 (Default)
[1,3]<stdout>:    save_warmup = 0 (Default)
[1,3]<stdout>:    thin = 1 (Default)
[1,3]<stdout>:    adapt
[1,3]<stdout>:      engaged = 1 (Default)
[1,3]<stdout>:      gamma = 0.050000000000000003 (Default)
[1,3]<stdout>:      delta = 0.80000000000000004 (Default)
[1,3]<stdout>:      kappa = 0.75 (Default)
[1,3]<stdout>:      t0 = 10 (Default)
[1,3]<stdout>:      init_buffer = 75 (Default)
[1,3]<stdout>:      term_buffer = 50 (Default)
[1,3]<stdout>:      window = 25 (Default)
[1,3]<stdout>:      num_cross_chains = 1 (Default)
[1,3]<stdout>:      cross_chain_window = 100 (Default)
[1,3]<stdout>:      cross_chain_rhat = 1.05 (Default)
[1,3]<stdout>:      cross_chain_ess = 50 (Default)
[1,3]<stdout>:    algorithm = hmc (Default)
[1,3]<stdout>:      hmc
[1,3]<stdout>:        engine = nuts (Default)
[1,3]<stdout>:          nuts
[1,3]<stdout>:            max_depth = 10 (Default)
[1,3]<stdout>:        metric = diag_e (Default)
[1,3]<stdout>:        metric_file =  (Default)
[1,3]<stdout>:        stepsize = 1 (Default)
[1,3]<stdout>:        stepsize_jitter = 0 (Default)
[1,3]<stdout>:id = 0 (Default)
[1,3]<stdout>:data
[1,3]<stdout>:  file = radon.data.R
[1,3]<stdout>:init = 2 (Default)
[1,3]<stdout>:random
[1,3]<stdout>:  seed = -1 (Default)
[1,3]<stdout>:output
[1,3]<stdout>:  file = output.csv (Default)
[1,3]<stdout>:  diagnostic_file =  (Default)
[1,3]<stdout>:  refresh = 100 (Default)
[1,3]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>:Gradient evaluation took 0.00114 seconds
[1,0]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 11.4 seconds.
[1,0]<stdout>:Adjust your expectations accordingly!
[1,0]<stdout>:
[1,0]<stdout>:
[1,2]<stdout>:
[1,2]<stdout>:Gradient evaluation took 0.000936 seconds
[1,2]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 9.36 seconds.
[1,2]<stdout>:Adjust your expectations accordingly!
[1,2]<stdout>:
[1,2]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:Gradient evaluation took 0.001084 seconds
[1,1]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 10.84 seconds.
[1,1]<stdout>:Adjust your expectations accordingly!
[1,1]<stdout>:
[1,1]<stdout>:
[1,3]<stdout>:
[1,3]<stdout>:Gradient evaluation took 0.000926 seconds
[1,3]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 9.26 seconds.
[1,3]<stdout>:Adjust your expectations accordingly!
[1,3]<stdout>:
[1,3]<stdout>:
[1,0]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,2]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,1]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,3]<stdout>:Iteration:    1 / 2000 [  0%]  (Warmup)
[1,2]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,1]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,2]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,1]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,2]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,1]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,0]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,2]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,3]<stdout>:Iteration:  100 / 2000 [  5%]  (Warmup)
[1,1]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,2]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,0]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,1]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,3]<stdout>:Iteration:  200 / 2000 [ 10%]  (Warmup)
[1,2]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,0]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,1]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,3]<stdout>:Iteration:  300 / 2000 [ 15%]  (Warmup)
[1,2]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,0]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,1]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,3]<stdout>:Iteration:  400 / 2000 [ 20%]  (Warmup)
[1,2]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,0]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,1]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,3]<stdout>:Iteration:  500 / 2000 [ 25%]  (Warmup)
[1,2]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,0]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,1]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,3]<stdout>:Iteration:  600 / 2000 [ 30%]  (Warmup)
[1,2]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,2]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,0]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,1]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,1]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,3]<stdout>:Iteration:  700 / 2000 [ 35%]  (Warmup)
[1,2]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,0]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,1]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,3]<stdout>:Iteration:  800 / 2000 [ 40%]  (Warmup)
[1,2]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,0]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,3]<stdout>:Iteration:  900 / 2000 [ 45%]  (Warmup)
[1,1]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,2]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,0]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,0]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,1]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,3]<stdout>:Iteration: 1000 / 2000 [ 50%]  (Warmup)
[1,3]<stdout>:Iteration: 1001 / 2000 [ 50%]  (Sampling)
[1,2]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,0]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,1]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,3]<stdout>:Iteration: 1100 / 2000 [ 55%]  (Sampling)
[1,2]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,0]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,1]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,3]<stdout>:Iteration: 1200 / 2000 [ 60%]  (Sampling)
[1,2]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,0]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,1]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,3]<stdout>:Iteration: 1300 / 2000 [ 65%]  (Sampling)
[1,2]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,0]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,1]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,3]<stdout>:Iteration: 1400 / 2000 [ 70%]  (Sampling)
[1,2]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,0]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,1]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,3]<stdout>:Iteration: 1500 / 2000 [ 75%]  (Sampling)
[1,0]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,2]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,1]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,3]<stdout>:Iteration: 1600 / 2000 [ 80%]  (Sampling)
[1,0]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,2]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,2]<stdout>:
[1,2]<stdout>: Elapsed Time: 20.1273 seconds (Warm-up)
[1,2]<stdout>:               9.63335 seconds (Sampling)
[1,2]<stdout>:               29.7606 seconds (Total)
[1,2]<stdout>:
[1,1]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,3]<stdout>:Iteration: 1700 / 2000 [ 85%]  (Sampling)
[1,1]<stdout>:
[1,1]<stdout>: Elapsed Time: 20.5619 seconds (Warm-up)
[1,1]<stdout>:               9.40207 seconds (Sampling)
[1,1]<stdout>:               29.9639 seconds (Total)
[1,1]<stdout>:
[1,0]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,3]<stdout>:Iteration: 1800 / 2000 [ 90%]  (Sampling)
[1,0]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,3]<stdout>:Iteration: 1900 / 2000 [ 95%]  (Sampling)
[1,0]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,0]<stdout>:
[1,0]<stdout>: Elapsed Time: 23.0489 seconds (Warm-up)
[1,0]<stdout>:               7.9792 seconds (Sampling)
[1,0]<stdout>:               31.0281 seconds (Total)
[1,0]<stdout>:
[1,3]<stdout>:Iteration: 2000 / 2000 [100%]  (Sampling)
[1,3]<stdout>:
[1,3]<stdout>: Elapsed Time: 23.4612 seconds (Warm-up)
[1,3]<stdout>:               7.70702 seconds (Sampling)
[1,3]<stdout>:               31.1683 seconds (Total)
[1,3]<stdout>:

yizhang · February 5, 2020, 8:37pm

There’s likely some problem with your compilation; it seems you are running a sequential version. The latest commits should give you max_num_warmup = 1000 (Default) instead of num_warmup = 1000 (Default), and you’ll see cross-chain window adaptation in stdout. If there’s MPI_ADAPTED_WARMUP=1 in your make/local, check in the compilation stdout whether the compiler is mpicxx instead of g++ or clang++.
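
A minimal sketch of that check, assuming the run's stdout has been captured to a file and that the argument dump looks like the listings in this thread. The function name `build_kind` and the log file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify a captured stdout log by which warmup
# argument the argument dump contains.
#   max_num_warmup present  -> binary built with the cross-chain warmup code
#   only num_warmup present -> sequential (classic) build
build_kind() {
  if grep -q 'max_num_warmup *=' "$1"; then
    echo "cross-chain"
  elif grep -q 'num_warmup *=' "$1"; then
    echo "sequential"
  else
    echo "unknown"
  fi
}

# Usage:
#   mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R > run.log
#   build_kind run.log
```

The max_num_warmup test comes first because the plain num_warmup pattern would also match it as a substring.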

mike-lawrence · February 5, 2020, 9:09pm

Hm. I tried going back to the beginning of my install script, and now I don’t get any output on the stdout when running the model and the radon processes are pinning my cpu at max. Below is my install script (ubuntu 19.10); see anything obviously wrong?

#install openmpi
sudo apt install libopenmpi-dev #installs headers at `/usr/lib/x86_64-linux-gnu/openmpi/include`

#clone the campfire branch
git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git cmdstan_campfire

#navigate into the repo
cd cmdstan_campfire

#add make/local
cat <<EOT >> make/local
LDLIBS+=-lpthread
STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=gcc
MPI_ADAPTED_WARMUP = 1
CXXFLAGS += -isystem /usr/lib/x86_64-linux-gnu/openmpi/include
EOT

#clean and build cmdstan stuff
make clean-all
make build -j $(nproc)

#make & run the radon example
make examples/radon/radon
cd examples/radon
mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R

And in case it helps, here’s the output from make examples/radon/radon:

--- Compiling, linking C++ code ---
mpicxx -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -DSTAN_LANG_MPI -DMPI_ADAPTED_WARMUP -std=c++1y -D_REENTRANT -Wno-sign-compare    -Wno-delete-non-virtual-dtor -I stan/lib/stan_math/lib/tbb_2019_U8/include -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include    -DBOOST_DISABLE_ASSERTS   -DSTAN_MPI   -c  -x c++ -o examples/radon/radon.o examples/radon/radon.hpp
mpicxx -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -DSTAN_LANG_MPI -DMPI_ADAPTED_WARMUP -std=c++1y -D_REENTRANT -Wno-sign-compare    -Wno-delete-non-virtual-dtor -I stan/lib/stan_math/lib/tbb_2019_U8/include -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include    -DBOOST_DISABLE_ASSERTS   -DSTAN_MPI        -Wl,-L,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/boost_1.69.0/stage/lib" -Wl,-rpath,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/boost_1.69.0/stage/lib" -Wl,-L,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/tbb"  examples/radon/radon.o src/cmdstan/main.o -lpthread         stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_idas.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_kinsol.a stan/lib/stan_math/lib/boost_1.69.0/stage/lib/libboost_serialization.so stan/lib/stan_math/lib/boost_1.69.0/stage/lib/libboost_mpi.so stan/lib/stan_math/stan/math/prim/arr/functor/mpi_cluster_inst.o stan/lib/stan_math/lib/tbb/libtbb.so.2 -o examples/radon/radon

yizhang · February 5, 2020, 9:12pm

You cannot use STAN_MPI. Sorry I wasn’t clear. MPI_ADAPTED_WARMUP = 1 would suffice.

There’s a longer version of this answer but let’s not diverge.
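
A minimal sketch of a sanity check for make/local based on this advice, assuming the file uses plain `NAME = value` make assignments as in the install script above. The function name `check_make_local` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify that a make/local file enables the cross-chain
# warmup build (MPI_ADAPTED_WARMUP = 1) and does not set STAN_MPI.
# Prints one message per problem and returns non-zero if any was found.
check_make_local() {
  rc=0
  # Accept "=", "+=", "?=" and ":=" style assignments of STAN_MPI.
  if grep -q '^ *STAN_MPI *[?+:]*=' "$1"; then
    echo "remove STAN_MPI from $1"
    rc=1
  fi
  if ! grep -q '^ *MPI_ADAPTED_WARMUP *= *1' "$1"; then
    echo "add MPI_ADAPTED_WARMUP = 1 to $1"
    rc=1
  fi
  return $rc
}

# Usage:
#   check_make_local make/local && make clean-all && make build -j 4
```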

mike-lawrence · February 5, 2020, 9:30pm

Thanks! Works great now. I notice that when I use 4 chains, it takes 2 windows (200 iterations) to warm up on the radon example, but when I use 6, it only takes 1 window (100 iterations). Presumably this is expected given more info from more chains?

Also, I’m on a 6-core hyperthreading cpu, so I thought I could go up to 12 chains, but when I try for 7 or greater I get the error There are not enough slots available in the system to satisfy the 12 slots that were requested by the application. Is it expected that the number of chains is limited by the physical core count and not the logical core count?

mike-lawrence · February 5, 2020, 9:41pm

Apparently yes, this is the default behaviour for MPI. To enable using logical cores rather than physical, the --use-hwthread-cpus argument is needed:

mpiexec -n $(nproc) --use-hwthread-cpus --tag-output ./radon sample data file=radon.data.R

But this ends up making things even slower. So I guess the default restriction to physical cores is there for a reason.

yizhang · February 5, 2020, 9:51pm

No, this depends on MPI configuration. I’m on a 4-core machine with total 8 threads but I can run 13 chains simultaneously, and this is allowed in MPI by default.

It’s possible that your MPI was set up with a localhost somewhere that limits the number of processes to hardware cores.
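
A minimal sketch of working around the slot limit, assuming the MPI in question is Open MPI (the libopenmpi-dev package in the install script above), whose mpiexec accepts `--oversubscribe` to start more processes than it counts slots. Other MPI implementations may not need or accept this flag. The function name `mpiexec_cmd` is hypothetical, and it only prints the command line rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print an mpiexec command line for a given number of
# chains, adding Open MPI's --oversubscribe only when the number of chains
# exceeds the number of slots (by default, physical cores) passed in.
#   $1 = number of chains, $2 = number of slots, rest = program and arguments
mpiexec_cmd() {
  chains=$1
  slots=$2
  shift 2
  if [ "$chains" -gt "$slots" ]; then
    echo "mpiexec -n $chains --oversubscribe --tag-output $*"
  else
    echo "mpiexec -n $chains --tag-output $*"
  fi
}

# Usage on a 6-core machine:
#   mpiexec_cmd 12 6 ./radon sample data file=radon.data.R
```

Oversubscribing only lifts the launch restriction; it does not add compute, so the slowdown seen with --use-hwthread-cpus may well show up here too.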

mike-lawrence · February 6, 2020, 2:51pm

So I ran the radon example, exploring the influence of number of chains contributing to the warmup on the time that warmup takes. I have a 6-core hyperthreading system, and ran from 2-12 chains, each 100 times and recording the duration of warmup. Here are the histograms of warmup times:

Now, it somewhat makes sense that warmup should slow down a bit with more cores as the campfire calculations need to wait until the slowest chain is ready. And possibly using more chains than physical cores will add some delay as hyperthreading isn’t equivalent to truly separate cores. But I’m surprised to see things slowing down so dramatically. Is this expected?
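
A minimal sketch of pulling the warmup durations out of a captured run, assuming the log was produced with --tag-output and contains "Elapsed Time: ... seconds (Warm-up)" lines in the format shown earlier in this thread. The function names `warmup_times` and `slowest_warmup` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<rank tag> <warmup seconds>" for every chain,
# assuming lines such as
#   [1,2]<stdout>: Elapsed Time: 20.1273 seconds (Warm-up)
warmup_times() {
  awk '/Elapsed Time:/ && /\(Warm-up\)/ {
         sub(/<stdout>:$/, "", $1)   # "[1,2]<stdout>:" -> "[1,2]"
         print $1, $4
       }' "$1"
}

# Hypothetical helper: print the largest warmup time in the log, i.e. the
# chain that the others would have had to wait for.
slowest_warmup() {
  warmup_times "$1" |
    awk 'NR == 1 || $2 + 0 > max + 0 { max = $2 } END { if (NR > 0) print max }'
}

# Usage:
#   mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R > run.log
#   warmup_times run.log
#   slowest_warmup run.log
```

Collecting the slowest value per run, rather than a mean across chains, matches the reasoning above that the cross-chain calculations wait for the slowest chain.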

avehtari · February 6, 2020, 3:52pm

Here are R scripts for running multi-chain adaptive warmup and classic warmup and getting the desired diagnostics.

We want to see if the adaptive warmup

reduces n_warmup from the default (adaptation is able to stop early)
reduces sum_warmup_leapfrogs (total adaptation computational cost is less)
reduces mean_warmup_leapfrogs (better mass-matrix adaptation)
reduces sum_leapfrogs and mean_leapfrogs (better mass-matrix adaptation)
increases bulk_ess_per_iter and tail_ess_per_iter (better mass-matrix adaptation)
increases bulk_ess_per_leapfrog and tail_ess_per_leapfrog (better mass-matrix adaptation)

Note! Right now we don’t care about wall clock time, and reporting anything per wall clock second is likely to be a distraction at this point. Running with different numbers of chains (>=4) is fine.

library(cmdstanr) # provides set_cmdstan_path() and cmdstan_model()
modelname = "normal" # assumes there exists a file normal.stan in the current working directory
data = list(D=32)

# multi-chain adaptive warmup
set_cmdstan_path("~/.cmdstanr/cmdstanmpi")
mpimodel = cmdstan_model(paste(modelname,".stan", sep=""), quiet = FALSE)
datapath = cmdstanr:::process_data(data)
system(paste("mpiexec -n 4 --tag-output ./", modelname, " sample save_warmup=1 data file=", datapath, sep=""))
stanfit <- rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
(n_warmup = stanfit@sim$warmup)
n_iter = stanfit@sim$iter-n_warmup
sampler_params <- rstan:::get_sampler_params(stanfit, inc_warmup = TRUE)
leapfrogs = sapply(sampler_params, function(x) x[, "n_leapfrog__"])
(sum_warmup_leapfrogs = sum(leapfrogs[1:n_warmup,]))
(sum_leapfrogs = sum(leapfrogs[n_warmup+(1:n_iter),]))
(mean_warmup_leapfrogs = sum_warmup_leapfrogs/n_warmup)
(mean_leapfrogs = sum_leapfrogs/n_iter)
mon = rstan::monitor(as.array(stanfit), warmup=0, print=FALSE)
(maxrhat = max(mon[,'Rhat']))
bulk_ess_per_iter = mon[,'Bulk_ESS']/n_iter
tail_ess_per_iter = mon[,'Tail_ESS']/n_iter
bulk_ess_per_leapfrog = mon[,'Bulk_ESS']/sum_leapfrogs
tail_ess_per_leapfrog = mon[,'Tail_ESS']/sum_leapfrogs
min(bulk_ess_per_iter)
min(tail_ess_per_iter)
min(bulk_ess_per_leapfrog)
min(tail_ess_per_leapfrog)
(stepsizes = sapply(sampler_params, function(x) x[, "stepsize__"])[n_iter,])

# classic warmup
set_cmdstan_path("~/.cmdstanr/cmdstan")
model = cmdstan_model(paste(modelname,".stan", sep=""), quiet = FALSE)
fit = model$sample(data=data, save_warmup=1)
stanfit <- rstan::read_stan_csv(fit$output_files())
(n_warmup = stanfit@sim$warmup)
n_iter = stanfit@sim$iter-n_warmup
sampler_params <- rstan:::get_sampler_params(stanfit, inc_warmup = TRUE)
leapfrogs = sapply(sampler_params, function(x) x[, "n_leapfrog__"])
(sum_warmup_leapfrogs = sum(leapfrogs[1:n_warmup,]))
(sum_leapfrogs = sum(leapfrogs[n_warmup+(1:n_iter),]))
(mean_warmup_leapfrogs = sum_warmup_leapfrogs/n_warmup)
(mean_leapfrogs = sum_leapfrogs/n_iter)
mon = rstan::monitor(as.array(stanfit), warmup=0, print=FALSE)
(maxrhat = max(mon[,'Rhat']))
bulk_ess_per_iter = mon[,'Bulk_ESS']/n_iter
tail_ess_per_iter = mon[,'Tail_ESS']/n_iter
bulk_ess_per_leapfrog = mon[,'Bulk_ESS']/sum_leapfrogs
tail_ess_per_leapfrog = mon[,'Tail_ESS']/sum_leapfrogs
min(bulk_ess_per_iter)
min(tail_ess_per_iter)
min(bulk_ess_per_leapfrog)
min(tail_ess_per_leapfrog)
(stepsizes = sapply(sampler_params, function(x) x[, "stepsize__"])[n_iter,])

EDIT: process_data fix. EDIT2: monitor fix. EDIT3: added *_ess_per_leapfrog and printing. EDIT4: added stepsizes. EDIT5: added maxrhat. EDIT6: fixed monitor to show correct info.

avehtari · February 6, 2020, 8:10pm

Note: @yizhang just a moment ago fixed a bug in setting the stepsize after warmup. You need to pull or clone the latest version (I had to clone the whole repo as make clean didn’t work correctly). Results should be more sensible now.

bbbales2 · February 6, 2020, 8:37pm

@avehtari there’s also make clean-all if make clean doesn’t do the job.

@yizhang I was working on getting mpi cmdstan working in cmdstanr.

I modified the output filename logic to look like this:

if (Session::is_in_inter_chain_comm(num_chains)) {
  const Communicator& comm = Session::inter_chain_comm(num_chains);
  string_argument* p = dynamic_cast<string_argument*>(parser.arg("output")->arg("file"));
  string_argument* pd = dynamic_cast<string_argument*>(parser.arg("output")->arg("diagnostic_file"));

  std::string chain_output_name = p -> value() + "." + "mpi." + std::to_string(comm.rank());
  p -> set_value(chain_output_name);

  // Only rewrite diagnostic filename if one is given
  if((pd -> value()).length() > 0) {
    std::string chain_diagnostic_output_name = pd -> value() + "." + "mpi." + std::to_string(comm.rank());
    pd -> set_value(chain_diagnostic_output_name);
  }
}

Is there a way I can detect that only 1 process is running so I can tell it to use the non-MPI naming pattern?
