Just pushed a hacky solution: using awk to replace num_warmup in the CSV with the actual value calculated on the fly. Works on my Mac and Ubuntu; @avehtari, would you give it a try?
Edit: pushed a much nicer solution by @rok_cesnovar
Thanks. That’s helpful.
BTW, in case I stepped on anyone’s toes, sorry for that.
It’s much clearer now to me what the rationale is and most importantly why.
I’d love to play with this; any guidance on how to set it up with cmdstanr?
@yizhang Do I need to do something more than pull the latest commits from branch mpi_warmup_framework and recompile? When I did that, the error changed:
stanfit <- rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
Error in all_int_eq(warmup) : not all are integers
Clone the experimental cmdstan branch:
git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git
Follow the other instructions in the first post about MPI, compilation and running. If running the radon example from the command line works, then in R:
CMDSTANMPIPATH = "~/.cmdstanr/cmdstanmpi"
set_cmdstan_path(CMDSTANMPIPATH)
and then follow the instructions in the post Cross-chain warmup adaptation using MPI - #19 by avehtari.
Right now there is still a problem with easy access to warmup iteration info; after that has been solved I’ll make a new post (or edit older ones) to have all instructions in one place.
Did you get both the latest cmdstan & stan? You can check this by looking at the new CSV output. The new ones should have the new argument max_num_warmup:
...
# num_samples = 1000 (Default)
# max_num_warmup = 1000 (Default)
...
and print the actual num_warmup when warmup terminates:
...
# num_warmup = 750
# Adaptation terminated
# Step size = 0.0828001
...
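The check described above can be scripted. This is only a sketch: the function names are made up for illustration, and it assumes the comment lines look exactly like the excerpts shown here (`# max_num_warmup = ...` in the header and `# num_warmup = ...` written when adaptation terminates).

```shell
# Hypothetical helpers, not part of cmdstan.

# Print the actual warmup length from a line such as "# num_warmup = 750".
# Prints nothing if the file has no such line (e.g. an older build).
actual_num_warmup() {
  awk '/^#[ \t]*num_warmup[ \t]*=/ { print $4; exit }' "$1"
}

# Succeed only if the CSV header mentions the new max_num_warmup argument.
has_max_num_warmup() {
  grep -Eq '^#[[:space:]]*max_num_warmup[[:space:]]*=' "$1"
}
```

For example, `has_max_num_warmup mpi.0.output.csv && actual_num_warmup mpi.0.output.csv` prints the warmup length only when the file was written by the new build.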
@bbbales2 you mentioned earlier that you had to rebuild cmdstan; is this one of those situations?
t31300-lr010 ~/.cmdstanr/cmdstanmpi % git pull
remote: Enumerating objects: 89, done.
remote: Counting objects: 100% (65/65), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 34 (delta 16), reused 31 (delta 13), pack-reused 0
Unpacking objects: 100% (34/34), done.
From https://github.com/stan-dev/cmdstan
2d22968..b16d18d mpi_warmup_framework -> origin/mpi_warmup_framework
26f0e77..c133f17 develop -> origin/develop
* [new branch] feature/809-stanc-args -> origin/feature/809-stanc-args
Fetching submodule stan
remote: Enumerating objects: 45, done.
remote: Counting objects: 100% (45/45), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 28 (delta 17), reused 25 (delta 14), pack-reused 0
Unpacking objects: 100% (28/28), done.
From https://github.com/stan-dev/stan
a461719..72fe227 develop -> origin/develop
57949e3..ad706bc mpi_warmup_framework -> origin/mpi_warmup_framework
...
# method = sample (Default)
# sample
# num_samples = 1000 (Default)
# max_num_warmup = 1000 (Default)
...
# Adaptation terminated
# Step size = 0.463858
# Diagonal elements of inverse mass matrix:
....
I did
make clean-all
make build -j 4
Somehow num_warmup didn’t get printed. But I just double-checked and it works on both my local Mac and a Linux server. Let me dig a bit more.
It works if I clone again. This is now the second time that make clean-all seems to fail.
I’ll make an R code example for handling warmup iterations tomorrow.
Did the latest commits change the stdout? Here’s what I get for the radon example (I expected the campfire messages at the end of each window but don’t see any):
[1,0]<stdout>:method = sample (Default)
[1,0]<stdout>: sample
[1,0]<stdout>: num_samples = 1000 (Default)
[1,0]<stdout>: num_warmup = 1000 (Default)
[1,0]<stdout>: save_warmup = 0 (Default)
[1,0]<stdout>: thin = 1 (Default)
[1,0]<stdout>: adapt
[1,0]<stdout>: engaged = 1 (Default)
[1,0]<stdout>: gamma = 0.050000000000000003 (Default)
[1,0]<stdout>: delta = 0.80000000000000004 (Default)
[1,0]<stdout>: kappa = 0.75 (Default)
[1,0]<stdout>: t0 = 10 (Default)
[1,0]<stdout>: init_buffer = 75 (Default)
[1,0]<stdout>: term_buffer = 50 (Default)
[1,0]<stdout>: window = 25 (Default)
[1,0]<stdout>: num_cross_chains = 1 (Default)
[1,0]<stdout>: cross_chain_window = 100 (Default)
[1,0]<stdout>: cross_chain_rhat = 1.05 (Default)
[1,0]<stdout>: cross_chain_ess = 50 (Default)
[1,0]<stdout>: algorithm = hmc (Default)
[1,0]<stdout>: hmc
[1,0]<stdout>: engine = nuts (Default)
[1,0]<stdout>: nuts
[1,0]<stdout>: max_depth = 10 (Default)
[1,0]<stdout>: metric = diag_e (Default)
[1,0]<stdout>: metric_file = (Default)
[1,0]<stdout>: stepsize = 1 (Default)
[1,0]<stdout>: stepsize_jitter = 0 (Default)
[1,0]<stdout>:id = 0 (Default)
[1,0]<stdout>:data
[1,0]<stdout>: file = radon.data.R
[1,0]<stdout>:init = 2 (Default)
[1,0]<stdout>:random
[1,0]<stdout>: seed = -1 (Default)
[1,0]<stdout>:output
[1,0]<stdout>: file = output.csv (Default)
[1,0]<stdout>: diagnostic_file = (Default)
[1,0]<stdout>: refresh = 100 (Default)
[1,0]<stdout>:
[1,1]<stdout>:method = sample (Default)
[1,1]<stdout>: sample
[1,1]<stdout>: num_samples = 1000 (Default)
[1,1]<stdout>: num_warmup = 1000 (Default)
[1,1]<stdout>: save_warmup = 0 (Default)
[1,1]<stdout>: thin = 1 (Default)
[1,1]<stdout>: adapt
[1,1]<stdout>: engaged = 1 (Default)
[1,1]<stdout>: gamma = 0.050000000000000003 (Default)
[1,1]<stdout>: delta = 0.80000000000000004 (Default)
[1,1]<stdout>: kappa = 0.75 (Default)
[1,1]<stdout>: t0 = 10 (Default)
[1,1]<stdout>: init_buffer = 75 (Default)
[1,1]<stdout>: term_buffer = 50 (Default)
[1,1]<stdout>: window = 25 (Default)
[1,1]<stdout>: num_cross_chains = 1 (Default)
[1,1]<stdout>: cross_chain_window = 100 (Default)
[1,1]<stdout>: cross_chain_rhat = 1.05 (Default)
[1,1]<stdout>: cross_chain_ess = 50 (Default)
[1,1]<stdout>: algorithm = hmc (Default)
[1,1]<stdout>: hmc
[1,1]<stdout>: engine = nuts (Default)
[1,1]<stdout>: nuts
[1,1]<stdout>: max_depth = 10 (Default)
[1,1]<stdout>: metric = diag_e (Default)
[1,1]<stdout>: metric_file = (Default)
[1,1]<stdout>: stepsize = 1 (Default)
[1,1]<stdout>: stepsize_jitter = 0 (Default)
[1,1]<stdout>:id = 0 (Default)
[1,1]<stdout>:data
[1,1]<stdout>: file = radon.data.R
[1,1]<stdout>:init = 2 (Default)
[1,1]<stdout>:random
[1,1]<stdout>: seed = -1 (Default)
[1,1]<stdout>:output
[1,1]<stdout>: file = output.csv (Default)
[1,1]<stdout>: diagnostic_file = (Default)
[1,1]<stdout>: refresh = 100 (Default)
[1,1]<stdout>:
[1,2]<stdout>:method = sample (Default)
[1,2]<stdout>: sample
[1,2]<stdout>: num_samples = 1000 (Default)
[1,2]<stdout>: num_warmup = 1000 (Default)
[1,2]<stdout>: save_warmup = 0 (Default)
[1,2]<stdout>: thin = 1 (Default)
[1,2]<stdout>: adapt
[1,2]<stdout>: engaged = 1 (Default)
[1,2]<stdout>: gamma = 0.050000000000000003 (Default)
[1,2]<stdout>: delta = 0.80000000000000004 (Default)
[1,2]<stdout>: kappa = 0.75 (Default)
[1,2]<stdout>: t0 = 10 (Default)
[1,2]<stdout>: init_buffer = 75 (Default)
[1,2]<stdout>: term_buffer = 50 (Default)
[1,2]<stdout>: window = 25 (Default)
[1,2]<stdout>: num_cross_chains = 1 (Default)
[1,2]<stdout>: cross_chain_window = 100 (Default)
[1,2]<stdout>: cross_chain_rhat = 1.05 (Default)
[1,2]<stdout>: cross_chain_ess = 50 (Default)
[1,2]<stdout>: algorithm = hmc (Default)
[1,2]<stdout>: hmc
[1,2]<stdout>: engine = nuts (Default)
[1,2]<stdout>: nuts
[1,2]<stdout>: max_depth = 10 (Default)
[1,2]<stdout>: metric = diag_e (Default)
[1,2]<stdout>: metric_file = (Default)
[1,2]<stdout>: stepsize = 1 (Default)
[1,2]<stdout>: stepsize_jitter = 0 (Default)
[1,2]<stdout>:id = 0 (Default)
[1,2]<stdout>:data
[1,2]<stdout>: file = radon.data.R
[1,2]<stdout>:init = 2 (Default)
[1,2]<stdout>:random
[1,2]<stdout>: seed = -1 (Default)
[1,2]<stdout>:output
[1,2]<stdout>: file = output.csv (Default)
[1,2]<stdout>: diagnostic_file = (Default)
[1,2]<stdout>: refresh = 100 (Default)
[1,2]<stdout>:
[1,3]<stdout>:method = sample (Default)
[1,3]<stdout>: sample
[1,3]<stdout>: num_samples = 1000 (Default)
[1,3]<stdout>: num_warmup = 1000 (Default)
[1,3]<stdout>: save_warmup = 0 (Default)
[1,3]<stdout>: thin = 1 (Default)
[1,3]<stdout>: adapt
[1,3]<stdout>: engaged = 1 (Default)
[1,3]<stdout>: gamma = 0.050000000000000003 (Default)
[1,3]<stdout>: delta = 0.80000000000000004 (Default)
[1,3]<stdout>: kappa = 0.75 (Default)
[1,3]<stdout>: t0 = 10 (Default)
[1,3]<stdout>: init_buffer = 75 (Default)
[1,3]<stdout>: term_buffer = 50 (Default)
[1,3]<stdout>: window = 25 (Default)
[1,3]<stdout>: num_cross_chains = 1 (Default)
[1,3]<stdout>: cross_chain_window = 100 (Default)
[1,3]<stdout>: cross_chain_rhat = 1.05 (Default)
[1,3]<stdout>: cross_chain_ess = 50 (Default)
[1,3]<stdout>: algorithm = hmc (Default)
[1,3]<stdout>: hmc
[1,3]<stdout>: engine = nuts (Default)
[1,3]<stdout>: nuts
[1,3]<stdout>: max_depth = 10 (Default)
[1,3]<stdout>: metric = diag_e (Default)
[1,3]<stdout>: metric_file = (Default)
[1,3]<stdout>: stepsize = 1 (Default)
[1,3]<stdout>: stepsize_jitter = 0 (Default)
[1,3]<stdout>:id = 0 (Default)
[1,3]<stdout>:data
[1,3]<stdout>: file = radon.data.R
[1,3]<stdout>:init = 2 (Default)
[1,3]<stdout>:random
[1,3]<stdout>: seed = -1 (Default)
[1,3]<stdout>:output
[1,3]<stdout>: file = output.csv (Default)
[1,3]<stdout>: diagnostic_file = (Default)
[1,3]<stdout>: refresh = 100 (Default)
[1,3]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>:Gradient evaluation took 0.00114 seconds
[1,0]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 11.4 seconds.
[1,0]<stdout>:Adjust your expectations accordingly!
[1,0]<stdout>:
[1,0]<stdout>:
[1,2]<stdout>:
[1,2]<stdout>:Gradient evaluation took 0.000936 seconds
[1,2]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 9.36 seconds.
[1,2]<stdout>:Adjust your expectations accordingly!
[1,2]<stdout>:
[1,2]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:Gradient evaluation took 0.001084 seconds
[1,1]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 10.84 seconds.
[1,1]<stdout>:Adjust your expectations accordingly!
[1,1]<stdout>:
[1,1]<stdout>:
[1,3]<stdout>:
[1,3]<stdout>:Gradient evaluation took 0.000926 seconds
[1,3]<stdout>:1000 transitions using 10 leapfrog steps per transition would take 9.26 seconds.
[1,3]<stdout>:Adjust your expectations accordingly!
[1,3]<stdout>:
[1,3]<stdout>:
[1,0]<stdout>:Iteration: 1 / 2000 [ 0%] (Warmup)
[1,2]<stdout>:Iteration: 1 / 2000 [ 0%] (Warmup)
[1,1]<stdout>:Iteration: 1 / 2000 [ 0%] (Warmup)
[1,3]<stdout>:Iteration: 1 / 2000 [ 0%] (Warmup)
[1,2]<stdout>:Iteration: 100 / 2000 [ 5%] (Warmup)
[1,1]<stdout>:Iteration: 100 / 2000 [ 5%] (Warmup)
[1,2]<stdout>:Iteration: 200 / 2000 [ 10%] (Warmup)
[1,1]<stdout>:Iteration: 200 / 2000 [ 10%] (Warmup)
[1,2]<stdout>:Iteration: 300 / 2000 [ 15%] (Warmup)
[1,1]<stdout>:Iteration: 300 / 2000 [ 15%] (Warmup)
[1,0]<stdout>:Iteration: 100 / 2000 [ 5%] (Warmup)
[1,2]<stdout>:Iteration: 400 / 2000 [ 20%] (Warmup)
[1,3]<stdout>:Iteration: 100 / 2000 [ 5%] (Warmup)
[1,1]<stdout>:Iteration: 400 / 2000 [ 20%] (Warmup)
[1,2]<stdout>:Iteration: 500 / 2000 [ 25%] (Warmup)
[1,0]<stdout>:Iteration: 200 / 2000 [ 10%] (Warmup)
[1,1]<stdout>:Iteration: 500 / 2000 [ 25%] (Warmup)
[1,3]<stdout>:Iteration: 200 / 2000 [ 10%] (Warmup)
[1,2]<stdout>:Iteration: 600 / 2000 [ 30%] (Warmup)
[1,0]<stdout>:Iteration: 300 / 2000 [ 15%] (Warmup)
[1,1]<stdout>:Iteration: 600 / 2000 [ 30%] (Warmup)
[1,3]<stdout>:Iteration: 300 / 2000 [ 15%] (Warmup)
[1,2]<stdout>:Iteration: 700 / 2000 [ 35%] (Warmup)
[1,0]<stdout>:Iteration: 400 / 2000 [ 20%] (Warmup)
[1,1]<stdout>:Iteration: 700 / 2000 [ 35%] (Warmup)
[1,3]<stdout>:Iteration: 400 / 2000 [ 20%] (Warmup)
[1,2]<stdout>:Iteration: 800 / 2000 [ 40%] (Warmup)
[1,0]<stdout>:Iteration: 500 / 2000 [ 25%] (Warmup)
[1,1]<stdout>:Iteration: 800 / 2000 [ 40%] (Warmup)
[1,3]<stdout>:Iteration: 500 / 2000 [ 25%] (Warmup)
[1,2]<stdout>:Iteration: 900 / 2000 [ 45%] (Warmup)
[1,0]<stdout>:Iteration: 600 / 2000 [ 30%] (Warmup)
[1,1]<stdout>:Iteration: 900 / 2000 [ 45%] (Warmup)
[1,3]<stdout>:Iteration: 600 / 2000 [ 30%] (Warmup)
[1,2]<stdout>:Iteration: 1000 / 2000 [ 50%] (Warmup)
[1,2]<stdout>:Iteration: 1001 / 2000 [ 50%] (Sampling)
[1,0]<stdout>:Iteration: 700 / 2000 [ 35%] (Warmup)
[1,1]<stdout>:Iteration: 1000 / 2000 [ 50%] (Warmup)
[1,1]<stdout>:Iteration: 1001 / 2000 [ 50%] (Sampling)
[1,3]<stdout>:Iteration: 700 / 2000 [ 35%] (Warmup)
[1,2]<stdout>:Iteration: 1100 / 2000 [ 55%] (Sampling)
[1,0]<stdout>:Iteration: 800 / 2000 [ 40%] (Warmup)
[1,1]<stdout>:Iteration: 1100 / 2000 [ 55%] (Sampling)
[1,3]<stdout>:Iteration: 800 / 2000 [ 40%] (Warmup)
[1,2]<stdout>:Iteration: 1200 / 2000 [ 60%] (Sampling)
[1,0]<stdout>:Iteration: 900 / 2000 [ 45%] (Warmup)
[1,3]<stdout>:Iteration: 900 / 2000 [ 45%] (Warmup)
[1,1]<stdout>:Iteration: 1200 / 2000 [ 60%] (Sampling)
[1,2]<stdout>:Iteration: 1300 / 2000 [ 65%] (Sampling)
[1,0]<stdout>:Iteration: 1000 / 2000 [ 50%] (Warmup)
[1,0]<stdout>:Iteration: 1001 / 2000 [ 50%] (Sampling)
[1,1]<stdout>:Iteration: 1300 / 2000 [ 65%] (Sampling)
[1,3]<stdout>:Iteration: 1000 / 2000 [ 50%] (Warmup)
[1,3]<stdout>:Iteration: 1001 / 2000 [ 50%] (Sampling)
[1,2]<stdout>:Iteration: 1400 / 2000 [ 70%] (Sampling)
[1,0]<stdout>:Iteration: 1100 / 2000 [ 55%] (Sampling)
[1,1]<stdout>:Iteration: 1400 / 2000 [ 70%] (Sampling)
[1,3]<stdout>:Iteration: 1100 / 2000 [ 55%] (Sampling)
[1,2]<stdout>:Iteration: 1500 / 2000 [ 75%] (Sampling)
[1,0]<stdout>:Iteration: 1200 / 2000 [ 60%] (Sampling)
[1,1]<stdout>:Iteration: 1500 / 2000 [ 75%] (Sampling)
[1,3]<stdout>:Iteration: 1200 / 2000 [ 60%] (Sampling)
[1,2]<stdout>:Iteration: 1600 / 2000 [ 80%] (Sampling)
[1,0]<stdout>:Iteration: 1300 / 2000 [ 65%] (Sampling)
[1,1]<stdout>:Iteration: 1600 / 2000 [ 80%] (Sampling)
[1,3]<stdout>:Iteration: 1300 / 2000 [ 65%] (Sampling)
[1,2]<stdout>:Iteration: 1700 / 2000 [ 85%] (Sampling)
[1,0]<stdout>:Iteration: 1400 / 2000 [ 70%] (Sampling)
[1,1]<stdout>:Iteration: 1700 / 2000 [ 85%] (Sampling)
[1,3]<stdout>:Iteration: 1400 / 2000 [ 70%] (Sampling)
[1,2]<stdout>:Iteration: 1800 / 2000 [ 90%] (Sampling)
[1,0]<stdout>:Iteration: 1500 / 2000 [ 75%] (Sampling)
[1,1]<stdout>:Iteration: 1800 / 2000 [ 90%] (Sampling)
[1,3]<stdout>:Iteration: 1500 / 2000 [ 75%] (Sampling)
[1,0]<stdout>:Iteration: 1600 / 2000 [ 80%] (Sampling)
[1,2]<stdout>:Iteration: 1900 / 2000 [ 95%] (Sampling)
[1,1]<stdout>:Iteration: 1900 / 2000 [ 95%] (Sampling)
[1,3]<stdout>:Iteration: 1600 / 2000 [ 80%] (Sampling)
[1,0]<stdout>:Iteration: 1700 / 2000 [ 85%] (Sampling)
[1,2]<stdout>:Iteration: 2000 / 2000 [100%] (Sampling)
[1,2]<stdout>:
[1,2]<stdout>: Elapsed Time: 20.1273 seconds (Warm-up)
[1,2]<stdout>: 9.63335 seconds (Sampling)
[1,2]<stdout>: 29.7606 seconds (Total)
[1,2]<stdout>:
[1,1]<stdout>:Iteration: 2000 / 2000 [100%] (Sampling)
[1,3]<stdout>:Iteration: 1700 / 2000 [ 85%] (Sampling)
[1,1]<stdout>:
[1,1]<stdout>: Elapsed Time: 20.5619 seconds (Warm-up)
[1,1]<stdout>: 9.40207 seconds (Sampling)
[1,1]<stdout>: 29.9639 seconds (Total)
[1,1]<stdout>:
[1,0]<stdout>:Iteration: 1800 / 2000 [ 90%] (Sampling)
[1,3]<stdout>:Iteration: 1800 / 2000 [ 90%] (Sampling)
[1,0]<stdout>:Iteration: 1900 / 2000 [ 95%] (Sampling)
[1,3]<stdout>:Iteration: 1900 / 2000 [ 95%] (Sampling)
[1,0]<stdout>:Iteration: 2000 / 2000 [100%] (Sampling)
[1,0]<stdout>:
[1,0]<stdout>: Elapsed Time: 23.0489 seconds (Warm-up)
[1,0]<stdout>: 7.9792 seconds (Sampling)
[1,0]<stdout>: 31.0281 seconds (Total)
[1,0]<stdout>:
[1,3]<stdout>:Iteration: 2000 / 2000 [100%] (Sampling)
[1,3]<stdout>:
[1,3]<stdout>: Elapsed Time: 23.4612 seconds (Warm-up)
[1,3]<stdout>: 7.70702 seconds (Sampling)
[1,3]<stdout>: 31.1683 seconds (Total)
[1,3]<stdout>:
Likely there’s some problem with your compilation. It seems you are running a sequential version. The latest commits should give you max_num_warmup = 1000 (Default) instead of num_warmup = 1000 (Default), and you’ll see cross-chain window adaptation in stdout. If there’s MPI_ADAPTED_WARMUP=1 in your make/local, check in the compilation stdout whether the compiler is mpicxx instead of g++ or clang++.
Hm. I tried going back to the beginning of my install script, and now I don’t get any output on stdout when running the model, and the radon processes are pinning my CPU at max. Below is my install script (Ubuntu 19.10); see anything obviously wrong?
#install openmpi
sudo apt install libopenmpi-dev #installs headers at `/usr/lib/x86_64-linux-gnu/openmpi/include`
#clone the campfire branch
git clone --recursive --branch mpi_warmup_framework https://github.com/stan-dev/cmdstan.git cmdstan_campfire
#navigate into the repo
cd cmdstan_campfire
#add make/local
cat <<EOT >> make/local
LDLIBS+=-lpthread
STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=gcc
MPI_ADAPTED_WARMUP = 1
CXXFLAGS += -isystem /usr/lib/x86_64-linux-gnu/openmpi/include
EOT
#clean and build cmdstan stuff
make clean-all
make build -j $(nproc)
#make & run the radon example
make examples/radon/radon
cd examples/radon
mpiexec -n 4 --tag-output ./radon sample data file=radon.data.R
And in case it helps, here’s the output from make examples/radon/radon:
--- Compiling, linking C++ code ---
mpicxx -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -DSTAN_LANG_MPI -DMPI_ADAPTED_WARMUP -std=c++1y -D_REENTRANT -Wno-sign-compare -Wno-delete-non-virtual-dtor -I stan/lib/stan_math/lib/tbb_2019_U8/include -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include -DBOOST_DISABLE_ASSERTS -DSTAN_MPI -c -x c++ -o examples/radon/radon.o examples/radon/radon.hpp
mpicxx -isystem /usr/lib/x86_64-linux-gnu/openmpi/include -DSTAN_LANG_MPI -DMPI_ADAPTED_WARMUP -std=c++1y -D_REENTRANT -Wno-sign-compare -Wno-delete-non-virtual-dtor -I stan/lib/stan_math/lib/tbb_2019_U8/include -O3 -I src -I stan/src -I lib/rapidjson_1.1.0/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.3.3 -I stan/lib/stan_math/lib/boost_1.69.0 -I stan/lib/stan_math/lib/sundials_4.1.0/include -DBOOST_DISABLE_ASSERTS -DSTAN_MPI -Wl,-L,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/boost_1.69.0/stage/lib" -Wl,-rpath,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/boost_1.69.0/stage/lib" -Wl,-L,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/tbb" -Wl,-rpath,"/home/mike/cmdstan_campfire/stan/lib/stan_math/lib/tbb" examples/radon/radon.o src/cmdstan/main.o -lpthread stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_nvecserial.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_cvodes.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_idas.a stan/lib/stan_math/lib/sundials_4.1.0/lib/libsundials_kinsol.a stan/lib/stan_math/lib/boost_1.69.0/stage/lib/libboost_serialization.so stan/lib/stan_math/lib/boost_1.69.0/stage/lib/libboost_mpi.so stan/lib/stan_math/stan/math/prim/arr/functor/mpi_cluster_inst.o stan/lib/stan_math/lib/tbb/libtbb.so.2 -o examples/radon/radon
You cannot use STAN_MPI. Sorry I wasn’t clear. MPI_ADAPTED_WARMUP = 1 would suffice.
There’s a longer version of this answer but let’s not diverge.
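The advice in this exchange (set MPI_ADAPTED_WARMUP = 1, do not set STAN_MPI) can be checked against make/local before rebuilding. A minimal sketch; `check_make_local` is a made-up helper name, and it only looks at plain variable assignments in the file.

```shell
# Hypothetical helper, not part of cmdstan.
# Returns 0 if the given make/local enables MPI_ADAPTED_WARMUP
# and does not set STAN_MPI; otherwise prints a hint and returns 1.
check_make_local() {
  local f="$1"
  if grep -Eq '^[[:space:]]*STAN_MPI[[:space:]]*[+:?]?=' "$f"; then
    echo "STAN_MPI is set in $f; remove it for this branch" >&2
    return 1
  fi
  if ! grep -Eq '^[[:space:]]*MPI_ADAPTED_WARMUP[[:space:]]*[+:?]?=[[:space:]]*1' "$f"; then
    echo "MPI_ADAPTED_WARMUP = 1 not found in $f" >&2
    return 1
  fi
  return 0
}
```

Run it as `check_make_local make/local` from the cmdstan directory. It does not replace checking the compilation output for mpicxx.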
Thanks! Works great now. I notice that when I use 4 chains, it takes 2 windows (200 iterations) to warm up on the radon example, but when I use 6, it only takes 1 window (100 iterations). Presumably this is expected given more info from more chains?
Also, I’m on a 6-core hyperthreading CPU, so I thought I could go up to 12 chains, but when I try 7 or more I get the error "There are not enough slots available in the system to satisfy the 12 slots that were requested by the application". Is it expected that the number of chains is limited by the physical core count and not the logical core count?
Apparently yes, this is the default behaviour for MPI. To use logical cores rather than physical ones, the --use-hwthread-cpus argument is needed:
mpiexec -n $(nproc) --use-hwthread-cpus --tag-output ./radon sample data file=radon.data.R
But this ends up making things even slower, so I guess the default restriction to physical cores is there for a reason.
No, this depends on the MPI configuration. I’m on a 4-core machine with 8 threads in total, but I can run 13 chains simultaneously, and this is allowed in MPI by default.
It’s possible that your MPI was set up with a localhost entry somewhere that limits the number of processes to hardware cores.
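For a setup like the one in the question, where slots default to physical cores, the launch command could be built so that --use-hwthread-cpus is added only when more chains than physical cores are requested. This is a sketch, not a recommendation: `radon_mpi_cmd` is a made-up name, the physical core count is passed in by hand, and whether the flag is needed at all depends on the MPI configuration, as noted above.

```shell
# Hypothetical helper: echo (do not run) an mpiexec command line.
# $1 = number of chains, $2 = number of physical cores on this machine.
radon_mpi_cmd() {
  local n_chains="$1"
  local n_physical="$2"
  local extra=""
  # Only ask for hardware threads when physical cores are not enough.
  if [ "$n_chains" -gt "$n_physical" ]; then
    extra=" --use-hwthread-cpus"
  fi
  echo "mpiexec -n ${n_chains}${extra} --tag-output ./radon sample data file=radon.data.R"
}
```

Something like `eval "$(radon_mpi_cmd 12 6)"` would then launch 12 chains on the 6-core machine, with the slowdown already reported above.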
So I ran the radon example, exploring how the number of chains contributing to the warmup affects the time that warmup takes. I have a 6-core hyperthreading system and ran from 2 to 12 chains, 100 times each, recording the duration of warmup. Here are the histograms of warmup times:
Now, it somewhat makes sense that warmup should slow down a bit with more cores as the campfire calculations need to wait until the slowest chain is ready. And possibly using more chains than physical cores will add some delay as hyperthreading isn’t equivalent to truly separate cores. But I’m surprised to see things slowing down so dramatically. Is this expected?
Here are R scripts for running multi-chain adaptive warmup and classic warmup and getting the desired diagnostics.
We want to see whether the adaptive warmup improves on classic warmup in the following quantities:
- n_warmup, relative to the default (adaptation is able to stop early)
- sum_warmup_leapfrogs (total adaptation computational cost is less)
- mean_warmup_leapfrogs (better mass-matrix adaptation)
- sum_leapfrogs and mean_leapfrogs (better mass-matrix adaptation)
- bulk_ess_per_iter and tail_ess_per_iter (better mass-matrix adaptation)
- bulk_ess_per_leapfrog and tail_ess_per_leapfrog (better mass-matrix adaptation)
Note! Right now we don’t care about wall-clock time, and reporting anything per wall-clock second is likely to be a distraction at this point. Running with different numbers of chains (>=4) is fine.
library(cmdstanr) # provides set_cmdstan_path() and cmdstan_model()
modelname = "normal" # assumes there exists a file normal.stan in the current working directory
data = list(D=32)
# multi-chain adaptive warmup
set_cmdstan_path("~/.cmdstanr/cmdstanmpi")
mpimodel = cmdstan_model(paste(modelname,".stan", sep=""), quiet = FALSE)
datapath = cmdstanr:::process_data(data)
system(paste("mpiexec -n 4 --tag-output ./", modelname, " sample save_warmup=1 data file=", datapath, sep=""))
stanfit <- rstan::read_stan_csv(c("mpi.0.output.csv","mpi.1.output.csv","mpi.2.output.csv","mpi.3.output.csv"))
(n_warmup = stanfit@sim$warmup)
n_iter = stanfit@sim$iter-n_warmup
sampler_params <- rstan:::get_sampler_params(stanfit, inc_warmup = TRUE)
leapfrogs = sapply(sampler_params, function(x) x[, "n_leapfrog__"])
(sum_warmup_leapfrogs = sum(leapfrogs[1:n_warmup,]))
(sum_leapfrogs = sum(leapfrogs[n_warmup+(1:n_iter),]))
(mean_warmup_leapfrogs = sum_warmup_leapfrogs/n_warmup)
(mean_leapfrogs = sum_leapfrogs/n_iter)
mon = rstan::monitor(as.array(stanfit), warmup=0, print=FALSE)
(maxrhat = max(mon[,'Rhat']))
bulk_ess_per_iter = mon[,'Bulk_ESS']/n_iter
tail_ess_per_iter = mon[,'Tail_ESS']/n_iter
bulk_ess_per_leapfrog = mon[,'Bulk_ESS']/sum_leapfrogs
tail_ess_per_leapfrog = mon[,'Tail_ESS']/sum_leapfrogs
min(bulk_ess_per_iter)
min(tail_ess_per_iter)
min(bulk_ess_per_leapfrog)
min(tail_ess_per_leapfrog)
(stepsizes = sapply(sampler_params, function(x) x[, "stepsize__"])[n_iter,])
# classic warmup
set_cmdstan_path("~/.cmdstanr/cmdstan")
model = cmdstan_model(paste(modelname,".stan", sep=""), quiet = FALSE)
fit = model$sample(data=data, save_warmup=1)
stanfit <- rstan::read_stan_csv(fit$output_files())
(n_warmup = stanfit@sim$warmup)
n_iter = stanfit@sim$iter-n_warmup
sampler_params <- rstan:::get_sampler_params(stanfit, inc_warmup = TRUE)
leapfrogs = sapply(sampler_params, function(x) x[, "n_leapfrog__"])
(sum_warmup_leapfrogs = sum(leapfrogs[1:n_warmup,]))
(sum_leapfrogs = sum(leapfrogs[n_warmup+(1:n_iter),]))
(mean_warmup_leapfrogs = sum_warmup_leapfrogs/n_warmup)
(mean_leapfrogs = sum_leapfrogs/n_iter)
mon = rstan::monitor(as.array(stanfit), warmup=0, print=FALSE)
(maxrhat = max(mon[,'Rhat']))
bulk_ess_per_iter = mon[,'Bulk_ESS']/n_iter
tail_ess_per_iter = mon[,'Tail_ESS']/n_iter
bulk_ess_per_leapfrog = mon[,'Bulk_ESS']/sum_leapfrogs
tail_ess_per_leapfrog = mon[,'Tail_ESS']/sum_leapfrogs
min(bulk_ess_per_iter)
min(tail_ess_per_iter)
min(bulk_ess_per_leapfrog)
min(tail_ess_per_leapfrog)
(stepsizes = sapply(sampler_params, function(x) x[, "stepsize__"])[n_iter,])
EDIT: process_data fix. EDIT2: monitor fix. EDIT3: added *_ess_per_leapfrog and printing, EDIT4 added stepsizes. EDIT5: added maxrhat. EDIT6: fixed monitor to show correct info.
Note: @yizhang just a moment ago fixed a bug in setting the stepsize after warmup. You need to pull or clone the latest version (I had to clone the whole repo as make clean didn’t work correctly). Results should be more sensible now.
@avehtari there’s also make clean-all if make clean doesn’t do the job.
@yizhang I was working on getting mpi cmdstan working in cmdstanr.
I modified the output filename logic to look like this:
if (Session::is_in_inter_chain_comm(num_chains)) {
  const Communicator& comm = Session::inter_chain_comm(num_chains);
  string_argument* p = dynamic_cast<string_argument*>(parser.arg("output")->arg("file"));
  string_argument* pd = dynamic_cast<string_argument*>(parser.arg("output")->arg("diagnostic_file"));
  std::string chain_output_name = p->value() + "." + "mpi." + std::to_string(comm.rank());
  p->set_value(chain_output_name);
  // Only rewrite diagnostic filename if one is given
  if (pd->value().length() > 0) {
    std::string chain_diagnostic_output_name = pd->value() + "." + "mpi." + std::to_string(comm.rank());
    pd->set_value(chain_diagnostic_output_name);
  }
}
Is there a way I can detect that only 1 process is running so I can tell it to use the non-MPI naming pattern?