With cross-chain warmup on top of Torsten’s parallel functions, I’m able to do 2-level parallelism: chains communicating with each other during warmup, and a within-chain parallel solution. Here I’m showing the performance of the chemical reactions model (all runs use 4 chains) solved by
- regular Stan run (4 independent chains),
- 4-core cross-chain run (each chain solved by 1 core),
- 8-core cross-chain run (each chain solved by 2 cores),
- 16-core cross-chain run (each chain solved by 4 cores), and
- 32-core cross-chain run (each chain solved by 8 cores).
Since the model involves a population of size 8, the within-chain parallelization evenly distributes the 8 subjects across 1, 2, 4, or 8 cores. This setup improves speed at two levels:
- cross-chain warmup automatically terminates at `num_warmup=350`. Below is the ESS performance summary.
MPI | nproc=4 | regular |
---|---|---|
warmup.leapfrogs | 1.222100e+04 | 2.959900e+04 |
leapfrogs | 1.362400e+04 | 1.407600e+04 |
mean.warmup.leapfrogs | 3.491714e+01 | 2.959900e+01 |
mean.leapfrogs | 2.724800e+01 | 2.815200e+01 |
min(bulk_ess/iter) | 1.708000e+00 | 1.452000e+00 |
min(tail_ess/iter) | 2.184000e+00 | 2.276000e+00 |
min(bulk_ess/leapfrog) | 6.268350e-02 | 5.157715e-02 |
min(tail_ess/leapfrog) | 8.015267e-02 | 8.084683e-02 |
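
As a quick sanity check on the table above, here is a minimal Python sketch that recomputes the derived rows from the raw ones. It assumes that ESS per leapfrog is ESS per iteration divided by the mean number of leapfrogs per iteration, and that the warmup length is total warmup leapfrogs divided by mean warmup leapfrogs. Both assumptions appear consistent with the reported numbers, but they are my reading of the table, not a statement of how it was produced. The dictionary keys and helper names are made up for illustration.

```python
# Values copied from the ESS summary table above (columns: nproc=4, regular).
runs = {
    "nproc=4": {
        "warmup_leapfrogs": 12221,
        "mean_warmup_leapfrogs": 34.91714,
        "mean_leapfrogs": 27.248,
        "min_bulk_ess_per_iter": 1.708,
        "min_tail_ess_per_iter": 2.184,
    },
    "regular": {
        "warmup_leapfrogs": 29599,
        "mean_warmup_leapfrogs": 29.599,
        "mean_leapfrogs": 28.152,
        "min_bulk_ess_per_iter": 1.452,
        "min_tail_ess_per_iter": 2.276,
    },
}


def per_leapfrog(ess_per_iter, mean_leapfrogs):
    """Assumed relation: ESS per leapfrog = (ESS per iter) / (leapfrogs per iter)."""
    return ess_per_iter / mean_leapfrogs


def implied_warmup_iters(run):
    """Warmup iterations implied by total and mean warmup leapfrogs."""
    return round(run["warmup_leapfrogs"] / run["mean_warmup_leapfrogs"])


# Fraction of the regular run's warmup leapfrogs used by the cross-chain run.
warmup_ratio = (
    runs["nproc=4"]["warmup_leapfrogs"] / runs["regular"]["warmup_leapfrogs"]
)

for name, run in runs.items():
    bulk = per_leapfrog(run["min_bulk_ess_per_iter"], run["mean_leapfrogs"])
    tail = per_leapfrog(run["min_tail_ess_per_iter"], run["mean_leapfrogs"])
    print(f"{name}: implied warmup iters = {implied_warmup_iters(run)}, "
          f"bulk ESS/leapfrog = {bulk:.5f}, tail ESS/leapfrog = {tail:.5f}")

print(f"cross-chain / regular warmup leapfrogs = {warmup_ratio:.3f}")
```

Read this way, the cross-chain run spends roughly 41% of the regular run's warmup leapfrogs, while the sampling-phase ESS per leapfrog stays in the same range for both runs.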
- within-chain parallelization speeds up the solution. Below is the raw wall time (in seconds) comparison.