I didn’t check, but my guess is that most likely `lp__` got picked up.

The reason I asked is that I’m curious whether this explains why a low adaptation N_eff target didn’t work.

If lp__ had a higher N_eff than something else, then we’d expect to have to set the target N_eff higher to get to the point where things worked.

Getting back to more testing of the algorithm on Torsten. The branch is available here:

```
git clone --recursive --branch cross_chain_warmup https://github.com/metrumresearchgroup/cmdstan.git
```

Below is the performance summary of a simple PK model using cross-chain warmup with target ESS=400. For this model, the current default target ESS=200 isn’t always sufficient.

With cross-chain warmup on top of Torsten’s parallel functions, I’m able to do 2-level parallelism: chains communicating with each other during warmup, plus within-chain parallel solution. Here I’m showing the performance of the chemical reactions model (all runs use 4 chains) solved by

- regular stan run (4 independent chains),
- 4-core cross-chain run (each chain solved by 1 core),
- 8-core cross-chain run (each chain solved by 2 cores),
- 16-core cross-chain run (each chain solved by 4 cores), and
- 32-core cross-chain run (each chain solved by 8 cores).

Since the model involves a population of size 8, the within-chain parallelization evenly distributes the 8 subjects over 1, 2, 4, or 8 cores. This setup improves speed at two levels:

- cross-chain warmup automatically terminates at `num_warmup=350`. Below is the ESS performance summary.

| MPI | nproc=4 | regular |
|---|---|---|
| warmup.leapfrogs | 1.222100e+04 | 2.959900e+04 |
| leapfrogs | 1.362400e+04 | 1.407600e+04 |
| mean.warmup.leapfrogs | 3.491714e+01 | 2.959900e+01 |
| mean.leapfrogs | 2.724800e+01 | 2.815200e+01 |
| min(bulk_ess/iter) | 1.708000e+00 | 1.452000e+00 |
| min(tail_ess/iter) | 2.184000e+00 | 2.276000e+00 |
| min(bulk_ess/leapfrog) | 6.268350e-02 | 5.157715e-02 |
| min(tail_ess/leapfrog) | 8.015267e-02 | 8.084683e-02 |

- within-chain parallelization speeds up the solution. Below is the raw wall time (s) comparison.

This is all very nice. Fewer leapfrogs in warmup, sampling just as efficient, and scaling across computers.
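To put rough numbers on that, the ratios can be read straight off the table above. This is only a sketch: the dictionary layout and function name are mine, the values are copied from the table, and the iteration counts are inferred by dividing totals by means rather than stated in the thread (whether they are per chain or pooled is not stated either).

```python
# Values copied from the ESS summary table above
# (cross-chain run with nproc=4 vs. regular run).
runs = {
    "nproc=4": {
        "warmup_leapfrogs": 1.222100e+04,
        "mean_warmup_leapfrogs": 3.491714e+01,
        "leapfrogs": 1.362400e+04,
        "mean_leapfrogs": 2.724800e+01,
        "min_bulk_ess_per_leapfrog": 6.268350e-02,
    },
    "regular": {
        "warmup_leapfrogs": 2.959900e+04,
        "mean_warmup_leapfrogs": 2.959900e+01,
        "leapfrogs": 1.407600e+04,
        "mean_leapfrogs": 2.815200e+01,
        "min_bulk_ess_per_leapfrog": 5.157715e-02,
    },
}


def implied_iterations(total_leapfrogs, mean_leapfrogs):
    """Iteration count implied by a total and a per-iteration mean (inferred, not reported)."""
    return round(total_leapfrogs / mean_leapfrogs)


warmup_iters = {
    name: implied_iterations(r["warmup_leapfrogs"], r["mean_warmup_leapfrogs"])
    for name, r in runs.items()
}
sampling_iters = {
    name: implied_iterations(r["leapfrogs"], r["mean_leapfrogs"])
    for name, r in runs.items()
}

# Fraction of the regular run's warmup leapfrogs used by the cross-chain run.
warmup_ratio = runs["nproc=4"]["warmup_leapfrogs"] / runs["regular"]["warmup_leapfrogs"]

# Relative sampling efficiency (min bulk ESS per leapfrog), cross-chain over regular.
bulk_eff_ratio = (
    runs["nproc=4"]["min_bulk_ess_per_leapfrog"]
    / runs["regular"]["min_bulk_ess_per_leapfrog"]
)

print("implied warmup iterations:", warmup_iters)
print("implied sampling iterations:", sampling_iters)
print(f"warmup leapfrog ratio: {warmup_ratio:.3f}")
print(f"bulk ESS/leapfrog ratio: {bulk_eff_ratio:.3f}")
```

The implied warmup length for the cross-chain run matches the `num_warmup=350` quoted above, and the cross-chain run uses well under half of the regular run's warmup leapfrogs.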

For the bulk_ess/iter numbers, how are those calculated? 1.7 effective samples per MCMC draw seems too high. Does that need to be divided by the number of chains?

Yes.
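As a quick check of that normalization, here is a small sketch. It assumes the table's `min(bulk_ess/iter)` is the ESS pooled over all 4 chains divided by the iteration count of a single chain; that is my reading of the exchange above, not something the thread spells out. The function name is hypothetical.

```python
N_CHAINS = 4  # all runs in this thread use 4 chains

# min(bulk_ess/iter) copied from the table above.
bulk_ess_per_iter = {"nproc=4": 1.708, "regular": 1.452}


def ess_per_draw(ess_per_iter, n_chains=N_CHAINS):
    """ESS per individual draw, assuming the input is pooled ESS per single-chain iteration."""
    return ess_per_iter / n_chains


per_draw = {name: ess_per_draw(value) for name, value in bulk_ess_per_iter.items()}

for name, value in per_draw.items():
    print(f"{name}: {value:.3f} effective samples per draw")
```

Under that assumption, the figures come out at roughly 0.43 and 0.36 effective samples per draw.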

Indeed! Cutting the number of leapfrog steps in half during warmup essentially doubles its speed (or at least cuts its resource usage in half). But what’s more amazing is that we seem to be getting better adaptation, because the speedup is more than you’d get from just doubling warmup speed, right? Or are these models heavily dominated by warmup?

Not really. For this model post-warmup sampling takes approximately the same amount of time for regular & cross-chain runs (`regular` vs `nproc=4` in the above plot). With additional cores the benefit of within-chain parallelization kicks in and run time is further reduced for both warmup & sampling (`nproc=8`, `nproc=16` & `nproc=32` in the above plot).
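For reference, the core layout behind those `nproc` labels can be tabulated with a few lines of Python. This is only an illustration of the arithmetic described earlier (4 chains, 8 subjects split evenly within each chain); the function name is hypothetical and has nothing to do with how Torsten actually assigns work.

```python
N_CHAINS = 4    # number of chains used in all runs
N_SUBJECTS = 8  # population size of the model


def layout(nproc, n_chains=N_CHAINS, n_subjects=N_SUBJECTS):
    """Return (cores per chain, subjects per core) for an even two-level split."""
    if nproc % n_chains != 0:
        raise ValueError("nproc must be a multiple of the number of chains")
    cores_per_chain = nproc // n_chains
    if n_subjects % cores_per_chain != 0:
        raise ValueError("subjects cannot be split evenly over the cores of a chain")
    return cores_per_chain, n_subjects // cores_per_chain


layouts = {nproc: layout(nproc) for nproc in (4, 8, 16, 32)}

for nproc, (cores, subjects) in layouts.items():
    print(f"nproc={nproc}: {cores} core(s) per chain, {subjects} subject(s) per core")
```

With `nproc=32` each core handles a single subject, so for this model there is no further even split available beyond that point.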

I am wondering how mature this all is and what the plans are to integrate it into the stan repositories (stan algorithm, MPI subsystem, MPI & threading)? This may sound impatient, but the results look so good that the community would benefit a lot from it.

New warmup strategies will likely need a bit of alignment here and there, which will be quite a process to go through, but it appears that the merits of this work make it absolutely worthwhile to go that extra mile.

I can certainly help/comment on the threading/MPI bits.

(bigger changes like this are hard to carry through - I speak out of my own recent experience)

IMO the best way to facilitate this effort now is for people to play with it on their own models. Once we have enough confidence in the algorithm we can move on to implementation details.

Is there already a plan as to what the bar is here?

The final goal is to provide users with principled guidance on how to use it. Along the way we’ll need to identify, for example, default values for target ESS & Rhat, as well as the optimal way of aggregating stepsize & metrics. Despite some success, the algorithm can also fail on simple models. Take the eight schools model for instance: the proposed algorithm can end up with a suboptimal metric/stepsize and significantly more divergences. The following summary is based on a rather large target_ESS (=800) and a comparison against regular stan runs with the same `num_warmup` (=600).

cross_chain_summary.pdf (74.8 KB)
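To make the role of the target ESS & Rhat concrete, here is a guess at the shape of such a termination check. It is not the branch's implementation. The assumptions are mine: the check is applied to `lp__` draws pooled across chains, Rhat is the classic (non-split, non-rank-normalized) Gelman-Rubin statistic, the ESS estimate is supplied by the caller, and the `target_rhat=1.05` default is a placeholder. Both function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def classic_rhat(chains):
    """Classic Gelman-Rubin Rhat for equal-length chains (requires nonzero within-chain variance)."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])   # W: mean within-chain variance
    between = n * variance(chain_means)            # B: between-chain variance
    var_plus = (n - 1) / n * within + between / n  # pooled variance estimate
    return sqrt(var_plus / within)


def warmup_can_stop(lp_chains, ess_estimate, target_ess=400, target_rhat=1.05):
    """Assumed rule: stop once chains agree (Rhat) and carry enough information (ESS)."""
    return classic_rhat(lp_chains) < target_rhat and ess_estimate >= target_ess


# Toy lp__ traces: two chains that agree, and two chains stuck in different places.
mixed = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
stuck = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]

print(warmup_can_stop(mixed, ess_estimate=450.0))  # both targets met
print(warmup_can_stop(mixed, ess_estimate=100.0))  # ESS target missed
print(warmup_can_stop(stuck, ess_estimate=450.0))  # Rhat target missed
```

In a check of this shape, raising `target_ess` or lowering `target_rhat` lengthens warmup, which is why the defaults matter for the trade-off discussed above.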

The performance on `sblrc-blr` from posteriordb looks promising. This is not a cherry-picked nice-looking run but a consistent outcome. Among the models I’ve tested, this one shows the most significant improvement in ESS.

Shorter warmup and more efficient sampling for both metrics. That is cool.

I’m attaching a poster by me, @billg, @bbbales2, and @avehtari for ACoP 11 last week. In that poster we show

- cross-chain warmup’s performance on a bunch of models from posteriordb.
- how cross-chain warmup can be combined with within-chain parallelization in a multi-level parallel framework.

WED-093_Yi_Zhang.pdf (399.0 KB)

Corresponding repo can be found at https://github.com/metrumresearchgroup/acop_2020_torsten_parallelization.

Based on the benchmarks in the study, I plan to add cross-chain warmup as an experimental feature in the next Torsten release.

Nice poster!

Is there any update on cross-chain ESS?

I’m not sure I understand the question. The benchmarks I’ve run show that the ESS is consistent with that from standard runs. Depending on the tuning parameters of the algorithm, it could be higher or lower.

Apologies, on mobile the discussion cut off way early in the thread. I am on desktop now and see all of the updates. Thanks!

Can I dig this up to ask @yizhang and @bbbales2 what kind of speedup (wall time till sampling starts) and efficiency gains (total #leapfrog steps during warmup) one can expect from this? Mainly for ODE models, but also for other models? I’d say only difficult models are interesting, i.e. models where warmup can take quite some time.

Edit: I guess it would be easiest to just try it myself. *However*, apparently I need a more recent version of stanc/math than included in the linked repo. I guess @stevebronder and @wds15 are actively working on the above algorithm? What’s the best way to get your working copy, and is it the same algorithm as proposed/evaluated in this thread?