Preliminary benchmark: incremental and adaptive parallel warm-up

After no response here I thought I’d post a preliminary benchmark for my warm-up procedure, evaluated for some models from posteriordb and some additional models (@yizhang’s chemical reaction model, the monster model and @palmer’s bioenergetic ODE model).

Everything was run on my local notebook (which has 6 physical cores and enough RAM) with 6 parallel cores and default settings, except (only for the regular warm-up) for the eight schools model (adapt_delta=.9) and the chemical reaction model (custom inits). My warm-up uses default settings for all models and neither needs nor benefits substantially from custom inits, avoiding spurious modes on its own.

For all models I’ve run the regular and custom warm-up both with dense and with diagonal metrics, and for the comparison selected the better-performing metric (per posterior+method). Due to computational constraints I’ve only done a few (1-3) runs for the expensive models and more (<20) for the cheaper models.

For the comparison I only used runs which converged without issues (only an issue for the regular warm-up). For the datasets mcycle_splines and mcycle_gp, every regular run had divergences (I’m assuming the models would work with a higher adapt_delta). I started the regular run of the Monster model this morning, and after roughly 5 hours it stands at warm-up iteration 234, using a dense metric.

Personally, I neither care about cheap models (<1s runtime) nor about marginal speed-ups. The cheap models are included as a check that sampling performance is not negatively affected by the custom warm-up procedure.

Although @avehtari said

I am including exactly these distracting warm-up wall times because the total number of leapfrog steps is difficult to compare: some of my leapfrog steps are much cheaper than the final ones. I also report the total number of (effective) leapfrog iterations during warm-up, which (for most models) correlates nicely with the warm-up wall times.

In summary, I report three metrics (higher is always worse), averaged for each posterior and method:

  • max(sampling_leapfrog/neff): The total number of leapfrog steps across chains divided by stansummary's minimal N_eff.
  • warmup_wall_time: The (maximal) warm-up wall time in seconds.
  • warmup_leapfrog/chain [k]: The total number of effective leapfrog steps per chain, in thousands.

For each metric I report [regular] / [incremental] = [ratio] ([improvement] = 1 - incremental/regular). The improvement (e.g. speed-up)

  • for max(sampling_leapfrog/neff) is usually positive and ranges from -17% to +99%,
  • for warmup_wall_time is always positive and ranges from +15% to +97%, and
  • for warmup_leapfrog/chain [k] is also always positive and ranges from +50% to +99%.
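For concreteness, the ratio and improvement columns in the table below are computed as follows (a trivial sketch; the function name is mine, not from any package):

```python
def compare(regular, incremental):
    """Compare one metric (higher is worse) between the regular and
    the incremental warm-up: return the ratio regular/incremental
    and the relative improvement 1 - incremental/regular."""
    ratio = regular / incremental
    improvement = 1 - incremental / regular
    return ratio, improvement

# e.g. the sblrc row for max(sampling_leapfrog/neff):
ratio, improvement = compare(71.2, 11.7)
print(f"{ratio:.1f} ({improvement:.0%})")  # 6.1 (84%)
```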

The summarized data, sorted by regular wall time:

| dataset | max(sampling_leapfrog/neff) | warmup_wall_time [s] | warmup_leapfrog/chain [k] |
|---|---|---|---|
| eight_schools | 32.3 / 31.5 = 1.0 (3%) | 0.075 / 0.026 = 2.9 (65%) | 12.4 / 3.0 = 4.2 (76%) |
| sblrc | 71.2 / 11.7 = 6.1 (84%) | 0.11 / 0.086 = 1.2 (19%) | 16.9 / 2.9 = 5.8 (83%) |
| garch | 20.0 / 22.1 = 0.91 (-10%) | 0.24 / 0.066 = 3.7 (73%) | 8.3 / 1.3 = 6.5 (85%) |
| arma | 13.1 / 10.6 = 1.2 (19%) | 0.31 / 0.051 = 6.1 (84%) | 12.2 / 1.2 = 10.0 (90%) |
| arK | 13.3 / 15.0 = 0.89 (-13%) | 0.29 / 0.11 = 2.6 (61%) | 8.5 / 2.1 = 4.1 (76%) |
| kilpisjarvi_mod | 689 / 7.5 = 92.3 (99%) | 1.2 / 0.037 = 33.3 (97%) | 434 / 2.1 = 210 (100%) |
| earnings | 9.0 / 7.7 = 1.2 (14%) | 1.3 / 0.098 = 13.0 (92%) | 45.7 / 1.6 = 28.1 (96%) |
| gp_pois_regr | 276 / 323 = 0.85 (-17%) | 0.51 / 0.43 = 1.2 (15%) | 54.2 / 27.3 = 2.0 (50%) |
| low_dim_gauss_mix | 11.7 / 8.4 = 1.4 (28%) | 1.7 / 0.51 = 3.2 (69%) | 6.6 / 1.6 = 4.2 (76%) |
| dogs | 8.6 / 8.1 = 1.1 (6%) | 1.7 / 0.51 = 3.3 (69%) | 5.4 / 1.5 = 3.6 (72%) |
| hudson_lynx_hare | 15.5 / 16.4 = 0.94 (-6%) | 4.9 / 1.0 = 4.7 (79%) | 16.1 / 5.8 = 2.8 (64%) |
| diamonds | 34.6 / 18.2 = 1.9 (47%) | 3.3 / 0.54 = 6.1 (83%) | 39.6 / 3.2 = 12.4 (92%) |
| mcycle_gp | inf / 5479 = inf (100%) | inf / 4.5 = inf (100%) | inf / 136 = inf (100%) |
| sir | 11.4 / 12.6 = 0.91 (-10%) | 14.3 / 2.1 = 6.9 (86%) | 13.8 / 6.8 = 2.0 (51%) |
| mcycle_splines | inf / 7499 = inf (100%) | inf / 34.7 = inf (100%) | inf / 999 = inf (100%) |
| rstan_downloads | 174 / 162 = 1.1 (7%) | 26.2 / 9.5 = 2.7 (64%) | 154 / 52.3 = 3.0 (66%) |
| chem_group | 21.0 / 17.7 = 1.2 (16%) | 74.9 / 19.0 = 3.9 (75%) | 7.4 / 2.8 = 2.7 (63%) |
| radon_all | 192 / 207 = 0.92 (-8%) | 237 / 34.7 = 6.8 (85%) | 71.6 / 13.1 = 5.5 (82%) |
| eight_fish | 788 / 40.6 = 19.4 (95%) | 6786 / 211 = 32.1 (97%) | 254 / 14.8 = 17.1 (94%) |
| monster | inf / 337 = inf (100%) | inf / 803 = inf (100%) | inf / 37.7 = inf (100%) |

Cheers

PS: I’ll update this table once more results come in.


Exciting results! Is your new warmup a variant of campfire or a wholly new approach?

BTW, if it’s helpful at all to have synthetic data, I’ve started posting some models here for the GSOC benchmarking/posteriorDB project. So far it’s just a few variants of the multivariate hierarchical models (akin to that presented in SUG 1.13, but without the group-level predictors (yet)), with the variants differing at the level of the observation model (binomial, gaussian, location-scale gaussian). The data-generation R code can yield data that fits best with either the centered or the non-centered parameterization, and both parameterizations are there as separate models. I should be adding group-level-predictors-only and both-individual-and-group-predictors models this weekend (as well as some GP stuff).


Congratulations!

Any plan for making it “ready for public consumption”? Or should we settle for the current warmup for a while?


Like campfire, I pool draws across chains to estimate the (co)variance (and have the associated problems with “truly” multimodal posteriors); however, unlike campfire (as I understood it), I do not (yet) aim for some minimal N_eff before starting sampling. I believe it is quite different from campfire.
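In outline, pooling draws across chains for the metric estimate looks roughly like this (a numpy sketch of the idea, not the actual implementation; `pooled_metric` is a made-up name):

```python
import numpy as np

def pooled_metric(draws, diagonal=True):
    """Estimate the (inverse) metric from draws pooled across chains.

    draws: array of shape (n_chains, n_iter, n_params).
    Pooling concatenates all chains before estimating, which is
    exactly what makes truly multimodal posteriors problematic:
    the pooled (co)variance then mixes the modes.
    """
    pooled = draws.reshape(-1, draws.shape[-1])  # (n_chains * n_iter, n_params)
    if diagonal:
        return np.var(pooled, axis=0, ddof=1)    # diagonal metric
    return np.cov(pooled, rowvar=False)          # dense metric

rng = np.random.default_rng(0)
draws = rng.normal(size=(6, 100, 3))             # 6 chains, 100 iters, 3 params
print(pooled_metric(draws).shape)                # (3,)
print(pooled_metric(draws, diagonal=False).shape)  # (3, 3)
```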

Yes, this would be very helpful. I’ve seen your repository, but didn’t include the models because I think that, for a fair comparison, in addition to the Stan model I would need a fixed dataset (JSON, please) and recommended sampler settings. The settings probably should not be the ultra-gold-standard settings as for posteriordb, but “just” regular “production” settings. (It annoys me that the default settings for mcycle_gp and mcycle_splines don’t work, which makes a comparison difficult.)

The most interesting thing IMO would be to have one model which can include both parametrizations (preferably being able to continuously interpolate between them) and then provide “recommended” settings for the regular warmup while my warmup adapts on its own.
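For reference, the standard way to interpolate continuously between the two is partial non-centering with a weight w in [0, 1] (w = 1 fully centered, w = 0 fully non-centered). A numpy sketch of the reparameterization, with made-up names, just to show that every w targets the same distribution:

```python
import numpy as np

def partially_noncentered(mu, sigma, w, rng, n=100_000):
    """Draw theta ~ Normal(mu, sigma) via a partially non-centered
    parameterization: the sampled parameter is
        eta ~ Normal(w * mu, sigma**w),
    and the implied model parameter is
        theta = mu + sigma**(1 - w) * (eta - w * mu),
    which recovers the centered case at w = 1 and the
    non-centered case at w = 0.
    """
    eta = rng.normal(w * mu, sigma**w, size=n)
    return mu + sigma**(1 - w) * (eta - w * mu)

rng = np.random.default_rng(1)
for w in (0.0, 0.5, 1.0):
    theta = partially_noncentered(2.0, 3.0, w, rng)
    print(w, round(theta.mean(), 1), round(theta.std(), 1))  # ~2.0, ~3.0 for every w
```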

More models are always better, but note that I originally developed it for ODE models (“few” parameters, little to medium amounts of data, expensive leapfrog iterations, and potential to tune the approximation). I’m surprised that it works so well out of the box for the other models. I’d actually be interested in optimizing it for GPs, but I have no experience whatsoever with them. I believe there is some room for improvement for GPs.

Hm, I won’t have time to publish it for at least two months. The whole code will need a complete overhaul/refactoring, as it has grown rather awkwardly. There are also some minor and major improvements that I want to implement, but which will need the refactoring.

Edit: Two months not because it’s so much code, but because I’ll be busy with other things.

As a very late response to @Funko_Unko’s question here: at the ongoing ACoP12 conference we have a poster showing a benchmark for cross-chain warmup and multiple(>4)-chain efficiency. acop_2021_poster.pdf (138.0 KB)

tl;dr
Combining cross-chain warmup with a large number (>4) of chains can be an efficient strategy to scale ESS/time. Even though warmup quality may suffer (though not significantly) when the number of chains is large, the reduced post-warmup sampling time makes up for this loss.
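The tradeoff can be illustrated with a toy model (all numbers below are made up for illustration and are not from the poster): warm-up cost is paid once since chains run in parallel, while the post-warmup wall time needed to hit a target ESS shrinks roughly as 1/n_chains:

```python
def ess_per_second(n_chains, target_ess=2000, warmup_s=60.0,
                   ess_per_chain_per_s=5.0):
    """Toy model of ESS/time scaling with the number of chains.

    Warm-up wall time is constant (chains run in parallel), while the
    sampling wall time to reach target_ess splits across chains.
    All default values are hypothetical.
    """
    sampling_s = target_ess / (ess_per_chain_per_s * n_chains)
    return target_ess / (warmup_s + sampling_s)

for n in (4, 8, 16, 32):
    print(n, round(ess_per_second(n), 1))
# 4 12.5, 8 18.2, 16 23.5, 32 27.6: throughput rises with chain count,
# with diminishing returns once warm-up time dominates.
```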
