Preliminary benchmark: incremental and adaptive parallel warm-up

After no response here I thought I’d post a preliminary benchmark for my warm-up procedure, evaluated for some models from posteriordb and some additional models (@yizhang’s chemical reaction model, the monster model and @palmer’s bioenergetic ODE model).

Everything was run on my local notebook (which has 6 physical cores and enough RAM) with 6 parallel cores and default settings, except (only for the regular warm-up) for the eight schools model (adapt_delta=.9) and the chemical reaction model (custom inits). My warm-up uses default settings for all models and neither needs nor benefits substantially from custom inits, avoiding spurious modes on its own.

For all models I’ve run the regular and custom warm-up both with dense and with diagonal metrics, and for the comparison selected the better-performing metric (per posterior+method). Due to computational constraints I’ve only done a few (1-3) runs for the expensive models and more (<20) for the cheaper models.

For the comparison I only used runs which converged without issues (only an issue for the regular warm-up). For the datasets mcycle_splines and mcycle_gp, every regular run had divergences (I’m assuming the models would work with a higher adapt_delta). I started the regular run of the Monster model this morning, and after roughly 5 hours it stands at warm-up iteration 234, using a dense metric.

Personally, I neither care about cheap models (<1s runtime) nor about marginal speed-ups. The cheap models are included as a check that sampling performance is not negatively affected by the custom warm-up procedure.

Although @avehtari said

I am including exactly these distracting warm-up wall times because the total number of leapfrog steps is difficult to compare: some of my leapfrog steps are much cheaper than the final ones. I also report the total number of (effective) leapfrog iterations during warm-up, which (for most models) correlates nicely with the warm-up wall times.

In summary, I report three metrics (higher is always worse), averaged for each posterior and method:

  • max(sampling_leapfrog/neff): The total number of leapfrog steps across chains divided by stansummary's minimal N_eff.
  • warmup_wall_time: The (maximal) warm-up wall time in seconds.
  • warmup_leapfrog/chain [k]: The total number of effective leapfrog steps per chain, in thousands.

For each metric I report [regular] / [incremental] = [ratio] ([improvement] = 1 - incremental/regular). The improvement (e.g. speed-up)

  • for max(sampling_leapfrog/neff) is usually positive and ranges from -17% to +99%,
  • for warmup_wall_time is always positive and ranges from +15% to +97%, and
  • for warmup_leapfrog/chain [k] is also always positive and ranges from +50% to +99%.
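For concreteness, the ratio and improvement columns in the table below are computed as follows (a trivial sketch; the function name is mine, not from any package):

```python
def compare(regular, incremental):
    """Compare one metric (higher is worse) between the regular and
    the incremental warm-up: return the ratio regular/incremental
    and the relative improvement 1 - incremental/regular."""
    ratio = regular / incremental
    improvement = 1 - incremental / regular
    return ratio, improvement

# e.g. the sblrc row for max(sampling_leapfrog/neff):
ratio, improvement = compare(71.2, 11.7)
print(f"{ratio:.1f} ({improvement:.0%})")  # 6.1 (84%)
```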

The summarized data, sorted by regular wall time:

| dataset | max(sampling_leapfrog/neff) | warmup_wall_time [s] | warmup_leapfrog/chain [k] |
|---|---|---|---|
| eight_schools | 32.3 / 31.5 = 1.0 (3%) | 0.075 / 0.026 = 2.9 (65%) | 12.4 / 3.0 = 4.2 (76%) |
| sblrc | 71.2 / 11.7 = 6.1 (84%) | 0.11 / 0.086 = 1.2 (19%) | 16.9 / 2.9 = 5.8 (83%) |
| garch | 20.0 / 22.1 = 0.91 (-10%) | 0.24 / 0.066 = 3.7 (73%) | 8.3 / 1.3 = 6.5 (85%) |
| arma | 13.1 / 10.6 = 1.2 (19%) | 0.31 / 0.051 = 6.1 (84%) | 12.2 / 1.2 = 10.0 (90%) |
| arK | 13.3 / 15.0 = 0.89 (-13%) | 0.29 / 0.11 = 2.6 (61%) | 8.5 / 2.1 = 4.1 (76%) |
| kilpisjarvi_mod | 689 / 7.5 = 92.3 (99%) | 1.2 / 0.037 = 33.3 (97%) | 434 / 2.1 = 210 (100%) |
| earnings | 9.0 / 7.7 = 1.2 (14%) | 1.3 / 0.098 = 13.0 (92%) | 45.7 / 1.6 = 28.1 (96%) |
| gp_pois_regr | 276 / 323 = 0.85 (-17%) | 0.51 / 0.43 = 1.2 (15%) | 54.2 / 27.3 = 2.0 (50%) |
| low_dim_gauss_mix | 11.7 / 8.4 = 1.4 (28%) | 1.7 / 0.51 = 3.2 (69%) | 6.6 / 1.6 = 4.2 (76%) |
| dogs | 8.6 / 8.1 = 1.1 (6%) | 1.7 / 0.51 = 3.3 (69%) | 5.4 / 1.5 = 3.6 (72%) |
| hudson_lynx_hare | 15.5 / 16.4 = 0.94 (-6%) | 4.9 / 1.0 = 4.7 (79%) | 16.1 / 5.8 = 2.8 (64%) |
| diamonds | 34.6 / 18.2 = 1.9 (47%) | 3.3 / 0.54 = 6.1 (83%) | 39.6 / 3.2 = 12.4 (92%) |
| mcycle_gp | inf / 5479 = inf (100%) | inf / 4.5 = inf (100%) | inf / 136 = inf (100%) |
| sir | 11.4 / 12.6 = 0.91 (-10%) | 14.3 / 2.1 = 6.9 (86%) | 13.8 / 6.8 = 2.0 (51%) |
| mcycle_splines | inf / 7499 = inf (100%) | inf / 34.7 = inf (100%) | inf / 999 = inf (100%) |
| rstan_downloads | 174 / 162 = 1.1 (7%) | 26.2 / 9.5 = 2.7 (64%) | 154 / 52.3 = 3.0 (66%) |
| chem_group | 21.0 / 17.7 = 1.2 (16%) | 74.9 / 19.0 = 3.9 (75%) | 7.4 / 2.8 = 2.7 (63%) |
| radon_all | 192 / 207 = 0.92 (-8%) | 237 / 34.7 = 6.8 (85%) | 71.6 / 13.1 = 5.5 (82%) |
| eight_fish | 788 / 40.6 = 19.4 (95%) | 6786 / 211 = 32.1 (97%) | 254 / 14.8 = 17.1 (94%) |
| monster | inf / 337 = inf (100%) | inf / 803 = inf (100%) | inf / 37.7 = inf (100%) |

Cheers

PS: I’ll update this table once more results come in.


Exciting results! Is your new warmup a variant of campfire or a wholly new approach?

BTW, if it’s helpful at all to have synthetic data, I’ve started posting some models here for the GSOC benchmarking/posteriorDB project. So far it’s just a few variants of the multivariate hierarchical models (akin to that presented in SUG 1.13, but without the group-level predictors (yet)), with the variants differing at the level of the observation model (binomial, gaussian, location-scale gaussian). The data-generation R code can yield data that fits best with either the centered or the non-centered parameterization, and both parameterizations are there as separate models. I should be adding group-level-predictors-only and both-individual-and-group-predictors models this weekend (as well as some GP stuff).


Congratulations!

Any plan for making it “ready for public consumption”? Or should we settle for the current warmup for a while?


Like campfire, I pool draws across chains to estimate the (co)variance (and have the associated problems with “truly” multimodal posteriors); however, unlike campfire (as I understood it), I do not (yet) aim for some minimal N_eff before starting sampling. I believe it is quite different from campfire.
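In outline, pooling draws across chains for the metric estimate looks roughly like this (a numpy sketch of the idea, not the actual implementation; `pooled_metric` is a made-up name):

```python
import numpy as np

def pooled_metric(draws, diagonal=True):
    """Estimate the (inverse) metric from draws pooled across chains.

    draws: array of shape (n_chains, n_iter, n_params).
    Pooling concatenates all chains before estimating, which is
    exactly what makes truly multimodal posteriors problematic:
    the pooled (co)variance then mixes the modes.
    """
    pooled = draws.reshape(-1, draws.shape[-1])  # (n_chains * n_iter, n_params)
    if diagonal:
        return np.var(pooled, axis=0, ddof=1)    # diagonal metric
    return np.cov(pooled, rowvar=False)          # dense metric

rng = np.random.default_rng(0)
draws = rng.normal(size=(6, 100, 3))             # 6 chains, 100 iters, 3 params
print(pooled_metric(draws).shape)                # (3,)
print(pooled_metric(draws, diagonal=False).shape)  # (3, 3)
```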

Yes, this would be very helpful. I’ve seen your repository, but didn’t include the models because I think that, for a fair comparison, in addition to the Stan model I would need a fixed dataset (JSON, please) and recommended sampler settings. The settings probably should not be the ultra-gold-standard settings as for posteriordb, but “just” regular “production” settings. (It annoys me that the default settings for mcycle_gp and mcycle_splines don’t work, which makes a comparison difficult.)

The most interesting thing IMO would be to have one model which can include both parametrizations (preferably being able to continuously interpolate between them) and then provide “recommended” settings for the regular warmup while my warmup adapts on its own.
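For reference, the standard way to interpolate continuously between the two is partial non-centering with a weight w in [0, 1] (w = 1 fully centered, w = 0 fully non-centered). A numpy sketch of the reparameterization, with made-up names, just to show that every w targets the same distribution:

```python
import numpy as np

def partially_noncentered(mu, sigma, w, rng, n=100_000):
    """Draw theta ~ Normal(mu, sigma) via a partially non-centered
    parameterization: the sampled parameter is
        eta ~ Normal(w * mu, sigma**w),
    and the implied model parameter is
        theta = mu + sigma**(1 - w) * (eta - w * mu),
    which recovers the centered case at w = 1 and the
    non-centered case at w = 0.
    """
    eta = rng.normal(w * mu, sigma**w, size=n)
    return mu + sigma**(1 - w) * (eta - w * mu)

rng = np.random.default_rng(1)
for w in (0.0, 0.5, 1.0):
    theta = partially_noncentered(2.0, 3.0, w, rng)
    print(w, round(theta.mean(), 1), round(theta.std(), 1))  # ~2.0, ~3.0 for every w
```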

More models are always better, but note that I originally developed it for ODE models (“few” parameters, little to medium amounts of data, expensive leapfrog iterations, and potential to tune the approximation). I’m surprised that it works so well out of the box for the other models. I’d actually be interested in optimizing it for GPs, but I have no experience whatsoever with them. I believe there is some room for improvement for GPs.

Hm, I won’t have time to publish it for at least two months. The whole code will need a complete overhaul/refactoring, as it has grown rather awkwardly. There are also some minor and major improvements that I want to implement, but which will need the refactoring.

Edit: Two months not because it’s so much code, but because I’ll be busy with other things.

As a very late response to @Funko_Unko’s question here: at the ongoing ACoP12 conference we have a poster showing a benchmark for cross-chain warmup and multiple(>4)-chain efficiency. acop_2021_poster.pdf (138.0 KB)

tl;dr
Combining cross-chain warmup with a large number (>4) of chains can be an efficient strategy to scale ESS/time. Even though warmup quality may suffer (though not significantly) when the number of chains is large, the reduced post-warmup sampling time makes up for this loss.
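The tradeoff can be illustrated with a toy model (all numbers below are made up for illustration and are not from the poster): warm-up cost is paid once since chains run in parallel, while the post-warmup wall time needed to hit a target ESS shrinks roughly as 1/n_chains:

```python
def ess_per_second(n_chains, target_ess=2000, warmup_s=60.0,
                   ess_per_chain_per_s=5.0):
    """Toy model of ESS/time scaling with the number of chains.

    Warm-up wall time is constant (chains run in parallel), while the
    sampling wall time to reach target_ess splits across chains.
    All default values are hypothetical.
    """
    sampling_s = target_ess / (ess_per_chain_per_s * n_chains)
    return target_ess / (warmup_s + sampling_s)

for n in (4, 8, 16, 32):
    print(n, round(ess_per_second(n), 1))
# 4 12.5, 8 18.2, 16 23.5, 32 27.6: throughput rises with chain count,
# with diminishing returns once warm-up time dominates.
```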
