Since there was no response here, I thought I’d post a preliminary benchmark for my warm-up procedure, evaluated on some models from posteriordb and some additional models (@yizhang’s chemical reaction model, the monster model, and @palmer’s bioenergetic ODE model).
Everything was run on my local notebook (which has 6 physical cores and enough RAM) with 6 parallel cores and default settings, except (only for the regular warm-up) for the eight schools model (`adapt_delta=.9`) and the chemical reaction model (custom inits). My warm-up uses default settings for all models and neither needs nor benefits substantially from custom inits; it avoids spurious modes on its own.
For all models I’ve run the regular and the custom warm-up both with a dense and with a diagonal metric, and for the comparison selected the better-performing metric (per posterior + method). Due to computational constraints I’ve only done a few (1-3) runs of the expensive models and more (<20) runs of the cheaper models.
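For reference, here is a minimal sketch of how one such (regular-warm-up) run can be set up with cmdstanpy; the file names, chain count and seed are placeholders for illustration, not my actual benchmark harness:

```python
from cmdstanpy import CmdStanModel

# Hypothetical file names; any posteriordb model/data pair is handled the same way.
model = CmdStanModel(stan_file="eight_schools.stan")

fit = model.sample(
    data="eight_schools.json",
    chains=6,             # one chain per physical core (placeholder choice)
    parallel_chains=6,
    metric="dense_e",     # the second set of runs uses "diag_e"
    adapt_delta=0.9,      # only for the regular warm-up of the eight schools model
    save_warmup=True,     # needed later to count warm-up leapfrog steps
    seed=1,
)
```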
For the comparison I only used runs that converged without issues (this was only an issue for the regular warm-up). For the datasets `mcycle_splines` and `mcycle_gp`, none of the regular runs were free of divergences (I’m assuming the models would work with a higher `adapt_delta`). I started the regular run of the Monster model this morning; after roughly 5 hours it stands at warm-up iteration 234, using a dense metric.
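The “converged without issues” filter is essentially the usual divergence / R-hat check; a sketch (again with cmdstanpy, and with a hypothetical R-hat threshold, not my exact criteria):

```python
def converged_without_issues(fit, rhat_max=1.01):
    # Hypothetical filter: no post-warmup divergences and all split R-hats below a threshold.
    n_divergent = fit.draws_pd(inc_warmup=False)["divergent__"].sum()
    rhats_ok = (fit.summary()["R_hat"] < rhat_max).all()
    return n_divergent == 0 and rhats_ok
```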
Personally, I care neither about cheap models (<1 s runtime) nor about marginal speed-ups. The cheap models are included as a check that sampling performance is not negatively affected by the custom warm-up procedure.
Although @avehtari suggested that warm-up wall times are distracting, I am including exactly these warm-up wall times, because the total number of leapfrog steps is difficult to compare directly: some of my leapfrog steps are much cheaper than the final ones. I also report the total number of (effective) leapfrog iterations during warm-up, which (for most models) correlates nicely with the warm-up wall times.
In summary, I report three metrics (higher is always worse), averaged for each posterior and method (a computational sketch follows the list):

- `max(sampling_leapfrog/neff)`: the total number of leapfrog steps across chains, divided by `stansummary`’s minimal `N_eff`.
- `warmup_wall_time`: the (maximal) warm-up wall time in seconds.
- `warmup_leapfrog/chain [k]`: the total number of effective leapfrog steps per chain, divided by a thousand.
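A rough sketch of how these three numbers can be computed for a single cmdstanpy fit (sampled with `save_warmup=True`); this is only an illustration of the definitions above, not my exact post-processing, and in particular it counts raw rather than effective warm-up leapfrog steps:

```python
def warmup_seconds(csv_file):
    # CmdStan appends "#  Elapsed Time: <x> seconds (Warm-up)" to each output CSV.
    with open(csv_file) as f:
        for line in f:
            if line.startswith("#") and "(Warm-up)" in line:
                return float(line.split("Elapsed Time:")[1].split("seconds")[0])
    return float("nan")


def report(fit):
    # Post-warmup leapfrog steps, summed over all draws and chains.
    sampling_lf = fit.draws_pd(inc_warmup=False)["n_leapfrog__"].sum()
    # stansummary's minimal N_eff over all reported quantities.
    min_neff = fit.summary()["N_Eff"].min()
    # Warm-up leapfrog steps = everything that is not a post-warmup draw.
    # (The table reports *effective* steps, i.e. weighted by their cost;
    # this sketch just counts raw steps.)
    warmup_lf = fit.draws_pd(inc_warmup=True)["n_leapfrog__"].sum() - sampling_lf
    # Maximal warm-up wall time over the per-chain output CSVs
    # (runset.csv_files lists them; adjust for your cmdstanpy version).
    wall = max(warmup_seconds(f) for f in fit.runset.csv_files)
    return {
        "max(sampling_leapfrog/neff)": sampling_lf / min_neff,
        "warmup_wall_time [s]": wall,
        "warmup_leapfrog/chain [k]": warmup_lf / fit.chains / 1e3,
    }
```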
For each metric I report `[regular] / [incremental] = [ratio] ([improvement] = 1 - incremental/regular)`; see the worked example after this list. The improvement (e.g. the speed-up)

- for `max(sampling_leapfrog/neff)` is usually positive and ranges from -17% to +99%,
- for `warmup_wall_time` is always positive and ranges from +15% to +97%, and
- for `warmup_leapfrog/chain [k]` is also always positive and ranges from +50% to +99%.
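As a concrete check of the format, here is the warm-up wall time entry of the eight schools row from the table below:

```python
regular, incremental = 0.075, 0.026        # warm-up wall times [s] for eight_schools
ratio = regular / incremental              # ≈ 2.9
improvement = 1 - incremental / regular    # ≈ 0.65, reported as (65%)
```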
The summarized data, sorted by regular wall time:
| dataset | max(sampling_leapfrog/neff) | warmup_wall_time [s] | warmup_leapfrog/chain [k] |
| --- | --- | --- | --- |
eight_schools | 32.3 / 31.5 = 1.0 (3%) | 0.075 / 0.026 = 2.9 (65%) | 12.4 / 3.0 = 4.2 (76%) |
sblrc | 71.2 / 11.7 = 6.1 (84%) | 0.11 / 0.086 = 1.2 (19%) | 16.9 / 2.9 = 5.8 (83%) |
garch | 20.0 / 22.1 = 0.91 (-10%) | 0.24 / 0.066 = 3.7 (73%) | 8.3 / 1.3 = 6.5 (85%) |
arma | 13.1 / 10.6 = 1.2 (19%) | 0.31 / 0.051 = 6.1 (84%) | 12.2 / 1.2 = 10.0 (90%) |
arK | 13.3 / 15.0 = 0.89 (-13%) | 0.29 / 0.11 = 2.6 (61%) | 8.5 / 2.1 = 4.1 (76%) |
kilpisjarvi_mod | 689 / 7.5 = 92.3 (99%) | 1.2 / 0.037 = 33.3 (97%) | 434 / 2.1 = 210 (100%) |
earnings | 9.0 / 7.7 = 1.2 (14%) | 1.3 / 0.098 = 13.0 (92%) | 45.7 / 1.6 = 28.1 (96%) |
gp_pois_regr | 276 / 323 = 0.85 (-17%) | 0.51 / 0.43 = 1.2 (15%) | 54.2 / 27.3 = 2.0 (50%) |
low_dim_gauss_mix | 11.7 / 8.4 = 1.4 (28%) | 1.7 / 0.51 = 3.2 (69%) | 6.6 / 1.6 = 4.2 (76%) |
dogs | 8.6 / 8.1 = 1.1 (6%) | 1.7 / 0.51 = 3.3 (69%) | 5.4 / 1.5 = 3.6 (72%) |
hudson_lynx_hare | 15.5 / 16.4 = 0.94 (-6%) | 4.9 / 1.0 = 4.7 (79%) | 16.1 / 5.8 = 2.8 (64%) |
diamonds | 34.6 / 18.2 = 1.9 (47%) | 3.3 / 0.54 = 6.1 (83%) | 39.6 / 3.2 = 12.4 (92%) |
mcycle_gp | inf / 5479 = inf (100%) | inf / 4.5 = inf (100%) | inf / 136 = inf (100%) |
sir | 11.4 / 12.6 = 0.91 (-10%) | 14.3 / 2.1 = 6.9 (86%) | 13.8 / 6.8 = 2.0 (51%) |
mcycle_splines | inf / 7499 = inf (100%) | inf / 34.7 = inf (100%) | inf / 999 = inf (100%) |
rstan_downloads | 174 / 162 = 1.1 (7%) | 26.2 / 9.5 = 2.7 (64%) | 154 / 52.3 = 3.0 (66%) |
chem_group | 21.0 / 17.7 = 1.2 (16%) | 74.9 / 19.0 = 3.9 (75%) | 7.4 / 2.8 = 2.7 (63%) |
radon_all | 192 / 207 = 0.92 (-8%) | 237 / 34.7 = 6.8 (85%) | 71.6 / 13.1 = 5.5 (82%) |
eight_fish | 788 / 40.6 = 19.4 (95%) | 6786 / 211 = 32.1 (97%) | 254 / 14.8 = 17.1 (94%) |
monster | inf / 337 = inf (100%) | inf / 803 = inf (100%) | inf / 37.7 = inf (100%) |
Cheers
PS: I’ll update this table once more results come in.