I’ve been using @seantalts's code in the “perf” branch of cmdstan to do some performance tests, and it seems like thinLTO (a clang feature that tries to get most of the benefit of LTO without much impact on compile time) gives a noticeable improvement.
Since I’m new around here, my questions are:
(1) What’s the standard of evidence used to evaluate performance changes?
(2) Is there a standard set of benchmarks that should be run?
Using @seantalts's scripts, here’s the ratio of runtimes for a variety of examples that are included in cmdstan (normal / thinLTO, so > 1 means some gain for thinLTO, < 1 means some loss). Each example was run 100 times and the runtimes were averaged. On (2), there are probably better benchmarks to run, but these were really easy to run using @seantalts's script, so I used them.
('examples/example-models/bugs_examples/vol1/inhalers/inhalers.stan', 1.05)
('examples/example-models/bugs_examples/vol1/dyes/dyes.stan', 1.06)
('examples/example-models/bugs_examples/vol1/litter/litter.stan', 1.05)
('examples/example-models/bugs_examples/vol1/litter/litter_old_param.stan', 1.05)
('examples/example-models/bugs_examples/vol1/seeds/seeds.stan', 1.03)
('examples/example-models/bugs_examples/vol1/seeds/seeds_centered.stan', 1.03)
('examples/example-models/bugs_examples/vol1/seeds/seeds_stanified.stan', 1.04)
('examples/example-models/bugs_examples/vol1/oxford/oxford.stan', 1.01)
('examples/example-models/bugs_examples/vol1/salm/salm.stan', 1.12)
('examples/example-models/bugs_examples/vol1/salm/salm2.stan', 1.11)
('examples/example-models/bugs_examples/vol1/bones/bones.stan', 1.05)
('examples/example-models/bugs_examples/vol1/equiv/equiv.stan', 1.07)
('examples/example-models/bugs_examples/vol1/surgical/surgical.stan', 1.04)
('examples/example-models/bugs_examples/vol1/surgical/surgical_stanified.stan', 1.04)
('examples/example-models/bugs_examples/vol1/pump/pump.stan', 1.05)
('examples/example-models/bugs_examples/vol1/epil/epil.stan', 1.11)
('examples/example-models/bugs_examples/vol1/stacks/stacks_e_dexp_ridge.stan', 1.04)
('examples/example-models/bugs_examples/vol1/stacks/stacks_a_normal.stan', 1.06)
('examples/example-models/bugs_examples/vol1/stacks/stacks_b_dexp.stan', 1.07)
('examples/example-models/bugs_examples/vol1/stacks/stacks_d_normal_ridge.stan', 1.06)
('examples/example-models/bugs_examples/vol1/stacks/stacks_c_t4.stan', 1.07)
('examples/example-models/bugs_examples/vol1/stacks/stacks_f_t4_ridge.stan', 1.05)
('examples/example-models/bugs_examples/vol1/blocker/blocker.stan', 1.03)
('examples/example-models/bugs_examples/vol1/leukfr/leukfr.stan', 1.07)
('examples/example-models/bugs_examples/vol1/lsat/lsat.stan', 0.99)
('examples/example-models/bugs_examples/vol1/kidney/kidney.stan', 1.04)
('examples/example-models/bugs_examples/vol1/magnesium/magnesium.stan', 1.03)
('examples/example-models/bugs_examples/vol1/rats/rats.stan', 1.02)
('examples/example-models/bugs_examples/vol1/rats/rats_vec.stan', 1.05)
('examples/example-models/bugs_examples/vol1/rats/rats_vec_unit.stan', 1.05)
('examples/example-models/bugs_examples/vol1/leuk/leuk.stan', 1.06)
('examples/example-models/bugs_examples/vol1/mice/mice.stan', 1.02)
('examples/example-models/bugs_examples/vol3/data_cloning/seeds.stan', 1.06)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitis.stan', 1.07)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitisME.stan', 1.04)
The geometric mean of these ratios is 1.05.
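For anyone who wants to check the summary number, here is a minimal Python sketch of the arithmetic. The ratios are copied from the list above; the helper names `speedup_ratio` and `geometric_mean` are mine, not from @seantalts's scripts, and `speedup_ratio` only assumes that each per-model ratio is mean normal runtime divided by mean thinLTO runtime, as described above.

```python
import math


def speedup_ratio(baseline_times, candidate_times):
    """Mean baseline runtime / mean candidate runtime (> 1: candidate is faster).

    Hypothetical helper; assumes both inputs are non-empty lists of seconds.
    """
    mean_baseline = sum(baseline_times) / len(baseline_times)
    mean_candidate = sum(candidate_times) / len(candidate_times)
    return mean_baseline / mean_candidate


def geometric_mean(ratios):
    """Geometric mean computed in log space; all ratios must be positive."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# The 35 normal / thinLTO ratios reported above, in the same order.
RATIOS = [
    1.05, 1.06, 1.05, 1.05, 1.03, 1.03, 1.04, 1.01, 1.12, 1.11,
    1.05, 1.07, 1.04, 1.04, 1.05, 1.11, 1.04, 1.06, 1.07, 1.06,
    1.07, 1.05, 1.03, 1.07, 0.99, 1.04, 1.03, 1.02, 1.05, 1.05,
    1.06, 1.02, 1.06, 1.07, 1.04,
]

print(f"geometric mean over {len(RATIOS)} models: {geometric_mean(RATIOS):.2f}")
```

The geometric mean is used rather than the arithmetic mean because the inputs are ratios: it treats a 2x gain and a 2x loss symmetrically.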
This test was done on clang 6 (thinLTO was released with clang 3.9) on a 4.2 GHz Kaby Lake machine.
On (1), some reasons this benchmark might be invalid are:
(A) these examples may not be representative of “real” workloads. If this is the case, I can re-run with other benchmarks
(B) compile time wasn’t measured. Although the goal of thinLTO is to provide the benefit of LTO without significantly impacting compile time, it’s possible that compile time was significantly slowed down. I think that either @seantalts or I will edit his scripts to also measure compile time, so we should have numbers on this at some point
(C) clang 6.0 is recent and relatively few people are using such a new version of clang. It’s possible that there have been significant changes to thinLTO and that someone using an older clang may see less benefit.
(D) cmdstan may link things differently than pystan or rstan, which may impact the results.
(E) these results are running single-threaded and may not generalize to the multi-threaded case. For this optimization, I don’t see any reason to expect worse results when multi-threaded, but you never know.
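On (B), one way compile time could be measured is to time the build command itself for each configuration. This is only a rough sketch, not what the existing scripts do: `mean_wall_time` is a hypothetical helper, and the command passed to it is a placeholder that would need to be replaced by the real build invocation, with a clean step in between so each run actually rebuilds.

```python
import subprocess
import sys
import time


def mean_wall_time(cmd, repeats=3):
    """Run `cmd` `repeats` times and return the mean wall-clock time in seconds.

    `cmd` is a list of arguments as accepted by subprocess.run. The caller is
    responsible for making each run do real work (e.g. cleaning build outputs
    first); this helper only measures elapsed time.
    """
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,  # fail loudly if the command fails
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        total += time.perf_counter() - start
    return total / repeats


# Placeholder command so the sketch runs anywhere; swap in the real build
# command for the normal and thinLTO configurations.
placeholder_cmd = [sys.executable, "-c", "pass"]
print(f"mean wall time: {mean_wall_time(placeholder_cmd, repeats=2):.3f} s")
```

Dividing the thinLTO figure by the normal figure would give a compile-time ratio that can be reported next to the runtime ratios above.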
If these benchmarks are considered good enough, I’d be happy to submit a PR, but if there’s some canonical set of benchmarks I should run, I certainly don’t mind doing that.