Has anyone tested -O3 vs. -O2? From the thinLTO thread, it seems like only -O0 vs. -O3 has been tested. Since it doesn’t seem like there’s a consensus on benchmarks yet, I used the same benchmarks that I used for thinLTO and found a speedup (on average) when using -O2 instead of -O3 (same machine and compiler as before, 4.2 GHz Kaby Lake, clang 6). Full results below for the current settings as well as with thinLTO:
-O2 vs -O3, ratio of runtimes (O3 / O2):
('examples/example-models/bugs_examples/vol1/inhalers/inhalers.stan', 1.01)
('examples/example-models/bugs_examples/vol1/dyes/dyes.stan', 1.03)
('examples/example-models/bugs_examples/vol1/litter/litter.stan', 1.07)
('examples/example-models/bugs_examples/vol1/litter/litter_old_param.stan', 1.03)
('examples/example-models/bugs_examples/vol1/seeds/seeds.stan', 1.06)
('examples/example-models/bugs_examples/vol1/seeds/seeds_centered.stan', 1.09)
('examples/example-models/bugs_examples/vol1/seeds/seeds_stanified.stan', 1.06)
('examples/example-models/bugs_examples/vol1/oxford/oxford.stan', 1.06)
('examples/example-models/bugs_examples/vol1/salm/salm.stan', 0.99)
('examples/example-models/bugs_examples/vol1/salm/salm2.stan', 1.0)
('examples/example-models/bugs_examples/vol1/bones/bones.stan', 0.96)
('examples/example-models/bugs_examples/vol1/equiv/equiv.stan', 1.04)
('examples/example-models/bugs_examples/vol1/surgical/surgical.stan', 0.95)
('examples/example-models/bugs_examples/vol1/surgical/surgical_stanified.stan', 0.98)
('examples/example-models/bugs_examples/vol1/pump/pump.stan', 0.99)
('examples/example-models/bugs_examples/vol1/epil/epil.stan', 1.04)
('examples/example-models/bugs_examples/vol1/stacks/stacks_e_dexp_ridge.stan', 1.04)
('examples/example-models/bugs_examples/vol1/stacks/stacks_a_normal.stan', 0.99)
('examples/example-models/bugs_examples/vol1/stacks/stacks_b_dexp.stan', 0.98)
('examples/example-models/bugs_examples/vol1/stacks/stacks_d_normal_ridge.stan', 0.99)
('examples/example-models/bugs_examples/vol1/stacks/stacks_c_t4.stan', 0.98)
('examples/example-models/bugs_examples/vol1/stacks/stacks_f_t4_ridge.stan', 0.97)
('examples/example-models/bugs_examples/vol1/blocker/blocker.stan', 1.01)
('examples/example-models/bugs_examples/vol1/leukfr/leukfr.stan', 1.18)
('examples/example-models/bugs_examples/vol1/lsat/lsat.stan', 1.1)
('examples/example-models/bugs_examples/vol1/kidney/kidney.stan', 1.02)
('examples/example-models/bugs_examples/vol1/magnesium/magnesium.stan', 1.06)
('examples/example-models/bugs_examples/vol1/rats/rats.stan', 1.07)
('examples/example-models/bugs_examples/vol1/rats/rats_vec.stan', 1.26)
('examples/example-models/bugs_examples/vol1/rats/rats_vec_unit.stan', 1.07)
('examples/example-models/bugs_examples/vol1/leuk/leuk.stan', 1.33)
('examples/example-models/bugs_examples/vol1/mice/mice.stan', 1.02)
('examples/example-models/bugs_examples/vol3/data_cloning/seeds.stan', 1.01)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitis.stan', 1.31)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitisME.stan', 1.06)
Geometric mean is 1.05
With thinLTO enabled for both, ratio of runtimes (O3 / O2):
('examples/example-models/bugs_examples/vol1/inhalers/inhalers.stan', 1.08)
('examples/example-models/bugs_examples/vol1/dyes/dyes.stan', 1.03)
('examples/example-models/bugs_examples/vol1/litter/litter.stan', 1.05)
('examples/example-models/bugs_examples/vol1/litter/litter_old_param.stan', 1.04)
('examples/example-models/bugs_examples/vol1/seeds/seeds.stan', 1.09)
('examples/example-models/bugs_examples/vol1/seeds/seeds_centered.stan', 1.09)
('examples/example-models/bugs_examples/vol1/seeds/seeds_stanified.stan', 1.08)
('examples/example-models/bugs_examples/vol1/oxford/oxford.stan', 1.08)
('examples/example-models/bugs_examples/vol1/salm/salm.stan', 1.02)
('examples/example-models/bugs_examples/vol1/salm/salm2.stan', 1.03)
('examples/example-models/bugs_examples/vol1/bones/bones.stan', 1.04)
('examples/example-models/bugs_examples/vol1/equiv/equiv.stan', 1.09)
('examples/example-models/bugs_examples/vol1/surgical/surgical.stan', 1.02)
('examples/example-models/bugs_examples/vol1/surgical/surgical_stanified.stan', 1.01)
('examples/example-models/bugs_examples/vol1/pump/pump.stan', 1.02)
('examples/example-models/bugs_examples/vol1/epil/epil.stan', 1.11)
('examples/example-models/bugs_examples/vol1/stacks/stacks_e_dexp_ridge.stan', 1.1)
('examples/example-models/bugs_examples/vol1/stacks/stacks_a_normal.stan', 1.0)
('examples/example-models/bugs_examples/vol1/stacks/stacks_b_dexp.stan', 1.02)
('examples/example-models/bugs_examples/vol1/stacks/stacks_d_normal_ridge.stan', 1.01)
('examples/example-models/bugs_examples/vol1/stacks/stacks_c_t4.stan', 1.01)
('examples/example-models/bugs_examples/vol1/stacks/stacks_f_t4_ridge.stan', 1.02)
('examples/example-models/bugs_examples/vol1/blocker/blocker.stan', 1.01)
('examples/example-models/bugs_examples/vol1/leukfr/leukfr.stan', 1.04)
('examples/example-models/bugs_examples/vol1/lsat/lsat.stan', 1.13)
('examples/example-models/bugs_examples/vol1/kidney/kidney.stan', 1.07)
('examples/example-models/bugs_examples/vol1/magnesium/magnesium.stan', 1.12)
('examples/example-models/bugs_examples/vol1/rats/rats.stan', 1.1)
('examples/example-models/bugs_examples/vol1/rats/rats_vec.stan', 1.06)
('examples/example-models/bugs_examples/vol1/rats/rats_vec_unit.stan', 1.13)
('examples/example-models/bugs_examples/vol1/leuk/leuk.stan', 1.05)
('examples/example-models/bugs_examples/vol1/mice/mice.stan', 1.06)
('examples/example-models/bugs_examples/vol3/data_cloning/seeds.stan', 1.02)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitis.stan', 1.11)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitisME.stan', 1.18)
Geometric mean is 1.06
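For reference, the summary numbers are just the geometric mean of the per-model runtime ratios (the second element of each tuple); a minimal sketch:

```python
import math

def geometric_mean(ratios):
    # exp of the mean log is numerically nicer than multiplying everything out
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

ratios = [1.08, 1.03, 1.05, 1.04, 1.09]  # first few thinLTO ratios from above
print(geometric_mean(ratios))            # the full list gives ~1.06
```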
Unlike before, I only did 20 runs of each benchmark because it takes a pretty long time to do 100 runs :-). I didn’t try running multiple chains, but if I had to bet, I’d bet the performance delta increases when running multi-threaded.
If it seems weird that -O2 could be faster than -O3, an intuition for why this is somewhat common is that -O3 does a lot of optimizations that increase code size (e.g., more aggressive loop unrolling), which almost always increases performance in microbenchmarks but can sometimes decrease performance in workloads with a larger code footprint.
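One cheap way to sanity-check the code-size part of that intuition is to compare the text-segment sizes of the -O2 and -O3 binaries, e.g. with something like this (the binary names here are just hypothetical placeholders, not the actual build outputs):

```python
import subprocess

def text_size(binary):
    # GNU `size` prints a header line, then "text data bss dec hex filename";
    # grab the text (code) column from the second line.
    out = subprocess.run(["size", binary], capture_output=True, text=True,
                         check=True).stdout.splitlines()[1]
    return int(out.split()[0])

# Hypothetical names for the -O2 and -O3 builds of the same model.
print(text_size("hepatitisME_O2"), text_size("hepatitisME_O3"))
```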
If this is what’s going on here, we should be able to see it by looking at perf counters. I just spot-checked a single benchmark (bugs_examples/vol3/hepatitis/hepatitisME) and looked at the frontend_retired.l1i_miss counter on six runs:
thinLTO, O2 (lower is better, in that it indicates fewer icache misses):
94,653,414
93,496,761
89,757,237
94,649,405
94,649,405
97,344,701
thinLTO, O3:
105,720,505
105,720,505
102,103,298
117,330,631
105,744,625
100,865,739
Rather than try to look at more counters by hand, it probably makes more sense to change @seantalts’s perf script to report results from something like perf or likwid for counters that we might want to query.
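A rough sketch of what that could look like (the counter list, output parsing, and example invocation are assumptions on my part, not the actual script):

```python
import subprocess

COUNTERS = ["frontend_retired.l1i_miss", "instructions", "cycles"]

def run_with_counters(cmd):
    # `perf stat -x,` emits CSV-ish lines ("value,unit,event,...") on stderr.
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(COUNTERS)] + cmd,
        capture_output=True, text=True, check=True)
    counts = {}
    for line in result.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])
    return counts

# e.g. run_with_counters(["./hepatitisME", "sample"])
```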
IMO, the other big thing that’s missing here is results from gcc, which is another thing we can easily get with some small changes to @seantalts’s script.
It should be possible to get even better results by selectively enabling the optimizations that are most useful. It’s possible to do this by hand, but I suspect it will be easier, and that we’ll get better results, by using PGO. Unless someone else is already doing this, I’ll probably mess with it later this week or this weekend.
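For concreteness, this is roughly what an instrumentation-based PGO cycle with clang looks like, sketched as a script; the file names and the model invocation are placeholders, not the actual CmdStan build:

```python
import os
import subprocess

def run(cmd, **kwargs):
    subprocess.run(cmd, check=True, **kwargs)

# 1. Build an instrumented binary (placeholder source/output names).
run(["clang++", "-O3", "-fprofile-instr-generate", "model.cpp", "-o", "model_instr"])

# 2. Run a representative workload; the raw profile goes to LLVM_PROFILE_FILE.
run(["./model_instr", "sample"],
    env={**os.environ, "LLVM_PROFILE_FILE": "model.profraw"})

# 3. Merge the raw profile and rebuild using it.
run(["llvm-profdata", "merge", "-output=model.profdata", "model.profraw"])
run(["clang++", "-O3", "-fprofile-instr-use=model.profdata", "model.cpp", "-o", "model_pgo"])
```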
I’m not proposing changing the default to -O2, but I figured it was worth posting preliminary results to start the discussion, in case there are known cases where -O3 totally demolishes -O2 or there are other things to consider.