-O2 vs -O3 compiler optimization level

Has anyone tested -O3 vs. -O2? From the thinLTO thread, it seems like only -O0 vs. -O3 has been tested. Since there doesn’t seem to be a consensus on benchmarks yet, I used the same benchmarks I used for thinLTO and found a speedup (on average) when using -O2 instead of -O3 (same machine and compiler as before: 4.2 GHz Kaby Lake, clang 6). Full results below for the current settings as well as with thinLTO:

-O2 vs -O3, ratio of runtimes (O3 / O2):

('examples/example-models/bugs_examples/vol1/inhalers/inhalers.stan', 1.01)
('examples/example-models/bugs_examples/vol1/dyes/dyes.stan', 1.03)
('examples/example-models/bugs_examples/vol1/litter/litter.stan', 1.07)
('examples/example-models/bugs_examples/vol1/litter/litter_old_param.stan', 1.03)
('examples/example-models/bugs_examples/vol1/seeds/seeds.stan', 1.06)
('examples/example-models/bugs_examples/vol1/seeds/seeds_centered.stan', 1.09)
('examples/example-models/bugs_examples/vol1/seeds/seeds_stanified.stan', 1.06)
('examples/example-models/bugs_examples/vol1/oxford/oxford.stan', 1.06)
('examples/example-models/bugs_examples/vol1/salm/salm.stan', 0.99)
('examples/example-models/bugs_examples/vol1/salm/salm2.stan', 1.0)
('examples/example-models/bugs_examples/vol1/bones/bones.stan', 0.96)
('examples/example-models/bugs_examples/vol1/equiv/equiv.stan', 1.04)
('examples/example-models/bugs_examples/vol1/surgical/surgical.stan', 0.95)
('examples/example-models/bugs_examples/vol1/surgical/surgical_stanified.stan', 0.98)
('examples/example-models/bugs_examples/vol1/pump/pump.stan', 0.99)
('examples/example-models/bugs_examples/vol1/epil/epil.stan', 1.04)
('examples/example-models/bugs_examples/vol1/stacks/stacks_e_dexp_ridge.stan', 1.04)
('examples/example-models/bugs_examples/vol1/stacks/stacks_a_normal.stan', 0.99)
('examples/example-models/bugs_examples/vol1/stacks/stacks_b_dexp.stan', 0.98)
('examples/example-models/bugs_examples/vol1/stacks/stacks_d_normal_ridge.stan', 0.99)
('examples/example-models/bugs_examples/vol1/stacks/stacks_c_t4.stan', 0.98)
('examples/example-models/bugs_examples/vol1/stacks/stacks_f_t4_ridge.stan', 0.97)
('examples/example-models/bugs_examples/vol1/blocker/blocker.stan', 1.01)
('examples/example-models/bugs_examples/vol1/leukfr/leukfr.stan', 1.18)
('examples/example-models/bugs_examples/vol1/lsat/lsat.stan', 1.1)
('examples/example-models/bugs_examples/vol1/kidney/kidney.stan', 1.02)
('examples/example-models/bugs_examples/vol1/magnesium/magnesium.stan', 1.06)
('examples/example-models/bugs_examples/vol1/rats/rats.stan', 1.07)
('examples/example-models/bugs_examples/vol1/rats/rats_vec.stan', 1.26)
('examples/example-models/bugs_examples/vol1/rats/rats_vec_unit.stan', 1.07)
('examples/example-models/bugs_examples/vol1/leuk/leuk.stan', 1.33)
('examples/example-models/bugs_examples/vol1/mice/mice.stan', 1.02)
('examples/example-models/bugs_examples/vol3/data_cloning/seeds.stan', 1.01)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitis.stan', 1.31)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitisME.stan', 1.06)

Geometric mean is 1.05

With thinLTO enabled for both, ratio of runtimes (O3 / O2):

('examples/example-models/bugs_examples/vol1/inhalers/inhalers.stan', 1.08)
('examples/example-models/bugs_examples/vol1/dyes/dyes.stan', 1.03)
('examples/example-models/bugs_examples/vol1/litter/litter.stan', 1.05)
('examples/example-models/bugs_examples/vol1/litter/litter_old_param.stan', 1.04)
('examples/example-models/bugs_examples/vol1/seeds/seeds.stan', 1.09)
('examples/example-models/bugs_examples/vol1/seeds/seeds_centered.stan', 1.09)
('examples/example-models/bugs_examples/vol1/seeds/seeds_stanified.stan', 1.08)
('examples/example-models/bugs_examples/vol1/oxford/oxford.stan', 1.08)
('examples/example-models/bugs_examples/vol1/salm/salm.stan', 1.02)
('examples/example-models/bugs_examples/vol1/salm/salm2.stan', 1.03)
('examples/example-models/bugs_examples/vol1/bones/bones.stan', 1.04)
('examples/example-models/bugs_examples/vol1/equiv/equiv.stan', 1.09)
('examples/example-models/bugs_examples/vol1/surgical/surgical.stan', 1.02)
('examples/example-models/bugs_examples/vol1/surgical/surgical_stanified.stan', 1.01)
('examples/example-models/bugs_examples/vol1/pump/pump.stan', 1.02)
('examples/example-models/bugs_examples/vol1/epil/epil.stan', 1.11)
('examples/example-models/bugs_examples/vol1/stacks/stacks_e_dexp_ridge.stan', 1.1)
('examples/example-models/bugs_examples/vol1/stacks/stacks_a_normal.stan', 1.0)
('examples/example-models/bugs_examples/vol1/stacks/stacks_b_dexp.stan', 1.02)
('examples/example-models/bugs_examples/vol1/stacks/stacks_d_normal_ridge.stan', 1.01)
('examples/example-models/bugs_examples/vol1/stacks/stacks_c_t4.stan', 1.01)
('examples/example-models/bugs_examples/vol1/stacks/stacks_f_t4_ridge.stan', 1.02)
('examples/example-models/bugs_examples/vol1/blocker/blocker.stan', 1.01)
('examples/example-models/bugs_examples/vol1/leukfr/leukfr.stan', 1.04)
('examples/example-models/bugs_examples/vol1/lsat/lsat.stan', 1.13)
('examples/example-models/bugs_examples/vol1/kidney/kidney.stan', 1.07)
('examples/example-models/bugs_examples/vol1/magnesium/magnesium.stan', 1.12)
('examples/example-models/bugs_examples/vol1/rats/rats.stan', 1.1)
('examples/example-models/bugs_examples/vol1/rats/rats_vec.stan', 1.06)
('examples/example-models/bugs_examples/vol1/rats/rats_vec_unit.stan', 1.13)
('examples/example-models/bugs_examples/vol1/leuk/leuk.stan', 1.05)
('examples/example-models/bugs_examples/vol1/mice/mice.stan', 1.06)
('examples/example-models/bugs_examples/vol3/data_cloning/seeds.stan', 1.02)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitis.stan', 1.11)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitisME.stan', 1.18)

Geometric mean is 1.06 (1.05970531051)
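For anyone who wants to reproduce the aggregation, the geometric mean of per-model runtime ratios can be computed along these lines (a minimal sketch; the `results` list is a hypothetical stand-in for the (model, O3/O2) pairs printed above):

```python
import math

# Hypothetical stand-in for the (model, O3/O2 runtime ratio) pairs above.
results = [
    ("inhalers.stan", 1.08),
    ("dyes.stan", 1.03),
    ("hepatitisME.stan", 1.18),
]

# Geometric mean of the ratios: exponentiate the mean of the logs.
ratios = [r for _, r in results]
geo_mean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
print(round(geo_mean, 2))
```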

Unlike before, I only did 20 runs of each benchmark because it takes a pretty long time to do 100 runs :-). I didn’t try running multiple chains, but if I had to bet, I’d bet that the performance delta should increase when multi-threaded.

If it seems weird that -O2 could be faster than -O3, an intuition for why this is somewhat common is that -O3 applies a lot of optimizations that increase code size (e.g., more aggressive loop unrolling). These almost always increase performance in microbenchmarks, but can sometimes decrease performance in workloads with a larger code footprint, where the extra code puts more pressure on the instruction cache.
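A quick (if rough) sanity check of that intuition is just to compare the sizes of the -O2 and -O3 builds of the same model. A minimal sketch, with hypothetical paths for the two builds (overall file size also includes things like debug info, so it's only a proxy for code footprint):

```python
import os

# Hypothetical paths to the same model binary built at -O2 and at -O3.
o2_binary = "build-O2/hepatitisME"
o3_binary = "build-O3/hepatitisME"

o2_size = os.path.getsize(o2_binary)
o3_size = os.path.getsize(o3_binary)
print(f"-O2: {o2_size} bytes, -O3: {o3_size} bytes, O3/O2 size ratio: {o3_size / o2_size:.2f}")
```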

If this is what’s going on here, we should be able to confirm it by looking at perf counters. I just spot-checked a single benchmark (bugs_examples/vol3/hepatitis/hepatitisME) and looked at the frontend_retired.l1i_miss counter over six runs:

thinLTO, O2 (lower is better, in that it indicates fewer icache misses):

94,653,414
93,496,761
89,757,237
94,649,405
94,649,405
97,344,701

thinLTO, O3:

105,720,505
105,720,505
102,103,298
117,330,631
105,744,625
100,865,739

Rather than trying to look at more counters by hand, it probably makes more sense to change @seantalts’s perf script to report results from something like perf or likwid for whichever counters we might want to query.
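As a starting point, the script could wrap each benchmark invocation with `perf stat` and parse its machine-readable output. A minimal sketch, assuming Linux perf is installed; the event list and the commented-out model invocation are placeholders:

```python
import subprocess

def run_with_counters(cmd, events=("frontend_retired.l1i_miss", "instructions")):
    """Run cmd under `perf stat` and return {event_name: count}.

    -x, selects CSV output, which perf stat writes to stderr.
    """
    perf_cmd = ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + list(cmd)
    proc = subprocess.run(perf_cmd, capture_output=True, text=True, check=True)
    counts = {}
    for line in proc.stderr.splitlines():
        fields = line.split(",")
        # CSV rows look like: value,unit,event_name,... ; skip uncounted events.
        if len(fields) >= 3 and fields[0] not in ("", "<not supported>", "<not counted>"):
            counts[fields[2]] = float(fields[0])
    return counts

# Hypothetical invocation of one of the benchmark binaries:
# print(run_with_counters(["./hepatitisME", "sample", "data", "file=hepatitisME.data.R"]))
```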

IMO, the other big thing that’s missing here is results from gcc, which we could also get with some small changes to @seantalts’s script.

It should be possible to get even better results by selectively enabling the optimizations that are most useful. That can be done by hand, but I suspect it will be easier, and give better results, to use PGO. Unless someone else is already doing this, I’ll probably mess with it later this week or this weekend.
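For anyone who wants to experiment in parallel, clang’s instrumentation-based PGO loop looks roughly like this (a minimal sketch driven from Python; assumes clang and llvm-profdata are on the PATH, and the file names and the training run are placeholders, since the real CmdStan build adds many more flags and include paths):

```python
import subprocess

def sh(args):
    print(" ".join(args))
    subprocess.run(args, check=True)

# 1. Build with profile instrumentation.
sh(["clang++", "-O2", "-fprofile-instr-generate", "model.cpp", "-o", "model_instr"])

# 2. Run a representative workload; by default this writes default.profraw.
sh(["./model_instr"])

# 3. Merge the raw profile(s) into an indexed profile.
sh(["llvm-profdata", "merge", "-output=model.profdata", "default.profraw"])

# 4. Rebuild, using the profile to guide optimization decisions.
sh(["clang++", "-O2", "-fprofile-instr-use=model.profdata", "model.cpp", "-o", "model_pgo"])
```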

I’m not proposing changing the default to -O2, but I figured it was worth posting preliminary results to start the discussion, in case there are known cases where -O3 totally demolishes -O2 or there are other things to consider.

It has been a while since we looked at -O2 vs. -O3, so what we believed before is not necessarily applicable to the latest compilers or to -std=c++14.

I’d be interested in seeing the effect of -march=native as well. Anecdotally, I have seen a 30% speedup from using it; it enables CPU-specific instructions.

Funny story: we have a test that fails at -O2: https://github.com/stan-dev/math/issues/806

Thanks—that’s a very helpful intuition.

We never looked into this again after Matt Hoffman told us it was a huge pain (he spent an internship at Google working on PGO, but that was at least 8 years ago).

If -O2 compiles faster and produces smaller output code and the time penalty is in the noise for end-to-end runs, then by all means we should switch to -O2.

Definitely.

I’ve always had that turned on. Didn’t realize it was that significant. Is that for code with lots of big matrix ops? Those will be the places with lots of double-based loops to unroll.

Looks like maybe we should run tests at different optimization levels at least from time to time, to catch this stuff with minimal effort.

I haven’t measured compile times yet, but it would be surprising if compile time were slower at -O2, and runtime seems to be better. These measurements were done before the perf branch of cmdstan had compile-time measurements, and I haven’t had an idle machine to run benchmarks on since then, but I’ll try to make sure I do this before I start a job and get busy.

Also, are there any other models you’d suggest running before I submit a PR for this? The PR should be trivial, but I want to make sure I’m not missing something obvious from the measurement side.

There are a bunch of different cases. Something dominated by matrix algebra, vectorized densities, and vectorized unary functions would be good. All the implementations depend on compiler optimization; basically any kind of expression template will.

@seantalts has been thinking about performance regression testing, so he’s the one to ask about setting this up at this point.

I went ahead and made a new repo for the performance test tools (with model repos as submodules) here: https://github.com/stan-dev/performance-tests-cmdstan

It has a README that shows how to compare two git hashes. I would probably use the stat_comp_benchmarks models for this, since there’s some open debate about how important the other ones are for evaluating a potential performance improvement that isn’t universally better.

I just double-checked this optimization; it looks like it’s more on the order of a 10-15% speedup. I’m not doing a lot of Stan matrix ops, but I have a lot of loops in my code to deal with tridiagonal matrices, etc.