thinLTO / standard benchmarks?

I’ve been using @seantalts's code in the “perf” branch of cmdstan to do some performance tests, and it seems like thinLTO (a clang feature that tries to get most of the benefit of LTO without significantly impacting compile time) gives a noticeable improvement.
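For anyone who wants to try it, here's a minimal sketch of how thinLTO can be turned on for a CmdStan build. This assumes clang >= 3.9 and the usual make/local override mechanism; it isn't necessarily exactly what the perf-branch scripts do.

```bash
# Minimal sketch (not necessarily what the perf-branch scripts do): enable
# thinLTO for a CmdStan build via make/local. -flto=thin is clang's thinLTO
# flag and needs to be passed at both compile time and link time.
cat >> make/local <<'EOF'
CXXFLAGS += -flto=thin
LDFLAGS += -flto=thin
EOF
make clean-all
make build
```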

Since I’m new around here, my questions are:

  1. What’s the standard of evidence used to evaluate performance changes?
  2. Is there a standard set of benchmarks that should be run?

Using @seantalts's scripts, here’s the ratio of runtimes for a variety of examples that are included in cmdstan (normal / thinLTO, so > 1 means some gain for thinLTO and < 1 means some loss). Each example was run 100 times and the runtimes were averaged. On (2), there are probably better benchmarks to run, but these were really easy to run with @seantalts's script, so I used them.

('examples/example-models/bugs_examples/vol1/inhalers/inhalers.stan', 1.05)
('examples/example-models/bugs_examples/vol1/dyes/dyes.stan', 1.06)
('examples/example-models/bugs_examples/vol1/litter/litter.stan', 1.05)
('examples/example-models/bugs_examples/vol1/litter/litter_old_param.stan', 1.05)
('examples/example-models/bugs_examples/vol1/seeds/seeds.stan', 1.03)
('examples/example-models/bugs_examples/vol1/seeds/seeds_centered.stan', 1.03)
('examples/example-models/bugs_examples/vol1/seeds/seeds_stanified.stan', 1.04)
('examples/example-models/bugs_examples/vol1/oxford/oxford.stan', 1.01)
('examples/example-models/bugs_examples/vol1/salm/salm.stan', 1.12)
('examples/example-models/bugs_examples/vol1/salm/salm2.stan', 1.11)
('examples/example-models/bugs_examples/vol1/bones/bones.stan', 1.05)
('examples/example-models/bugs_examples/vol1/equiv/equiv.stan', 1.07)
('examples/example-models/bugs_examples/vol1/surgical/surgical.stan', 1.04)
('examples/example-models/bugs_examples/vol1/surgical/surgical_stanified.stan', 1.04)
('examples/example-models/bugs_examples/vol1/pump/pump.stan', 1.05)
('examples/example-models/bugs_examples/vol1/epil/epil.stan', 1.11)
('examples/example-models/bugs_examples/vol1/stacks/stacks_e_dexp_ridge.stan', 1.04)
('examples/example-models/bugs_examples/vol1/stacks/stacks_a_normal.stan', 1.06)
('examples/example-models/bugs_examples/vol1/stacks/stacks_b_dexp.stan', 1.07)
('examples/example-models/bugs_examples/vol1/stacks/stacks_d_normal_ridge.stan', 1.06)
('examples/example-models/bugs_examples/vol1/stacks/stacks_c_t4.stan', 1.07)
('examples/example-models/bugs_examples/vol1/stacks/stacks_f_t4_ridge.stan', 1.05)
('examples/example-models/bugs_examples/vol1/blocker/blocker.stan', 1.03)
('examples/example-models/bugs_examples/vol1/leukfr/leukfr.stan', 1.07)
('examples/example-models/bugs_examples/vol1/lsat/lsat.stan', 0.99)
('examples/example-models/bugs_examples/vol1/kidney/kidney.stan', 1.04)
('examples/example-models/bugs_examples/vol1/magnesium/magnesium.stan', 1.03)
('examples/example-models/bugs_examples/vol1/rats/rats.stan', 1.02)
('examples/example-models/bugs_examples/vol1/rats/rats_vec.stan', 1.05)
('examples/example-models/bugs_examples/vol1/rats/rats_vec_unit.stan', 1.05)
('examples/example-models/bugs_examples/vol1/leuk/leuk.stan', 1.06)
('examples/example-models/bugs_examples/vol1/mice/mice.stan', 1.02)
('examples/example-models/bugs_examples/vol3/data_cloning/seeds.stan', 1.06)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitis.stan', 1.07)
('examples/example-models/bugs_examples/vol3/hepatitis/hepatitisME.stan', 1.04)

The geometric mean of these ratios is 1.05.
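As a sanity check on that number: the geometric mean is just the exponential of the mean of the logs. For example, with the ratios above in a hypothetical file ratios.txt, one per line:

```bash
# Geometric mean = exp(mean(log(ratio))). "ratios.txt" is a hypothetical
# file holding the 35 ratios above, one per line.
awk '{ s += log($1); n++ } END { printf "geometric mean = %.3f\n", exp(s / n) }' ratios.txt
```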

This test was done on clang 6 (thinLTO was released with clang 3.9) on a 4.2 GHz Kaby Lake machine.

On (1), some reasons this benchmark might be invalid are:

(A) these examples may not be representative of “real” workloads. If this is the case, I can re-run with other benchmarks
(B) compile time wasn’t measured. Although the goal of thinLTO is to provide the benefit of LTO without significantly impacting compile time, it’s possible that compile time was significantly slowed down. I think that either @seantalts or I will edit his scripts to also measure compile time, so we should have numbers on this at some point (a rough sketch of the measurement is just after this list)
(C) clang 6.0 is recent and relatively few people are using such a new version of clang. It’s possible that there have been significant changes to thinLTO and that someone using an older clang may see less benefit.
(D) cmdstan may link things differently than pystan or rstan, which may impact the results.
(E) these results are running single-threaded and may not generalize to the multi-threaded case. For this optimization, I don’t see any reason to expect worse results when multi-threaded, but you never know.
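On (B), the kind of measurement I have in mind is roughly the following: time a from-clean build of one of the example models using CmdStan's per-model make rule. The files removed below are a guess at what's needed to force a full rebuild, so treat this as a sketch rather than the actual script change.

```bash
# Rough sketch: time a from-clean build of one example model using CmdStan's
# per-model make rule. The files removed are a guess at what's needed to
# force a full rebuild.
model=examples/example-models/bugs_examples/vol1/inhalers/inhalers
rm -f "$model" "$model.hpp" "$model.o"
/usr/bin/time -p make "$model"
```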

If these benchmarks are considered good enough, I’d be happy to submit a PR, but if there’s some canonical set of benchmarks I should run, I certainly don’t mind doing that.

LTO also makes rstanarm build a shared object of Stan models faster. With g++, it is even better because you can parallelize the linking step across cores.
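For reference, the g++ parallelism mentioned here comes from gcc's -flto=<n> form, which runs the link-time optimizer with n parallel jobs; the value 4 below is just an example, not anything rstanarm sets by default.

```bash
# gcc only: plain -flto at compile time, and -flto=<n> at link time to run
# the link-time optimizer across n parallel jobs (4 here is just an example).
echo 'CXXFLAGS += -flto' >> make/local
echo 'LDFLAGS += -flto=4' >> make/local
```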


Welcome, @danluu! And thanks for the super thoughtful post and commentary. As you’ll see, we’re still working a lot of this out ourselves.

To save other people a complicated acronym search, “LTO” means link-time optimization.

I wish this were better defined. First, we want to make sure we keep getting the right results, but I don’t think that’s such a concern with a compiler flag. Second, we want to test things that are representative, but also test microbenchmarks. I think the tests you ran are fine for what you’re evaluating. I’m really surprised there was that much to be gained over linking.

Not yet, but we need to curate such a set. @seantalts is already working on that.

We definitely need to measure this. Compile time is already a huge pain point for users and we don’t want to make it worse by default.

We want to target future compilers, but we should probably make sure the existing default compilers aren’t negatively impacted (or at least not much).

We don’t have anything running multi-threaded yet, but @bgoodri’s working on OpenMP for multi-threading some of our probability functions.

Let’s wait to see what @syclik and @seantalts have to say, as they’ve been managing a lot of this. @betanalpha may jump in as well, as he’s been managing the statistical soundness of our tests.

I’m just using our example-models repo and running anything that has an appropriately-ish named data file in the same directory. We could curate for reals but I don’t know that Stan has data on what models are “most indicative” or anything like that. I’d be tempted to just put in all the models that we like and know about. I think a good canonical store for that would be that example-models repo.
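Roughly, the selection looks something like the sketch below. This is a reconstruction, and the ".data.R" suffix is a guess at the naming convention the script actually keys on.

```bash
# Reconstruction of the selection described above: keep any .stan file with a
# similarly named data file next to it. The ".data.R" suffix is a guess at
# the naming convention the script actually uses.
find examples/example-models -name '*.stan' | while read -r model; do
  data="${model%.stan}.data.R"
  [ -f "$data" ] && echo "$model"
done
```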

I have great news for you - the new CmdStan performance test measures and outputs compile time as one of the benchmarks it tests.

@danluu, one slight complication I don’t think I described is that there are many places where Stan is compiled with different compiler flags. Each of the interfaces allows users to set their own compile flags. Right now I believe CmdStan, Stan, and Math all use the Math repo’s makefile defaults, both for tests (in Stan and Math) and for models (in CmdStan). RStan and PyStan I believe have their own defaults and instructions for overriding them. This only comes up when it comes time to “change Stan’s compiler flags” - just letting you know the flags sort of live in a few places.
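As a concrete illustration, the same flag would have to be set separately per interface. The file locations below are the usual conventions for each interface, not something specific to this change, so double-check them for RStan/PyStan in particular.

```bash
# Same flag, different homes (file locations are the usual conventions for
# each interface, not something this post prescribes):
# CmdStan / Stan / Math tests: the repo's make/local
echo 'CXXFLAGS += -flto=thin' >> make/local
echo 'LDFLAGS += -flto=thin' >> make/local
# RStan: the user-level ~/.R/Makevars
echo 'CXXFLAGS += -flto=thin' >> ~/.R/Makevars
```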

I would like to add a bash script that, given two git commits, outputs a file showing how all of the performance tests differ across the commits (and checks for matching numerical results to some tolerance). This should be fairly easy right now. Then it’s pretty easy for developers to run the kind of experiment @danluu is right now running by hand.
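Something like the following is the shape I have in mind. The perf-test driver name (./run_perf_tests.sh) and its one-model-per-line "model,seconds" CSV output are placeholders, not the real script.

```bash
#!/usr/bin/env bash
# Sketch of the two-commit comparison: build and run the perf tests at each
# commit, then print per-model runtime ratios. "./run_perf_tests.sh" and its
# "model,seconds" CSV output are hypothetical placeholders.
set -euo pipefail
old_rev="$1"
new_rev="$2"

for rev in "$old_rev" "$new_rev"; do
  git checkout "$rev"
  git submodule update --init --recursive
  make clean-all
  ./run_perf_tests.sh > "perf_${rev}.csv"
done

# Ratio > 1 means the new commit is faster (old time / new time).
join -t, <(sort "perf_${old_rev}.csv") <(sort "perf_${new_rev}.csv") \
  | awk -F, '{ printf "%-60s %.2f\n", $1, $2 / $3 }'
```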

Using the updated script from @seantalts, I see the following compile times (sum of compile times for all models tested):

“normal”: 242s
thinLTO: 284s

284 / 242 = 1.17

This seems like enough of a slowdown that it probably shouldn’t be on by default, even though people who are building longer-running models probably do want this option.

If that’s the case, what’s a good place for writing this down? From skimming the manual and the wiki, I don’t see an obvious place to mention this, but I think people who have long-running models would probably like to know that they can get a 5% speedup at the cost of seconds or tens of seconds of compile time.

That’s what I meant by “curate”. In particular, we want to stick to ones we think we can fit. I think it’d be best to start with the ones Michael has put together in the test repo and build out, rather than start with everything and trim down in the face of failure.

Maybe we should discuss this at the meeting on Thursday or in person. I’d hate to see a 20% slowdown in model compilation, but I’d also hate to waste a 5% speedup for larger models. Figuring this out would be good to do in general, because if a program’s going to be very fast to run, we also don’t need to spend all the time doing full -O3 optimization. Using -O0 doubles compile speed relative to -O3, but increases runtime by a factor of 10 or more. But if a model’s going to be run once and fit in less than a second, then -O3 is wasteful.
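For what it’s worth, I believe CmdStan’s makefiles already expose the optimization level as a make variable, so the per-model choice could look something like the sketch below. The variable name O is from memory and worth double-checking; the bernoulli example path is just CmdStan’s stock example.

```bash
# The optimization level is (I believe) exposed as a make variable in
# CmdStan's makefiles; the variable name O is from memory, so double-check.
# Quick one-off model that fits in under a second: skip heavy optimization.
make O=0 examples/bernoulli/bernoulli
# Long-running model: keep the default -O3 (and possibly thinLTO on top).
make examples/bernoulli/bernoulli
```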

There should probably be a curated list of models, but for regression testing I think it’s fine to shotgun everything that we ever supported, especially since we rarely break backwards compatibility and we’d like to know when it happens.

The point is that we want tests that are reliable. Some of those models don’t fit reliably, so they aren’t good for testing speed other than in terms of log density plus gradient time. For that, we can throw anything at it, including just functions below the Stan model level.

Why aren’t they good for regression testing performance?

They can regression-test log density evaluations, but not the algorithms, because algorithm testing is predicated on getting the right answer.

I agree that we could have additional tests that focus on (especially comparative) algorithm performance, though I’m still not convinced we’d only want to fit models that pass our judgment there. But this is just about performance regressions. You’re right that it only tests log density + gradient evaluations in some sense, but for benchmarking, an important consideration is to test those in a realistic context.

I don’t know any alternative. I don’t see how we could test a model that we can’t fit reliably. We can test how long 2000 iterations takes, but that doesn’t tell us anything if it doesn’t converge and mix. We’ve always been trying to test speed conditioned on getting the right answer.

I hope to bring this up in the meeting - I think having a model that fits is important for getting certain kinds of information out of a performance test. But for what I will call a regression test, where you just want to make sure performance doesn’t go down from changes one doesn’t think should affect performance, goodness-of-fit takes a backseat to pure code coverage.

Got it. This’ll test the math lib and generated code. We can come back to the issue if we modify any algorithms.