Unintuitive Benchmark Thread

seantalts · March 17, 2019, 11:19pm

Hey y’all,

I've found some weird, unintuitive results when benchmarking C++ that sometimes even go against conventional wisdom, and I wanted to publish them somewhere we can all see and reference. All the source is in the perf-math repo, though I need to organize it better soon. I'm going to try to collect weird benchmark results in this post and keep it up to date.

std::vectors

`.reserve()` followed by `push_back` is slower than normal initialization and `operator[]` assignment

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
push                       5100 ns         4505 ns       135015
reserve_push               2474 ns         2371 ns       297215
initialize_op_assign        900 ns          892 ns       652486

github.com

seantalts/perf-math/blob/master/stdvectorfun.cpp

#include <benchmark/benchmark.h>
#include <vector>

const int kSize = 1000;

static void push(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<double> v;
    benchmark::DoNotOptimize(v.data());
    for (int i = 0; i < kSize; ++i)
      v.push_back(i + 27.2);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(push);

static void reserve_push(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<double> v;
    v.reserve(kSize);
    benchmark::DoNotOptimize(v.data());

(File truncated here; see the repo for the full source.)
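The quoted file is cut off before the third case, so here is a minimal standard-library-only sketch of the three fill strategies the table compares. The function names (`fill_push`, `fill_reserve_push`, `fill_init_assign`) and the `time_ns` helper are made up for illustration and are not taken from the perf-math repo; the timing loop is a crude stand-in for Google Benchmark, so treat any numbers it gives as rough.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

constexpr int kSize = 1000;

// Grow from empty: reallocates and copies whenever capacity is exceeded.
std::vector<double> fill_push() {
  std::vector<double> v;
  for (int i = 0; i < kSize; ++i) v.push_back(i + 27.2);
  return v;
}

// One allocation up front, but every push_back still checks capacity
// and updates the size.
std::vector<double> fill_reserve_push() {
  std::vector<double> v;
  v.reserve(kSize);
  for (int i = 0; i < kSize; ++i) v.push_back(i + 27.2);
  return v;
}

// Size-construct (elements are zero-filled), then plain unchecked stores.
std::vector<double> fill_init_assign() {
  std::vector<double> v(kSize);
  for (int i = 0; i < kSize; ++i) v[i] = i + 27.2;
  return v;
}

// Hypothetical helper: average wall-clock nanoseconds per call of f.
template <typename F>
double time_ns(F f, int reps = 10000) {
  std::size_t sink = 0;  // consume each result so the calls are not trivially dead
  auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) sink += f().size();
  auto stop = std::chrono::steady_clock::now();
  volatile std::size_t keep = sink;  // force the accumulated value to be stored
  (void)keep;
  return std::chrono::duration<double, std::nano>(stop - start).count() / reps;
}
```

Calling `time_ns(fill_push)`, `time_ns(fill_reserve_push)` and `time_ns(fill_init_assign)` from a `main` built with optimizations on gives a quick sanity check on the ordering, though an optimizer can still see through a harness this simple.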

Eigen::VectorXd element access is slower than std::vector’s

…even with .coeff(), though just slightly:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_EigenElementAccess       0.665 ns        0.662 ns   1000000000
BM_EigenCoeff               0.420 ns        0.419 ns   1000000000
BM_VectorElementAccess      0.347 ns        0.347 ns   1000000000

github.com

seantalts/perf-math/blob/master/element_access.cpp

#include <benchmark/benchmark.h>
#include <Eigen/Dense>

static void BM_EigenElementAccess(benchmark::State& state) {
  using Eigen::MatrixXd;
  auto m_d = MatrixXd::Random(500, 500).eval();

  for (auto _ : state) {
    benchmark::DoNotOptimize(m_d.data());
    m_d(400);
    benchmark::ClobberMemory();
  }
}
BENCHMARK(BM_EigenElementAccess);

static void BM_EigenCoeff(benchmark::State& state) {
  using Eigen::MatrixXd;
  auto m_d = MatrixXd::Random(500, 500).eval();

  for (auto _ : state) {

(File truncated here; see the repo for the full source.)

Anyone else have some they want to include?

stevebronder · March 18, 2019, 6:32am

First is pretty interesting! I wonder how much that changes when the 3rd has to initialize a more complex type

seantalts · March 18, 2019, 8:09am

Yeah, there must be some threshold, or perhaps doubles and ints are just special cased in the compiler, haha.

jpritikin · March 18, 2019, 12:07pm

This doesn’t make any sense. Have you looked at the generated assembly code? I suspect a compiler bug or something.

seantalts · March 18, 2019, 11:27pm

Not yet! Hoping this thread sparks discussion too :) We can summarize the results in the top post or some eventual wiki.

Are you talking about the first result or the 2nd?

increasechief · March 19, 2019, 3:24am

Here are my results:

------------------------------------------------------------
Benchmark                     Time           CPU Iterations
------------------------------------------------------------
push                        732 ns        732 ns     955334
reserve_push                580 ns        580 ns    1206236
initialize_op_assign        260 ns        260 ns    2694619
BM_EigenElementAccess         0 ns          0 ns 1000000000
BM_EigenCoeff                 0 ns          0 ns 1000000000
BM_VectorElementAccess        0 ns          0 ns 1000000000

Thought that the 0 results might be of interest more than anything.
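One possible explanation for the 0 ns rows, not confirmed anywhere in this thread, is that the element read's result is never used, so the compiler may drop the access entirely. The sketch below shows the general idea with plain `std::vector`; `read_discarded` and `read_kept` are hypothetical names, and the `volatile` sink is just one simple way to keep a value observable.

```cpp
#include <cstddef>
#include <vector>

// Reads v[idx] `reps` times and throws the value away. Nothing observable
// depends on the loads, so an optimizer is free to remove the whole loop.
void read_discarded(const std::vector<double>& v, std::size_t idx, long reps) {
  for (long r = 0; r < reps; ++r) {
    v[idx];
  }
}

// Same loop, but each value is written to a volatile variable. The stores
// must be performed, although the compiler may still hoist the load itself
// out of the loop.
double read_kept(const std::vector<double>& v, std::size_t idx, long reps) {
  volatile double sink = 0.0;
  for (long r = 0; r < reps; ++r) sink = v[idx];
  return sink;
}
```

Comparing the generated assembly for the two functions at `-O2` is a quick way to see whether the discarded version survives at all.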

seantalts · March 19, 2019, 8:12am

Weird! I notice the results are printed as if they ran in the same benchmark run - did you link them together? Maybe that caused weirdness?

jpritikin · March 19, 2019, 12:30pm

Both.

increasechief · March 19, 2019, 1:47pm

They were run separately. I manually removed the duplicated column heading here.

wds15 · March 19, 2019, 4:33pm

Did you also try what happens if you insert into a std::vector? The standard library should optimize that accordingly… at least that is what I would expect.

Bob_Carpenter · March 29, 2019, 5:38pm

That’s what I’d have expected for double. Push-back is doing extra work compared to the unchecked set.

Exactly. Whether preallocating will be faster will depend on what’s being allocated. When you do this:

 std::vector<double> v(kSize);

It doesn’t have to do much beyond allocating storage for kSize doubles and zero-filling it. But if you did this:

std::vector<var> v(kSize);

it’s a very different story because now the default constructor var() gets called kSize times to make sure v is initialized properly.
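A small sketch makes the difference countable. `Counted` is a hypothetical stand-in for a type like `var` with a non-trivial default constructor, not Stan's actual class, and the global counter exists only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Illustration-only counter of default-constructor calls.
std::size_t g_constructed = 0;

// Hypothetical stand-in for a type with a non-trivial default constructor.
struct Counted {
  double value;
  Counted() : value(0.0) { ++g_constructed; }
};

// Default-constructor calls made by size-constructing n elements.
std::size_t ctor_calls_sized(std::size_t n) {
  g_constructed = 0;
  std::vector<Counted> v(n);  // default-constructs every element
  return g_constructed;
}

// Default-constructor calls made by only reserving capacity for n elements.
std::size_t ctor_calls_reserved(std::size_t n) {
  g_constructed = 0;
  std::vector<Counted> v;
  v.reserve(n);  // allocates raw storage, constructs nothing
  return g_constructed;
}
```

With C++11 or later, `ctor_calls_sized(1000)` returns 1000 and `ctor_calls_reserved(1000)` returns 0, which is why the size-constructing approach can lose its advantage once the element type is expensive to default-construct.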

That’s surprising. What is Eigen doing wrong? This should just be the same memory dereferencing.

syclik · April 12, 2019, 2:16pm

Something we’ve verified recently is that threading performance results don’t hold across different versions of compilers (gcc 5 vs gcc 6), they don’t hold across different compilers (clang++ vs g++), and they don’t hold across OSes (Windows gcc 4 vs Linux gcc 4).

I think it’s still pretty safe to assume that numeric computations are optimized similarly, although I’ve seen threads saying that’s not true for Intel. We should still check.

increasechief · April 13, 2019, 4:37pm

Would it be of interest to develop a custom test suite via Phoronix?

http://www.phoronix-test-suite.com/

stevebronder · April 14, 2019, 9:04pm

That would be interesting. I’m hoping this week or next to write up something like a spec using the template below. If you have any time, feel free!

github.com

stan-dev/design-docs/blob/master/0000-template.md

- Feature Name: (fill me in with a unique ident, my_awesome_feature)
- Start Date: (fill me in with today's date, YYYY-MM-DD)
- RFC PR: (leave this empty)
- Stan Issue: (leave this empty)

# Summary
[summary]: #summary

One paragraph explanation of the feature.

# Motivation
[motivation]: #motivation

Why are we doing this? What use cases does it support? What is the expected outcome?

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Explain the proposal as if it was already included in the language and you were teaching it to another Rust programmer. That generally means:

(File truncated here; see the repo for the full template.)

seantalts · April 15, 2019, 1:34am

@increasechief and @stevebronder If you’re writing a design doc, feel free to submit it as a pull request on the design-docs repo and just paste the link here. Then we can all comment on it there and it can evolve with feedback. Thanks!

Do you have examples of these? I remember seeing only that performance improvements might appear only for clang and not for gcc, but never the inversion of a benchmark result such that what seemed like the best answer on X was not the best answer on Y. Of course, separately we saw that the Mac Pro had a much steeper performance penalty for pointer AD stacks compared with all of our Mac laptops (which I guess is a 4th dimension across which benchmarks can vary - hardware). But I don’t remember seeing benchmarks conflict about which code was faster, just the magnitudes, right?

increasechief · April 15, 2019, 3:16am

Here you go: 0003 "standev_phoronix_test_suite" · Pull Request #4 · stan-dev/design-docs · GitHub.
