Unintuitive Benchmark Thread

Hey y’all,

I’ve found some weird, unintuitive results when benchmarking C++ performance, some of which even seem to go against conventional wisdom, and I wanted to publish them somewhere we can all see and reference. All the source is in the perf-math repo, though I need to organize it better soon. I’m going to collect weird benchmark results in this post and keep it up to date.

std::vectors

.reserve() followed by push_back is slower than normal initialization and operator[] assignment

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
push                       5100 ns         4505 ns       135015
reserve_push               2474 ns         2371 ns       297215
initialize_op_assign        900 ns          892 ns       652486
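
For anyone who wants to picture what is being compared, here is a minimal sketch of the three fill strategies. The function names mirror the benchmark names above, but the bodies are my assumption about what each one does, not the actual perf-math source, and the value of kSize is made up since the size used in the runs isn't stated here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical problem size; the size used in the real benchmarks is not given.
constexpr std::size_t kSize = 1000;

// "push": start empty and let the vector reallocate as it grows.
std::vector<double> fill_push() {
  std::vector<double> v;
  for (std::size_t i = 0; i < kSize; ++i) {
    v.push_back(static_cast<double>(i));
  }
  return v;
}

// "reserve_push": a single allocation up front, but every push_back
// still checks the capacity and updates the size.
std::vector<double> fill_reserve_push() {
  std::vector<double> v;
  v.reserve(kSize);
  for (std::size_t i = 0; i < kSize; ++i) {
    v.push_back(static_cast<double>(i));
  }
  return v;
}

// "initialize_op_assign": sized construction zero-fills the storage,
// then operator[] writes each element with no capacity check.
std::vector<double> fill_initialize_op_assign() {
  std::vector<double> v(kSize);
  for (std::size_t i = 0; i < kSize; ++i) {
    v[i] = static_cast<double>(i);
  }
  return v;
}
```

All three return the same contents, so any timing difference comes from how the elements get there rather than from what ends up in the vector.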

Eigen::VectorXd element access is slower than std::vector’s

…even with .coeff(), though just slightly:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_EigenElementAccess       0.665 ns        0.662 ns   1000000000
BM_EigenCoeff               0.420 ns        0.419 ns   1000000000
BM_VectorElementAccess      0.347 ns        0.347 ns   1000000000

Anyone else have some they want to include?

The first is pretty interesting! I wonder how much that changes when the third has to initialize a more complex type.

Yeah, there must be some threshold, or perhaps doubles and ints are just special cased in the compiler, haha.

This doesn’t make any sense. Have you looked at the generated assembly code? I suspect a compiler bug or something.

Not yet! Hoping this thread sparks discussion too :) We can summarize the results in the top post or some eventual wiki.

Are you talking about the first result or the 2nd?

Here are my results:

------------------------------------------------------------
Benchmark                     Time           CPU Iterations
------------------------------------------------------------
push                        732 ns        732 ns     955334
reserve_push                580 ns        580 ns    1206236
initialize_op_assign        260 ns        260 ns    2694619
BM_EigenElementAccess         0 ns          0 ns 1000000000
BM_EigenCoeff                 0 ns          0 ns 1000000000
BM_VectorElementAccess        0 ns          0 ns 1000000000

Thought that the 0 results might be of interest more than anything.

Weird! I notice the results are printed as if they ran in the same benchmark run - did you link them together? Maybe that caused weirdness?

Both.

They were run separately. I manually removed the duplicated column heading here.

Did you also try what happens if you insert into a std vector? The std lib should optimize that accordingly…at least that is what I would expect.

That’s what I’d have expected for double. Push-back is doing extra work compared to the unchecked set.

Exactly. Whether preallocating will be faster will depend on what’s being allocated. When you do this:

 std::vector<double> v(kSize);

It doesn’t have to do much beyond allocating the memory and zero-filling it. But if you did this:

std::vector<var> v(kSize);

it’s a very different story because now the default constructor var() gets called kSize times to make sure v is initialized properly.
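
To make the constructor cost visible, here is a small sketch that counts default-constructor calls. Counted is a hypothetical stand-in I made up for illustration (it is not var), and the helper function names are mine as well. It assumes C++11 or later, where sized construction default-constructs each element in place.

```cpp
#include <cstddef>
#include <vector>

// Global tally of default-constructor calls (illustration only).
std::size_t g_default_ctor_calls = 0;

// Hypothetical stand-in for a type with a non-trivial default constructor.
struct Counted {
  double value;
  Counted() : value(0.0) { ++g_default_ctor_calls; }
};

// Sized construction: every one of the n elements is default-constructed.
std::size_t ctor_calls_for_sized_construction(std::size_t n) {
  g_default_ctor_calls = 0;
  std::vector<Counted> v(n);
  return g_default_ctor_calls;
}

// reserve() only allocates raw storage; no elements are constructed yet.
std::size_t ctor_calls_for_reserve(std::size_t n) {
  g_default_ctor_calls = 0;
  std::vector<Counted> v;
  v.reserve(n);
  return g_default_ctor_calls;
}
```

With n elements, the sized construction pays for n constructor calls before any assignment happens, while reserve() pays for none. So the ranking from the double benchmark could plausibly change for a type whose constructor is expensive.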

That’s surprising. What is Eigen doing wrong? This should just be the same memory dereferencing.

Something we’ve verified recently is that threading performance results don’t hold across different versions of compilers (gcc 5 vs gcc 6), they don’t hold across different compilers (clang++ vs g++), and they don’t hold across OSes (Windows gcc 4 vs Linux gcc 4).

I think it’s still pretty safe to assume that numeric computations are optimized similarly, although I’ve seen threads saying that’s not true for Intel. We should still check.

Would it be of interest to develop a custom test-suite via phoronix?

http://www.phoronix-test-suite.com/

That would be interesting. I’m hoping this week or next to write up something like a spec with the below. If you have any time, feel free!

@increasechief and @stevebronder If you’re writing a design doc, feel free to submit it as a pull request on the design-docs repo and just paste the link here. Then we can all comment on it there and it can evolve with feedback. Thanks!

Do you have examples of these? I remember seeing only that performance improvements might appear only for clang and not for gcc, but never the inversion of a benchmark result such that what seemed like the best answer on X was not the best answer on Y. Of course, separately we saw that the Mac Pro had a much steeper performance penalty for pointer AD stacks compared with all of our Mac laptops (which I guess is a 4th dimension across which benchmarks can vary - hardware). But I don’t remember seeing benchmarks conflict about which code was faster, just the magnitudes, right?

Here you go: https://github.com/stan-dev/design-docs/pull/4.