Unintuitive Benchmark Thread

maintenance
#1

Hey y’all,

I’ve discovered some weird, unintuitive results while benchmarking C++ performance, some of which even seem to go against traditional wisdom, and I wanted to publish them somewhere we can all see and reference them. All the source is in the perf-math repo, though I need to organize it better soon. I’m going to try to collect weird benchmark results in this post and keep it up to date.

std::vectors

.reserve() followed by push_back() is slower than constructing at full size and assigning through operator[]

---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
push                       5100 ns         4505 ns       135015
reserve_push               2474 ns         2371 ns       297215
initialize_op_assign        900 ns          892 ns       652486
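
For reference, here’s roughly the shape of the three variants (a minimal sketch, assuming Google Benchmark, which the output format suggests; kSize here is a stand-in for whatever size the actual perf-math code uses):

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <vector>

    static constexpr std::size_t kSize = 1024;  // stand-in size

    // push: grow organically, reallocating as capacity is exceeded.
    static void push(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<double> v;
        for (std::size_t i = 0; i < kSize; ++i)
          v.push_back(static_cast<double>(i));
        benchmark::DoNotOptimize(v.data());
      }
    }
    BENCHMARK(push);

    // reserve_push: one up-front allocation, but push_back still
    // checks capacity and bumps the size on every call.
    static void reserve_push(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<double> v;
        v.reserve(kSize);
        for (std::size_t i = 0; i < kSize; ++i)
          v.push_back(static_cast<double>(i));
        benchmark::DoNotOptimize(v.data());
      }
    }
    BENCHMARK(reserve_push);

    // initialize_op_assign: construct at full size, then write
    // through operator[] with no per-element bookkeeping.
    static void initialize_op_assign(benchmark::State& state) {
      for (auto _ : state) {
        std::vector<double> v(kSize);
        for (std::size_t i = 0; i < kSize; ++i)
          v[i] = static_cast<double>(i);
        benchmark::DoNotOptimize(v.data());
      }
    }
    BENCHMARK(initialize_op_assign);

    BENCHMARK_MAIN();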

Eigen::VectorXd element access is slower than std::vector’s

…even with .coeff(), though just slightly:

-----------------------------------------------------------------
Benchmark                       Time             CPU   Iterations
-----------------------------------------------------------------
BM_EigenElementAccess       0.665 ns        0.662 ns   1000000000
BM_EigenCoeff               0.420 ns        0.419 ns   1000000000
BM_VectorElementAccess      0.347 ns        0.347 ns   1000000000
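
And the element-access ones, roughly (again a sketch rather than the exact perf-math code; the fixed index and the DoNotOptimize calls are my simplifications):

    #include <benchmark/benchmark.h>
    #include <Eigen/Dense>
    #include <vector>

    static constexpr int kN = 1024;  // stand-in size

    // operator() on Eigen::VectorXd (bounds-asserted in debug builds).
    static void BM_EigenElementAccess(benchmark::State& state) {
      Eigen::VectorXd v = Eigen::VectorXd::Random(kN);
      for (auto _ : state)
        benchmark::DoNotOptimize(v(kN / 2));
    }
    BENCHMARK(BM_EigenElementAccess);

    // .coeff(): Eigen's unchecked access.
    static void BM_EigenCoeff(benchmark::State& state) {
      Eigen::VectorXd v = Eigen::VectorXd::Random(kN);
      for (auto _ : state)
        benchmark::DoNotOptimize(v.coeff(kN / 2));
    }
    BENCHMARK(BM_EigenCoeff);

    // std::vector operator[]: unchecked, for comparison.
    static void BM_VectorElementAccess(benchmark::State& state) {
      std::vector<double> v(kN, 1.0);
      for (auto _ : state)
        benchmark::DoNotOptimize(v[kN / 2]);
    }
    BENCHMARK(BM_VectorElementAccess);
    // (plus BENCHMARK_MAIN() if built standalone)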

Anyone else have some they want to include?

4 Likes

#2

The first one is pretty interesting! I wonder how much that changes when the third variant has to initialize a more complex type.

1 Like

#3

Yeah, there must be some threshold, or perhaps doubles and ints are just special-cased in the compiler, haha.

0 Likes

#4

This doesn’t make any sense. Have you looked at the generated assembly code? I suspect a compiler bug or something.

0 Likes

#5

Not yet! Hoping this thread sparks discussion too :) We can summarize the results in the top post or some eventual wiki.

Are you talking about the first result or the second?

0 Likes

#6

Here are my results:

------------------------------------------------------------
Benchmark                     Time           CPU Iterations
------------------------------------------------------------
push                        732 ns        732 ns     955334
reserve_push                580 ns        580 ns    1206236
initialize_op_assign        260 ns        260 ns    2694619
BM_EigenElementAccess         0 ns          0 ns 1000000000
BM_EigenCoeff                 0 ns          0 ns 1000000000
BM_VectorElementAccess        0 ns          0 ns 1000000000

I thought the 0 ns results might be of more interest than anything else.
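
One thing worth double-checking on my end: a flat 0 ns often means the optimizer deleted the expression under test entirely. The usual guard in Google Benchmark is to route the result through DoNotOptimize, something like this (BM_GuardedRead is just a made-up name for illustration):

    #include <benchmark/benchmark.h>
    #include <vector>

    // Without DoNotOptimize, a load whose result is unused can be
    // removed entirely, and the benchmark reports ~0 ns.
    static void BM_GuardedRead(benchmark::State& state) {
      std::vector<double> v(1024, 1.0);
      for (auto _ : state)
        benchmark::DoNotOptimize(v[512]);  // keeps the read alive
    }
    BENCHMARK(BM_GuardedRead);
    BENCHMARK_MAIN();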

0 Likes

#7

Weird! I notice the results are printed as if they all ran in the same benchmark run. Did you link them into one binary? Maybe that caused the weirdness?

0 Likes

#8

Both.

0 Likes

#9

They were run separately. I manually removed the duplicated column heading here.

0 Likes

#10

Did you also try what happens if you insert() into a std::vector? The standard library should optimize that accordingly… at least, that is what I would expect. A hypothetical variant is sketched below.
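
Something like this, say (insert_range and kSize are made up here; the range overload is the one the library can turn into a single allocation plus a bulk copy for trivially copyable types):

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <vector>

    static constexpr std::size_t kSize = 1024;  // stand-in size

    // Hypothetical variant: bulk insert from an existing range. The
    // library can compute the final size up front, allocate once, and
    // copy the doubles in one shot instead of element by element.
    static void insert_range(benchmark::State& state) {
      std::vector<double> src(kSize, 1.0);
      for (auto _ : state) {
        std::vector<double> v;
        v.insert(v.end(), src.begin(), src.end());
        benchmark::DoNotOptimize(v.data());
      }
    }
    BENCHMARK(insert_range);
    // (plus BENCHMARK_MAIN() if built standalone)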

0 Likes

#11

That’s what I’d have expected for double. push_back is doing extra work compared to the unchecked set.

Exactly. Whether preallocating is faster depends on what’s being allocated. When you do this:

    std::vector<double> v(kSize);

it doesn’t have to do much beyond allocating the buffer and zero-filling it, which is cheap for a trivial type like double. But if you do this:

    std::vector<var> v(kSize);

it’s a very different story, because now the default constructor var() gets called kSize times to make sure v is initialized properly.
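
To make that concrete with a stand-in (Heavy below is just a hypothetical type with a nontrivial default constructor, playing the role of var):

    #include <vector>

    // Hypothetical stand-in for a type like var: a default
    // constructor that does real work for every element.
    struct Heavy {
      double value;
      Heavy() : value(0.0) { /* imagine bookkeeping here */ }
    };

    int main() {
      // double: one allocation plus a zero-fill, often a single memset.
      std::vector<double> a(1000);
      // Heavy: the default constructor runs 1000 times, element by
      // element, so sized construction is no longer nearly free.
      std::vector<Heavy> b(1000);
    }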

That’s surprising. What is Eigen doing wrong? This should just be the same memory dereferencing.

0 Likes

#12

Something we’ve verified recently is that threading performance results don’t hold across compiler versions (gcc 5 vs. gcc 6), across compilers (clang++ vs. g++), or across OSes (gcc 4 on Windows vs. gcc 4 on Linux).

I think it’s still pretty safe to assume that numeric computations are optimized similarly, although I’ve seen threads saying that’s not true for Intel. We should still check.

0 Likes

#13

Would it be of interest to develop a custom test suite via the Phoronix Test Suite?

http://www.phoronix-test-suite.com/

0 Likes

#14

That would be interesting! I’m hoping this week or next to write up something like a spec along those lines. If you have any time, feel free!

0 Likes

#17

@increasechief and @Stevo15025 If you’re writing a design doc, feel free to submit it as a pull request on the design-docs repo and paste the link here. Then we can all comment on it there and it can evolve with feedback. Thanks!

Do you have examples of these? I remember seeing performance improvements that appeared only for clang and not for gcc, but never an inversion of a benchmark result, where what seemed like the best answer on X was not the best answer on Y. Separately, we did see that the Mac Pro had a much steeper performance penalty for pointer AD stacks than all of our Mac laptops (hardware being a fourth dimension across which benchmarks can vary). But I don’t remember seeing benchmarks conflict about which code was faster, just about the magnitudes, right?

0 Likes

#18

Here you go: https://github.com/stan-dev/design-docs/pull/4.

0 Likes