Profiling C++ code

The rstan wiki recommends turning on -march=native.


If you're worried about that, start a new one.

Nope. We should try it and see what happens. The linked page says:

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

And I thought we were doing this:

On GCC and clang you can simply pass -march=native to let the compiler enable all instruction sets that are supported by your CPU.

If not, we probably should!
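If we want to verify what a given build actually turns on, the compiler's predefined macros make that easy to check. Here's a minimal sketch (the file name and compile command in the comment are just examples):

// Minimal check of which instruction-set macros the compiler defines.
// Build with, e.g.: g++ -O3 -march=native check_isa.cpp && ./a.out
#include <iostream>

int main() {
#ifdef __SSE2__
  std::cout << "SSE2 enabled\n";
#endif
#ifdef __AVX__
  std::cout << "AVX enabled\n";
#endif
#ifdef __FMA__
  std::cout << "FMA enabled\n";
#endif
  return 0;
}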

I got nervous about this auto thing and finally sat down with a benchmarking library and figured out how to use it, partially by watching this pretty great talk: https://www.youtube.com/watch?v=nXaxk27zwlk

So I wrote some benchmarks with Google Benchmark and the tricks in the talk, and it looks like auto is only faster with matrices that are actually vectors, for some reason. Can anyone verify or comment on this? If not I think I will remove the replacements of MatrixXd with auto from the PR.
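For the curious, the benchmarks had roughly this shape. This is a sketch, not the actual benchmark code from the PR; the matrix sizes and names are made up:

#include <benchmark/benchmark.h>
#include <Eigen/Dense>

static void BM_materialized(benchmark::State& state) {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(64, 64);
  Eigen::VectorXd v = Eigen::VectorXd::Random(64);
  for (auto _ : state) {
    Eigen::MatrixXd C = A * B;   // product evaluated once into storage
    Eigen::VectorXd w = C * v;
    benchmark::DoNotOptimize(w);
  }
}
BENCHMARK(BM_materialized);

static void BM_lazy_auto(benchmark::State& state) {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(64, 64);
  Eigen::VectorXd v = Eigen::VectorXd::Random(64);
  for (auto _ : state) {
    auto C = A * B;              // unevaluated product expression
    Eigen::VectorXd w = C * v;   // forces the evaluation here instead
    benchmark::DoNotOptimize(w);
  }
}
BENCHMARK(BM_lazy_auto);

BENCHMARK_MAIN();

(The auto _ : state loop is Google Benchmark's idiom for marking the timed region, and DoNotOptimize keeps the compiler from deleting the result.)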


I'm looking at the Eigen pitfalls section and seeing this under the section C++11 and auto:

C++11 & auto
In short: do not use the auto keyword with Eigen's expressions, unless you are 100% sure about what you are doing. In particular, do not use the auto keyword as a replacement for a Matrix<> type. Here is an example:

MatrixXd A, B;
auto C = A * B;                  // C is an unevaluated product expression, not a MatrixXd
for(...) { ... w = C * v;  ...}  // the product A*B is recomputed at every use of C

In this example, the type of C is not a MatrixXd but an abstract expression representing a matrix product and storing references to A and B. Therefore, the product of A*B will be carried out multiple times, once per iteration of the for loop. Moreover, if the coefficients of A or B change during the iteration, then C will evaluate to different values.

I'm not an Eigen expert, but looking at your benchmark it looks similar. Not sure if you saw this or not, so throwing it out there.
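For completeness, the fix suggested alongside that pitfall is to force evaluation when you actually want a stored result. A minimal sketch, with made-up function and argument names:

#include <Eigen/Dense>

void reuse_product(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
                   const Eigen::VectorXd& v, Eigen::VectorXd& w) {
  auto C = (A * B).eval();  // .eval() materializes the product; C is a MatrixXd
  for (int k = 0; k < 10; ++k)
    w += C * v;             // reuses the stored result instead of re-multiplying
}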

Yep, I'm benchmarking auto vs. not using auto for specific operations and the resulting use-cases. In the code I was looking at in the linked PR above, we're doing a multiply-and-transpose-ish operation and then copying the data out in a single loop. I was initially surprised that auto provided a 35% speedup on our sole performance test in Jenkins (the logistic regression one), so I dug in further. Seems like that's only true when we're dealing with Eigen Vectors, and not Eigen Matrices, for some reason. So now I'm wondering if anyone can rationalize that or knows more about it before I remove some of the updates in the PR.

Cool. I'll have to watch the talk. And learn what auto _ means.

That sure does.

Just think through what the template expressions are doing. When you have matrix times matrix and Eigen leaves it as an expression, it's an \mathcal{O}(n) operation to grab a single coefficient, and there's no memory locality.

When you write it out to a base Matrix type, it gets evaluated once and copied. The copy is expensive, but cheaper than matrix transpose times matrix.
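To make that cost model concrete, here's an illustrative sketch. It uses Eigen's lazyProduct to keep the coefficient-based expression around, which is not exactly what auto gives you for a plain product, but it shows the per-coefficient cost:

#include <Eigen/Dense>

// Reading every entry of a lazy product costs an O(n) dot product per
// coefficient, with strided access into A and B. Evaluating first pays for
// one optimized multiply plus a copy, after which reads are contiguous.
double sum_lazy(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
  auto C = A.lazyProduct(B);      // expression; no storage allocated
  double s = 0;
  for (int i = 0; i < C.rows(); ++i)
    for (int j = 0; j < C.cols(); ++j)
      s += C(i, j);               // dot(row i of A, col j of B) every time
  return s;
}

double sum_eval(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
  Eigen::MatrixXd C = A * B;      // evaluated once into contiguous storage
  return C.sum();                 // cache-friendly linear pass
}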

So to paraphrase, you're saying that the amount of work each is doing is the same, but an allocation + copy beats the lazy version because of memory locality? Or are you saying the lazy version actually also does more work?

Just trace the arithmetic and memory locality. When two matrices are multiplied, the first matrix is indexed by row N x N times and the second matrix is indexed by column N x N times. It's best to transpose the first matrix once if it's big enough not to fit in cache with the second matrix.
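A hand-rolled triple loop makes the access pattern concrete. This is a sketch only, assuming column-major storage like Eigen's default; it is not Eigen's actual kernel:

#include <vector>

// C = A * B for column-major n x n matrices stored as flat arrays.
// Hypothetical helper, for illustration only.
void multiply_transposed(const std::vector<double>& A,
                         const std::vector<double>& B,
                         std::vector<double>& C, int n) {
  std::vector<double> At(n * n);
  for (int i = 0; i < n; ++i)        // one-time O(n^2) transpose of A
    for (int j = 0; j < n; ++j)
      At[j + i * n] = A[i + j * n];  // column i of At holds row i of A
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i) {
      double s = 0;
      for (int k = 0; k < n; ++k)
        s += At[k + i * n] * B[k + j * n];  // both reads are unit-stride
      C[i + j * n] = s;
    }
}

Without the up-front transpose, the inner loop would read A at stride n, which is what destroys locality once the matrices no longer fit in cache.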

Sorry, was asking to try to make sure of some pretty basic things: There are no algorithms by which multiplying an entire matrix at once is faster than figuring out each cell answer individually, right? Unless you're saying that when we do the entire operation at once, we can do that transpose first and get that locality, whereas when we go cell-by-cell (i.e. computing an answer for each a_{i, j} individually) we can't do that?

If there is just memory locality to think about and we ignore the transpose thing, I'm still not sure why the allocation + holistic multiply would beat the cell-by-cell…

Exactly.

In that case, there shouldn't be any difference. By allocation plus holistic multiply, I'm assuming you mean allocation and assignment to resolve the expression template and then just ordinary matrix multiplication. When you do that, Eigen will transpose the left-hand side once for the purposes of memory locality. It should also do that if there's an expression template, but maybe it can't figure it out at that point.


Yeah, must not be able to figure it out. I might dive into the assembly to see if I can verify that on these simple benchmarks…