Profiling C++ code

The rstan wiki recommends turning on -march=native.


If you're worried about that, start a new one.

Nope. We should try it and see what happens. The linked page says:

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

And I thought we were doing this:

On GCC and clang you can simply pass -march=native to let the compiler enable all instruction sets that are supported by your CPU.

If not, we probably should!
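If we want to verify what a given build actually turns on, the compiler's predefined macros make that easy to check. Here's a minimal sketch (the file name and compile command in the comment are just examples):

// Minimal check of which instruction-set macros the compiler defines.
// Build with, e.g.: g++ -O3 -march=native check_isa.cpp && ./a.out
#include <iostream>

int main() {
#ifdef __SSE2__
  std::cout << "SSE2 enabled\n";
#endif
#ifdef __AVX__
  std::cout << "AVX enabled\n";
#endif
#ifdef __FMA__
  std::cout << "FMA enabled\n";
#endif
  return 0;
}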

I got nervous about this auto thing and finally sat down with a benchmarking library and figured out how to use it, partially by watching this pretty great talk: https://www.youtube.com/watch?v=nXaxk27zwlk

So I wrote some benchmarks with Google Benchmark and the tricks in the talk, and it looks like auto is only faster with matrices that are actually vectors, for some reason. Can anyone verify or comment on this? If not I think I will remove the replacements of MatrixXd with auto from the PR.
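For the curious, the benchmarks had roughly this shape. This is a sketch, not the actual benchmark code from the PR; the matrix sizes and names are made up:

#include <benchmark/benchmark.h>
#include <Eigen/Dense>

static void BM_materialized(benchmark::State& state) {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(64, 64);
  Eigen::VectorXd v = Eigen::VectorXd::Random(64);
  for (auto _ : state) {
    Eigen::MatrixXd C = A * B;   // product evaluated once into storage
    Eigen::VectorXd w = C * v;
    benchmark::DoNotOptimize(w);
  }
}
BENCHMARK(BM_materialized);

static void BM_lazy_auto(benchmark::State& state) {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(64, 64);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(64, 64);
  Eigen::VectorXd v = Eigen::VectorXd::Random(64);
  for (auto _ : state) {
    auto C = A * B;              // unevaluated product expression
    Eigen::VectorXd w = C * v;   // forces the evaluation here instead
    benchmark::DoNotOptimize(w);
  }
}
BENCHMARK(BM_lazy_auto);

BENCHMARK_MAIN();

(The auto _ : state loop is Google Benchmark's idiom for marking the timed region, and DoNotOptimize keeps the compiler from deleting the result.)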


I'm looking at the Eigen pitfalls section and seeing this under the section C++11 and auto:

C++11 & auto
In short: do not use the auto keyword with Eigen's expressions, unless you are 100% sure about what you are doing. In particular, do not use the auto keyword as a replacement for a Matrix<> type. Here is an example:

MatrixXd A, B;
auto C = A * B;                  // C is an unevaluated product expression, not a MatrixXd
for(...) { ... w = C * v;  ...}  // the product A*B is recomputed at every use of C

In this example, the type of C is not a MatrixXd but an abstract expression representing a matrix product and storing references to A and B. Therefore, the product of A*B will be carried out multiple times, once per iteration of the for loop. Moreover, if the coefficients of A or B change during the iteration, then C will evaluate to different values.

I'm not an Eigen expert, but looking at your benchmark it looks similar. Not sure if you saw this or not, so throwing it out there.
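For completeness, the fix suggested alongside that pitfall is to force evaluation when you actually want a stored result. A minimal sketch, with made-up function and argument names:

#include <Eigen/Dense>

void reuse_product(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
                   const Eigen::VectorXd& v, Eigen::VectorXd& w) {
  auto C = (A * B).eval();  // .eval() materializes the product; C is a MatrixXd
  for (int k = 0; k < 10; ++k)
    w += C * v;             // reuses the stored result instead of re-multiplying
}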

Yep, I'm benchmarking auto vs. not using auto for specific operations and the resulting use-cases. In the code I was looking at in the linked PR above, we're doing a multiply-and-transpose-ish operation and then copying the data out in a single loop. I was initially surprised that auto provided a 35% speedup on our sole performance test in Jenkins (the logistic regression one), so I dug in further. Seems like that's only true when we're dealing with Eigen Vectors, and not Eigen Matrices, for some reason. So now I'm wondering if anyone can rationalize that or knows more about it before I remove some of the updates in the PR.

Cool. I'll have to watch the talk. And learn what auto _ means.

That sure does.

Just think through what the template expressions are doing. When you have matrix times matrix and Eigen leaves it as an expression, it's an \mathcal{O}(n) operation to grab a single coefficient, and there's no memory locality.

When you write it out to a base Matrix type, it gets evaluated once and copied. The copy is expensive, but cheaper than matrix transpose times matrix.
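To make that cost model concrete, here's an illustrative sketch. It uses Eigen's lazyProduct to keep the coefficient-based expression around, which is not exactly what auto gives you for a plain product, but it shows the per-coefficient cost:

#include <Eigen/Dense>

// Reading every entry of a lazy product costs an O(n) dot product per
// coefficient, with strided access into A and B. Evaluating first pays for
// one optimized multiply plus a copy, after which reads are contiguous.
double sum_lazy(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
  auto C = A.lazyProduct(B);      // expression; no storage allocated
  double s = 0;
  for (int i = 0; i < C.rows(); ++i)
    for (int j = 0; j < C.cols(); ++j)
      s += C(i, j);               // dot(row i of A, col j of B) every time
  return s;
}

double sum_eval(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B) {
  Eigen::MatrixXd C = A * B;      // evaluated once into contiguous storage
  return C.sum();                 // cache-friendly linear pass
}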

So to paraphrase, you're saying that the amount of work each is doing is the same, but an allocation + copy beats the lazy version because of memory locality? Or are you saying the lazy version actually also does more work?

Just trace the arithmetic and memory locality. When two matrices are multiplied, the first matrix is indexed by row N x N times and the second matrix is indexed by column N x N times. It's best to transpose the first matrix once if it's big enough not to fit in cache with the second matrix.
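A hand-rolled triple loop makes the access pattern concrete. This is a sketch only, assuming column-major storage like Eigen's default; it is not Eigen's actual kernel:

#include <vector>

// C = A * B for column-major n x n matrices stored as flat arrays.
// Hypothetical helper, for illustration only.
void multiply_transposed(const std::vector<double>& A,
                         const std::vector<double>& B,
                         std::vector<double>& C, int n) {
  std::vector<double> At(n * n);
  for (int i = 0; i < n; ++i)        // one-time O(n^2) transpose of A
    for (int j = 0; j < n; ++j)
      At[j + i * n] = A[i + j * n];  // column i of At holds row i of A
  for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i) {
      double s = 0;
      for (int k = 0; k < n; ++k)
        s += At[k + i * n] * B[k + j * n];  // both reads are unit-stride
      C[i + j * n] = s;
    }
}

Without the up-front transpose, the inner loop would read A at stride n, which is what destroys locality once the matrices no longer fit in cache.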

Sorry, was asking to try to make sure of some pretty basic things: There are no algorithms by which multiplying an entire matrix at once is faster than figuring out each cell answer individually, right? Unless you're saying that when we do the entire operation at once, we can do that transpose first and get that locality, whereas when we go cell-by-cell (i.e. computing an answer for each a_{i, j} individually) we can't do that?

If there is just memory locality to think about and we ignore the transpose thing, I'm still not sure why the allocation + holistic multiply would beat the cell-by-cell…

Exactly.

In that case, there shouldn't be any difference. By allocation plus holistic multiply, I'm assuming you mean allocation and assignment to resolve the expression template and then just ordinary matrix multiplication. When you do that, Eigen will transpose the left-hand side once for the purposes of memory locality. It should also do that if there's an expression template, but maybe it can't figure it out at that point.


Yeah, must not be able to figure it out. I might dive into the assembly to see if I can verify that on these simple benchmarks…