Profiling C++ code



Everything in C++ is something to be careful about :-)

But it’s particularly dangerous at higher optimization levels, which will optimize away your code.

I really liked Agner Fog’s manuals on C++ optimization, but don’t recall what he does for profiling:


When it comes to optimizing debug build -Og is the one to go. My experience is it gives not-far-from-behind performance to -O1 build, though this depends on code structure and style.


Here’s a good post showing the way around LLVM, which I might suggest is useful in looking at the generated IR (instead of assembly) and seeing what you can do with a modular multi-phase compiler architecture.


Our code behaves very very differently at different optimization levels. It relies very heavily on -O3 optimization, particularly inlining and static template evaluation and code flow evaluation to remove unused branch predictions.


Thanks—that looks awesome! I’ve always wondered how this works.


I just got the performance test down from 1.67s to 1.11s through profiling, finding out that the hotspot was actually in matrix_mat_vari::chain, and changing an explicit Eigen allocation to a C++11 auto. This is why profiling is awesome!

I will submit a PR with this to make sure it’s not crazy. Also I haven’t had any issues - I just put this in my make/local:


Recompiled things, and on a Mac, ran

instruments -t "Time Profile" ./test/performance/logistic
open instrumentscli9.trace/

(9 there because this is my 9th invocation of the instruments profiler)

So excited - I think we can probably improve this particular thing over the codebase. With more performance tests, we can find more of these.


That’s awesome! Could you put up instructions on a wiki?

Mind putting up a screenshot of the profile output? I’m curious how you interpreted the results to determine that was the hot spot.



Yeah, going to spend a lot of tomorrow with it and talking this through with @Matthijs. Want to make sure I’m not crazy about this particular optimization though - here’s the PR: I think only the changes on lines 349 and 352 were necessary (changing the declaration of adjB to auto from VectorXd adjB(size_)), but wanted to change the rest as much as I easily could for symmetry.

Here’s a screenshot of Instruments reading the trace generated with the above command:

The really crucial thing with Instruments is to hit this checkbox on the Call Tree menu at the bottom:

Inverting the Call Tree lets you find the hotspots much more easily.

I’m still trying to figure out how to actually count method invocations, because that’s super useful sometimes. But Instruments crashes on my Mac when I try to do that, haha.

Profiling the execution of a model

I am maybe hijacking this thread too much, but I was wondering if anyone had tried turning on vectorization? It’s one of the first things Eigen’s FAQ recommends, but we don’t have it turned on anywhere:


rstan recommends on its wiki to turn on -march=native.


If you’re worried about that, start a new one.

Nope. We should try it and see what happens. The linked page says:

On the x86-64 architecture, SSE2 is generally enabled by default, but you can enable AVX and FMA for better performance

And I thought we were doing this:

On GCC and clang you can simply pass -march=native to let the compiler enables all instruction set that are supported by your CPU.

If not, we probably should!


I got nervous about this auto thing and finally sat down with a benchmarking library and figured out how to use it, partially by watching this pretty great talk:

So I wrote some benchmarks with Google Benchmark and the tricks in the talk, and it looks like auto is only faster with matrices that are actually vectors, for some reason. Can anyone verify or comment on this? If not I think I will remove the replacements of MatrixXd with auto from the PR.


I’m looking at the eigen pitfalls section and seeing under section C++11 and auto

C++11 & auto
In short: do not use the auto keywords with Eigen’s expressions, unless you are 100% sure about what you are doing. In particular, do not use the auto keyword as a replacement for a Matrix<> type. Here is an example:

MatrixXd A, B;
auto C = A*B;
for(...) { ... w = C * v;  ...}

In this example, the type of C is not a MatrixXd but an abstract expression representing a matrix product and storing references to A and B. Therefore, the product of A*B will be carried out multiple times, once per iteration of the for loop. Moreover, if the coefficients of A or B change during the iteration, then C will evaluate to different values.

I’m not an Eigen expert, but looking at your benchmark it looks similar. Not sure if you saw this or not so throwing it out there


Yep, I’m benchmarking auto vs. not using auto for specific operations and resulting use-cases. In the code I was looking at in the linked PR above, we’re doing a multiply and transpose ish and then copying the data out in a single loop. I was initially surprised that auto provided a 35% speedup on our sole performance test in Jenkins (the logistic regression one), so I dug in further. Seems like that’s only true when we’re dealing with Eigen Vectors, and not Eigen Matrices, for some reason. So now I’m wondering if anyone can rationalize that or knows more about it before I remove some of the updates in the PR.


Cool. I’ll have to watch the talk. And learn what auto _ means.

That sure does.

Just think through what the template expressions are doing. When you have matrix times matrix and Eigen leaves it as an expression, it’s a \mathcal{O}(n) operation to grab a member, and there’s no memory locality.

When you write it out to a base Matrix type, it gets evaluated once and copied. The copy is expensive, but chepaer than matrix transpose times matrix.


So to paraphrase, you’re saying that the amount of work each is doing is the same, but an allocation + copy beats the lazy version because of memory locality? Or are you saying the lazy version actually also does more work?


Just trace the arithmetic and memory locality. When two matrices are multiplied, the first matrix is indexed by row N x N times and the seocnd matrix is indexed by column N x N times. It’s best to transpose the first matrix once if its big enough not to fit in cache with the second matrix.


Sorry, was asking to try to make sure of some pretty basic things: There are no algorithms by which multiplying an entire matrix at once is faster than figuring out each cell answer individually, right? Unless you’re saying that when we do the entire operation at once, we can do that transpose first and get that locality, whereas when we go cell-by-cell (i.e. computing an answer for each a_{i, j} individually) we can’t do that?

IF there is just memory locality to think about and we ignore the transpose thing, I’m still not sure why the allocation + holistic multiply would beat the cell-by-cell…



In that case, there shouldn’t be any difference. By allocation plus holistic multiply, I’m assuming you mean allocation and assignment to resolve the expression template and then just ordinary matrix multiplication. When you do that, Eigen will transpose the left-hand side once for the purposes of memory locality. It should also do that if there’s an expression template, but maybe it can’t figure it out at that point.


Yeah, must not be able to figure it out. I might dive in to the assembly to see if I can verify that on these simple benchmarks…