Profiling C++ code

math

#1

Does anyone have experience profiling C++ code? @yizhang maybe? We keep coming up with speed concerns and guessing at what is slow, but it seems like even benchmarking can be extremely misleading.

@Matthijs and @wds15 currently want to learn how to do this to figure out performance issues. What have you guys tried so far? For Windows, I’ve heard Very Sleepy can be a good first pass. For Mac, I’ve heard Instruments from Xcode is the tool to use. But I’m very curious if someone has a tutorial or a list of gotchas (like how to deal with compilers that optimize our benchmarks away, etc.).


#2

I run end-to-end tests on models I trust to create a heavy load. This has been sensitive to most of the changes I’ve made, but it is not very convenient.

I am open to better approaches.


#3

For what it’s worth, I’ve used Tau and GPerfTools, though it’s been a while.


#4

I use Valgrind and TAU. But for heavily templated code like Stan’s it’s worth examining template instantiation as well; for that we can try Templight. Valgrind’s Cachegrind/Callgrind is the most accurate with multi-threaded runs, but that accuracy comes at the price of a significant slowdown.

For MPI runs, in addition to end-to-end tests, we need to break the profiling down into computation and communication. TAU can examine both.

I’ve heard gperftools is much faster than Valgrind but less accurate.


#5

gperftools worked nicely for me before on some pretty low level Stan stuff: Adjoint sensitivities & Adjoint sensitivities (those are two different links – Discourse is automatically squishing them). Dunno about automated perf tests though.


#6

Regardless of the profiler used, a lot of the time it’s easier to manually step through a debug build to see where the code moves slowly, then turn on -O3 after fixing the suspects.


#7

I used Xcode’s Instruments extensively when developing Nomad. Instruments has some super slick UI features, like mapping C++ to the corresponding machine instructions and runtime percentages, access to CPU diagnostic buffers, etc., that were extremely useful. That said, as with many performance tools, heavily inlined code can be hard to analyze.


#8

I’ve only profiled on a Mac using Instruments. Unfortunately, the Time Profiler functionality of Instruments didn’t actually provide useful results. If I recall, it would indicate that a large percentage of time was spent in the error-checking functions. (When @bgoodri ran gdb, I believe he saw similar results.) But once the code is compiled without the debug instrumentation, that isn’t really where the time is spent.

We discussed profiling quite a bit on the old google groups. Here are some relevant threads:

Here are a couple points I remember off the top of my head:

  • instrumenting profilers that rely on debug symbols aren’t representative for the type of code we have (heavily inlined and heavily templated)
  • statistical profilers show most of the time being spent in chain() methods (I think @bgoodri showed me how to do this once; maybe gprof?)

#9

I have noticed this C++ profiler https://github.com/yse/easy_profiler which seems interesting and cross-platform. However, I have not checked it myself yet.


#10

I can add the following observation (which I’m sure is obvious to most of you):
based on my experience and the top answer here, I think that naively inserting timing statements in your C++ is something to be careful about.
Presumably, using a profiler is a better idea, but I don’t have any experience with those.


#11

This is what I tend to do, too, as it’s a reliable detector of large changes. We haven’t tended to worry about small ones, though they do add up, especially in memory contention.


#12

Everything in C++ is something to be careful about :-)

But it’s particularly dangerous at higher optimization levels, which will optimize away your code.

I really liked Agner Fog’s manuals on C++ optimization, but don’t recall what he does for profiling: http://www.agner.org/optimize/


#13

When it comes to an optimized debug build, -Og is the one to go with. In my experience it gives performance not far behind an -O1 build, though this depends on code structure and style.
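In Stan’s build that might look like the following in make/local (a sketch, following the CXXFLAGS convention used elsewhere in this thread; exact flag behavior depends on your compiler):

```make
# Debug-friendly profiling build: -Og keeps code debuggable while
# avoiding most of -O0's distortions; -g emits the symbols profilers need.
CXXFLAGS += -Og -g
```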


#14

Here’s a good post showing the way around LLVM, which I might suggest is useful in looking at the generated IR (instead of assembly) and seeing what you can do with a modular multi-phase compiler architecture.


#15

Our code behaves very, very differently at different optimization levels. It relies heavily on -O3 optimizations, particularly inlining, static template evaluation, and control-flow analysis to remove unused branches.


#16

Thanks—that looks awesome! I’ve always wondered how this works.


#17

I just got the performance test down from 1.67s to 1.11s through profiling, finding out that the hotspot was actually in matrix_mat_vari::chain, and changing an explicit Eigen allocation to a C++11 auto. This is why profiling is awesome!

I will submit a PR with this to make sure it’s not crazy. Also I haven’t had any issues - I just put this in my make/local:

CXXFLAGS += -g

Recompiled things, and on a Mac, ran

instruments -t "Time Profiler" ./test/performance/logistic
open instrumentscli9.trace/

(9 there because this is my 9th invocation of the instruments profiler)

So excited - I think we can probably improve this particular thing over the codebase. With more performance tests, we can find more of these.


#18

That’s awesome! Could you put up instructions on a wiki?

Mind putting up a screenshot of the profile output? I’m curious how you interpreted the results to determine that was the hot spot.

AWESOME!


#19

Yeah, I’m going to spend a lot of tomorrow with it, talking this through with @Matthijs. I want to make sure I’m not crazy about this particular optimization though - here’s the PR: https://github.com/stan-dev/math/pull/801. I think only the changes on lines 349 and 352 were necessary (changing the declaration of adjB from VectorXd adjB(size_) to auto), but I wanted to change the rest as much as I easily could for symmetry.

Here’s a screenshot of Instruments reading the trace generated with the above command:

The really crucial thing with Instruments is to hit this checkbox on the Call Tree menu at the bottom:


Inverting the Call Tree lets you find the hotspots much more easily.

I’m still trying to figure out how to actually count method invocations, because that’s super useful sometimes. But Instruments crashes on my Mac when I try to do that, haha.


#20

I am maybe hijacking this thread too much, but I was wondering if anyone had tried turning on vectorization? It’s one of the first things Eigen’s FAQ recommends, but we don’t have it turned on anywhere:

http://eigen.tuxfamily.org/index.php?title=FAQ#How_can_I_enable_vectorization.3F
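For what it’s worth, Eigen’s vectorization is driven by the compiler’s target-architecture flags: on x86-64, SSE2 is on by default, and wider SIMD (AVX and friends) needs an explicit flag. A make/local sketch, assuming GCC/Clang (flag availability depends on the compiler):

```make
# Let Eigen use the build machine's full SIMD capability.
# (On x86-64, SSE2 is already on by default; this enables AVX etc.)
CXXFLAGS += -march=native
```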