Profiling C++ code



Does anyone have experience profiling C++ code? @yizhang maybe? We keep coming up with speed concerns and guessing at what is slow, but it seems like even benchmarking can be extremely misleading.

@Matthijs and @wds15 currently want to learn how to do this to figure out performance issues. What have you guys tried so far? For Windows, I’ve heard Very Sleepy can be a good first pass. For Mac, I’ve heard Instruments from Xcode is the tool to use. But I’m very curious whether someone has a tutorial or a list of gotchas (like how to deal with compilers that optimize our benchmarks away, etc.).
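On the "optimized away" gotcha, here is a minimal sketch of one common workaround, not a definitive recipe. The function names `sum_exp` and `seconds_per_call` are made up for illustration. The idea is to store each result in a `volatile` variable so the compiler has to treat it as observable, and to change the input slightly on every iteration so the call is less likely to be hoisted out of the loop. Benchmark libraries offer similar helpers (Google Benchmark has `benchmark::DoNotOptimize`), which may be a better choice than rolling your own.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical workload used only for illustration: sum of exp(x_i).
double sum_exp(const std::vector<double>& x) {
  double total = 0.0;
  for (double v : x) total += std::exp(v);
  return total;
}

// Returns the average wall-clock seconds per call of sum_exp.
// Precondition: reps > 0.
double seconds_per_call(std::vector<double> x, int reps) {
  // Stores to a volatile object count as observable behavior, so the
  // compiler cannot discard the computed result as dead code.
  volatile double sink = 0.0;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < reps; ++i) {
    // Perturb the input so each iteration depends on the loop counter,
    // which makes it harder to compute the result once and reuse it.
    if (!x.empty()) x[0] = 1e-9 * i;
    sink = sum_exp(x);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count() / reps;
}
```

Calling `seconds_per_call(std::vector<double>(100000, 0.5), 50)` from `main` and printing the result is enough to try it. It is still worth looking at the generated assembly to confirm the loop body survived.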


I am running end-to-end tests on models I trust to create a huge load. These runs have been sensitive to most changes I have made, but the approach is not very convenient.

I am open to better approaches.


For what it’s worth, I’ve used Tau and GPerfTools, though it’s been a while.


I use valgrind and tau. But for heavily templated code like Stan it’s worth examining template instantiation as well; for that we can try templight. Valgrind (cachegrind/callgrind) is the most accurate for multi-threaded runs but comes at the price of a significant slowdown.

For MPI runs, in addition to end-to-end tests, we need to break the profiling down into computation and communication. tau can examine both.
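If a full tool like tau is not at hand, the compute/communication split can be roughed out by hand. The sketch below is only an illustration under my own assumptions: `PhaseTimes`, `ScopedPhase`, and `one_iteration` are hypothetical names, and the "communicate" scope is an empty placeholder where the real MPI calls would go. It accumulates wall-clock time per named phase with an RAII guard.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical accumulator of wall-clock seconds per named phase.
class PhaseTimes {
 public:
  void add(const std::string& phase, double seconds) {
    totals_[phase] += seconds;
  }
  // Returns 0 for a phase that was never recorded.
  double total(const std::string& phase) const {
    const auto it = totals_.find(phase);
    return it == totals_.end() ? 0.0 : it->second;
  }

 private:
  std::map<std::string, double> totals_;
};

// RAII guard: adds the lifetime of the guard to the given phase.
class ScopedPhase {
 public:
  ScopedPhase(PhaseTimes& times, std::string phase)
      : times_(times),
        phase_(std::move(phase)),
        start_(std::chrono::steady_clock::now()) {}
  ~ScopedPhase() {
    const auto stop = std::chrono::steady_clock::now();
    times_.add(phase_, std::chrono::duration<double>(stop - start_).count());
  }
  ScopedPhase(const ScopedPhase&) = delete;
  ScopedPhase& operator=(const ScopedPhase&) = delete;

 private:
  PhaseTimes& times_;
  std::string phase_;
  std::chrono::steady_clock::time_point start_;
};

// Stand-in for one iteration of a distributed computation.
void one_iteration(PhaseTimes& times, std::vector<double>& data) {
  {
    ScopedPhase guard(times, "compute");
    for (double& v : data) v = v * 1.0001 + 1.0;
  }
  {
    ScopedPhase guard(times, "communicate");
    // Placeholder: the send/receive/reduce calls would go here.
  }
}
```

After a number of iterations, comparing `times.total("compute")` with `times.total("communicate")` on each rank gives a coarse picture of where the wall-clock time goes. The guards themselves add overhead, so this suits coarse phases rather than inner loops.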

I heard gperftools is much faster than valgrind but less accurate.


gperftools worked nicely for me before on some pretty low level Stan stuff: Adjoint sensitivities & Adjoint sensitivities (those are two different links – Discourse is automatically squishing them). Dunno about automated perf tests though.


Regardless of the profiler used, a lot of the time it’s easier to step through a debug build manually to see where the code is slow, then turn -O3 back on after fixing the suspects.


I used Xcode’s Instruments extensively when developing Nomad. Instruments has some super slick UI features, like mapping C++ to the corresponding machine instructions and runtime percentage, access to CPU diagnostic buffers, etc., that were extremely useful. That said, as with many performance tools, heavily inlined code can be hard to analyze.


I’ve only profiled on a Mac using Instruments. Unfortunately, the Time Profiler in Instruments didn’t give me useful results. If I recall correctly, it would indicate that a large percentage of time was spent in the error-checking functions. (When @bgoodri ran gdb, I believe he saw similar results.) But in code built without the debug instructions, that isn’t really where the time is being spent.

We discussed profiling quite a bit on the old Google Groups. Here are some relevant threads:

Here are a couple of points I remember off the top of my head:

  • instrumenting profilers that rely on debug symbols give results that aren’t representative for the type of code we have (heavily inlined and heavily templated)
  • statistical profilers show most of the time being spent in chain() methods (I think @bgoodri showed me how to do this once; maybe gprof?)


I have noticed this C++ profiler which seems interesting and cross-platform. However, I have not checked it myself yet.


I can add the following observation (which I’m sure is obvious to most of you):
based on my experience and the top answer here, I think that naively inserting timing statements in your C++ is something to be careful about.
Presumably, using a profiler is a better idea, but I don’t have any experience with those.
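For anyone who does end up inserting timing statements, here is a hedged sketch of a slightly less naive version. The helper names `time_runs` and `median` are my own, not from any library. It uses `std::chrono::steady_clock` (a monotonic clock), repeats the measurement, and summarizes with the median so that one noisy run does not dominate.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `work` `repeats` times and returns each run's wall-clock seconds.
std::vector<double> time_runs(const std::function<void()>& work,
                              std::size_t repeats) {
  std::vector<double> seconds;
  seconds.reserve(repeats);
  for (std::size_t i = 0; i < repeats; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return seconds;
}

// Median of a non-empty sample; takes a copy so the caller's data is
// left in its original order.
double median(std::vector<double> v) {
  std::sort(v.begin(), v.end());
  const std::size_t n = v.size();
  return n % 2 == 1 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}
```

Something like `median(time_runs(work, 20))` is more stable than a single start/stop pair, though it does not protect against the compiler removing the work being timed, and the `std::function` call adds a little overhead to very short workloads.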


This is what I tend to do, too, as it’s a reliable detector of large changes. We haven’t tended to worry about small ones, though they do add up, especially in memory contention.


Everything in C++ is something to be careful about :-)

But it’s particularly dangerous at higher optimization levels, where the compiler can optimize away the code you are trying to time.

I really liked Agner Fog’s manuals on C++ optimization, but don’t recall what he does for profiling:


For an optimized debug build, -Og is the way to go. In my experience it gives performance not far behind an -O1 build, though this depends on code structure and style.


Here’s a good post showing the way around LLVM, which I might suggest is useful in looking at the generated IR (instead of assembly) and seeing what you can do with a modular multi-phase compiler architecture.


Our code behaves very, very differently at different optimization levels. It relies very heavily on -O3 optimization, particularly inlining, static template evaluation, and code flow evaluation to remove unused branches.
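As a toy illustration of the "unused branches" point, and not actual Stan code, here is a sketch with a hypothetical function `log_checked` whose argument check is controlled by a template parameter. Because `CheckArgs` is a compile-time constant, an optimizing compiler can fold the condition and drop the check entirely in the `false` instantiation. How much actually gets removed depends on the compiler and the optimization level, which is one reason profiles of unoptimized builds can point at the wrong place.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical example: the check is selected at compile time.
template <bool CheckArgs>
double log_checked(double x) {
  // CheckArgs is a constant in each instantiation, so the optimizer can
  // treat this branch as dead code when CheckArgs is false.
  if (CheckArgs) {
    if (!(x > 0.0)) {
      throw std::domain_error("log_checked: x must be positive");
    }
  }
  return std::log(x);
}
```

Comparing the assembly for `log_checked<true>` and `log_checked<false>` at -O0 and at -O3 shows how differently the same source can end up, for example by compiling with `-S` or using an online compiler explorer.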


Thanks—that looks awesome! I’ve always wondered how this works.