Profiling C++ code

math

#1

Does anyone have experience profiling C++ code? @yizhang maybe? We keep coming up with speed concerns and guessing at what is slow, but it seems like even benchmarking can be extremely misleading.

@Matthijs and @wds15 currently want to learn how to do this to figure out performance issues. What have you guys tried so far? For Windows, I’ve heard Very Sleepy can be a good first pass. For Mac, I’ve heard Instruments from Xcode is the tool to use. But I’m very curious if someone has a tutorial or a list of gotchas (like how to deal with compilers that optimize our benchmarks away, etc.).


#2

I run end-to-end tests on models I trust to create a heavy load. This has been sensitive to most of the changes I’ve made, but it is not very convenient.

I am open to better approaches.


#3

For what it’s worth, I’ve used Tau and GPerfTools, though it’s been a while.


#4

I use Valgrind and TAU. But for heavily templated code like Stan’s it’s worth examining template instantiation as well; for that we can try Templight. Valgrind’s Cachegrind/Callgrind is the most accurate with multi-threaded runs, but that accuracy comes at the price of a significant slowdown.

For MPI runs, in addition to end-to-end tests, we need to break the profiling down into computation and communication. TAU can examine both.

I’ve heard gperftools is much faster than Valgrind but less accurate.


#5

gperftools worked nicely for me before on some pretty low level Stan stuff: Adjoint sensitivities & Adjoint sensitivities (those are two different links – Discourse is automatically squishing them). Dunno about automated perf tests though.


#6

Regardless of the profiler used, a lot of the time it’s easier to manually step through a debug build to see where the code moves slowly, then turn on -O3 after fixing the suspects.


#7

I used Xcode’s Instruments extensively when developing Nomad. Instruments has some super slick UI features, like mapping C++ to the corresponding machine instructions and runtime percentages, access to CPU diagnostic buffers, etc., that were extremely useful. That said, as with many performance tools, heavily inlined code can be hard to analyze.


#8

I’ve only profiled on a Mac using Instruments. Unfortunately, the Time Profiler functionality of Instruments didn’t actually provide useful results. If I recall, it would indicate that a large percentage of time was spent in the error-checking functions. (When @bgoodri ran gdb, I believe he saw similar results.) But once the code is compiled without the debug instrumentation, that isn’t really where the time is spent.

We discussed profiling quite a bit on the old google groups. Here are some relevant threads:

Here are a couple points I remember off the top of my head:

  • instrumenting profilers that rely on debug symbols aren’t representative for the type of code we have (heavily inlined and heavily templated)
  • statistical profilers show most of the time being spent in chain() methods (I think @bgoodri showed me how to do this once; maybe gprof?)

#9

I have noticed this C++ profiler https://github.com/yse/easy_profiler which seems interesting and cross-platform. However, I have not checked it myself yet.


#10

I can add the following observation (which I’m sure is obvious to most of you):
based on my experience and the top answer here, I think that naively inserting timing statements in your C++ is something to be careful about.
Presumably, using a profiler is a better idea, but I don’t have any experience with those.


#11

This is what I tend to do, too, as it’s a reliable detector of large changes. We haven’t tended to worry about small ones, though they do add up, especially in memory contention.


#12

Everything in C++ is something to be careful about :-)

But it’s particularly dangerous at higher optimization levels, which will optimize away your code.

I really liked Agner Fog’s manuals on C++ optimization, but don’t recall what he does for profiling: http://www.agner.org/optimize/


#13

When it comes to an optimized debug build, -Og is the one to go with. In my experience it gives performance not far behind an -O1 build, though this depends on code structure and style.
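In Stan’s build that might look like the following in make/local (a sketch, following the CXXFLAGS convention used elsewhere in this thread; exact flag behavior depends on your compiler):

```make
# Debug-friendly profiling build: -Og keeps code debuggable while
# avoiding most of -O0's distortions; -g emits the symbols profilers need.
CXXFLAGS += -Og -g
```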


#14

Here’s a good post showing the way around LLVM, which I might suggest is useful in looking at the generated IR (instead of assembly) and seeing what you can do with a modular multi-phase compiler architecture.


#15

Our code behaves very, very differently at different optimization levels. It relies heavily on -O3 optimizations, particularly inlining, static template evaluation, and control-flow analysis to remove unused branches.


#16

Thanks—that looks awesome! I’ve always wondered how this works.


#17

I just got the performance test down from 1.67s to 1.11s through profiling, finding out that the hotspot was actually in matrix_mat_vari::chain, and changing an explicit Eigen allocation to a C++11 auto. This is why profiling is awesome!

I will submit a PR with this to make sure it’s not crazy. Also I haven’t had any issues - I just put this in my make/local:

CXXFLAGS += -g

Recompiled things, and on a Mac, ran

instruments -t "Time Profiler" ./test/performance/logistic
open instrumentscli9.trace/

(9 there because this is my 9th invocation of the instruments profiler)

So excited - I think we can probably improve this particular thing over the codebase. With more performance tests, we can find more of these.


#18

That’s awesome! Could you put up instructions on a wiki?

Mind putting up a screenshot of the profile output? I’m curious how you interpreted the results to determine that was the hot spot.

AWESOME!


#19

Yeah, I’m going to spend a lot of tomorrow with it, talking this through with @Matthijs. I want to make sure I’m not crazy about this particular optimization though - here’s the PR: https://github.com/stan-dev/math/pull/801. I think only the changes on lines 349 and 352 were necessary (changing the declaration of adjB from VectorXd adjB(size_) to auto), but I wanted to change the rest as much as I easily could for symmetry.

Here’s a screenshot of Instruments reading the trace generated with the above command:

The really crucial thing with Instruments is to hit this checkbox on the Call Tree menu at the bottom:


Inverting the Call Tree lets you find the hotspots much more easily.

I’m still trying to figure out how to actually count method invocations, because that’s super useful sometimes. But Instruments crashes on my Mac when I try to do that, haha.


#20

I am maybe hijacking this thread too much, but I was wondering if anyone had tried turning on vectorization? It’s one of the first things Eigen’s FAQ recommends, but we don’t have it turned on anywhere:

http://eigen.tuxfamily.org/index.php?title=FAQ#How_can_I_enable_vectorization.3F
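For what it’s worth, Eigen’s vectorization is driven by the compiler’s target-architecture flags: on x86-64, SSE2 is on by default, and wider SIMD (AVX and friends) needs an explicit flag. A make/local sketch, assuming GCC/Clang (flag availability depends on the compiler):

```make
# Let Eigen use the build machine's full SIMD capability.
# (On x86-64, SSE2 is already on by default; this enables AVX etc.)
CXXFLAGS += -march=native
```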