Does anyone have experience profiling C++ code? @yizhang maybe? We keep coming up with speed concerns and guessing at what is slow, but it seems like even benchmarking can be extremely misleading.
@Matthijs and @wds15 currently want to learn how to do this to figure out performance issues. What have you guys tried so far? For Windows, I’ve heard Very Sleepy can be a good first pass. For Mac, I’ve heard Instruments from Xcode is the tool to use. But I’m very curious if someone has a tutorial or a list of gotchas (like how to deal with compilers that optimize our benchmarks away, etc.).
I use valgrind and TAU. But for heavily templated code like Stan’s it’s worth examining template instantiation as well; for that we can try Templight. Valgrind/Cachegrind/Callgrind is the most accurate for multi-threaded runs, but it comes at the price of a significant slowdown.
For MPI runs, in addition to end-to-end tests, we need to break the profiling down into computation and communication. TAU can examine both.
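For anyone trying Callgrind, the basic workflow is roughly the following (the binary path is just an example, and expect the instrumented run to be dramatically slower than a native one):

valgrind --tool=callgrind ./test/performance/logistic
callgrind_annotate callgrind.out.&lt;pid&gt;

The second step prints a per-function cost breakdown; opening the same callgrind.out file in KCachegrind/QCachegrind gives the call-graph view.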
I heard gperftools is much faster than valgrind but less accurate.
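In case it helps, the gperftools CPU-profiler workflow is roughly: link against -lprofiler, run with the CPUPROFILE environment variable pointing at an output file, then inspect the result with pprof (file and binary names below are just placeholders):

g++ -O3 -g my_test.cpp -o my_test -lprofiler
CPUPROFILE=/tmp/my_test.prof ./my_test
pprof --text ./my_test /tmp/my_test.prof

Because it’s a sampling profiler rather than an instrumenting one, the run stays close to native speed, which is where the speed/accuracy trade-off comes from.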
gperftools worked nicely for me before on some pretty low level Stan stuff: Adjoint sensitivities & Adjoint sensitivities (those are two different links – Discourse is automatically squishing them). Dunno about automated perf tests though.
I used Xcode’s Instruments extensively when developing Nomad. Instruments has some super slick UI features, like mapping C++ to the corresponding machine instructions and runtime percentages, access to CPU diagnostic buffers, etc., that were extremely useful. That said, as with many performance tools, heavily inlined code can be hard to analyze.
I’ve only profiled on a Mac using Instruments. Unfortunately, the Time Profiler in Instruments didn’t actually provide useful results. If I recall correctly, it would indicate that a large percentage of the time was spent in the error-checking functions. (When @bgoodri ran gdb, I believe he saw similar results.) But when running code built without the debug instrumentation, that isn’t really where the time is spent.
We discussed profiling quite a bit on the old google groups. Here are some relevant threads:
I can add the following observation (which I’m sure is obvious to most of you):
based on my experience and the top answer here, I think that naively inserting timing statements in your C++ is something to be careful about.
Presumably, using a profiler is a better idea, but I don’t have any experience with those.
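To make the pitfall concrete: at -O3 the compiler may delete the very work you are timing if its result is never used, so a naive timing harness can report near-zero cost. A minimal sketch (the computation is a made-up placeholder):

#include <chrono>
#include <iostream>

// Placeholder for whatever we actually want to time.
double expensive_computation() {
  double s = 0;
  for (int i = 1; i < 1000000; ++i) s += 1.0 / i;
  return s;
}

int main() {
  using clock = std::chrono::steady_clock;
  auto start = clock::now();
  double result = 0;
  for (int rep = 0; rep < 100; ++rep)
    result += expensive_computation();
  auto stop = clock::now();
  // Printing the accumulated result keeps the optimizer from treating
  // the loop as dead code; time an unused value and you may be timing
  // nothing at all.
  std::cout << "result = " << result << "\n"
            << std::chrono::duration<double>(stop - start).count()
            << " seconds\n";
}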
Here’s a good post showing the way around LLVM, which I might suggest is useful in looking at the generated IR (instead of assembly) and seeing what you can do with a modular multi-phase compiler architecture.
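If you want to look at the IR for a particular translation unit yourself, clang can dump it directly (file names are placeholders):

clang++ -O3 -S -emit-llvm my_file.cpp -o my_file.ll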
Our code behaves very differently at different optimization levels. It relies heavily on -O3 optimization, particularly inlining, static template evaluation, and control-flow analysis, to eliminate branches that can be resolved at compile time.
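A toy illustration of the kind of thing -O3 buys us (made-up code, not from the Stan sources): a check guarded by a compile-time template constant can be dropped entirely once the call is inlined and the constant is folded.

#include <iostream>

// Toy stand-in for a trait-driven check.
template <bool check_range>
double lookup(const double* x, int i, int n) {
  if (check_range) {  // constant per instantiation
    if (i < 0 || i >= n) return 0.0;
  }
  return x[i];
}

int main() {
  double data[3] = {1.0, 2.0, 3.0};
  // With -O3, lookup<false> is inlined and the dead range-check branch
  // disappears from the generated code; at -O0 the call isn't even inlined.
  std::cout << lookup<false>(data, 1, 3) << "\n";
}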
I just got the performance test down from 1.67s to 1.11s through profiling, finding out that the hotspot was actually in matrix_mat_vari::chain, and changing an explicit Eigen allocation to a C++11 auto. This is why profiling is awesome!
I will submit a PR with this to make sure it’s not crazy. Also I haven’t had any issues - I just put this in my make/local:
CXXFLAGS += -g
Recompiled things, and on a Mac, ran
instruments -t "Time Profiler" ./test/performance/logistic
(9 there because this is my 9th invocation of the instruments profiler)
So excited - I think we can probably improve this particular thing over the codebase. With more performance tests, we can find more of these.
Yeah, going to spend a lot of tomorrow with it and talking this through with @Matthijs. Want to make sure I’m not crazy about this particular optimization though - here’s the PR: https://github.com/stan-dev/math/pull/801. I think only the changes on lines 349 and 352 were necessary (changing the declaration of adjB to auto from VectorXd adjB(size_)), but wanted to change the rest as much as I easily could for symmetry.
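For anyone following along, the general pattern (simplified, made-up names, not the actual PR diff) is that an explicit VectorXd forces the product to be evaluated into a temporary right away, while auto keeps the unevaluated Eigen expression so the work can fuse into the statement that consumes it:

#include <Eigen/Dense>

// Eager: the product is materialized into a temporary vector up front.
void eager(const Eigen::MatrixXd& A, const Eigen::VectorXd& v,
           Eigen::VectorXd& out) {
  Eigen::VectorXd adjB = A.transpose() * v;
  out += 2.0 * adjB;
}

// Lazy: adjB is an unevaluated expression; it is only computed in the
// line that uses it, which can avoid the extra allocation and copy.
// Caveat: the expression references A and v, so they must stay alive
// and unmodified until it is consumed.
void lazy(const Eigen::MatrixXd& A, const Eigen::VectorXd& v,
          Eigen::VectorXd& out) {
  auto adjB = A.transpose() * v;
  out += 2.0 * adjB;
}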
Here’s a screenshot of Instruments reading the trace generated with the above command:
I am maybe hijacking this thread too much, but has anyone tried turning on vectorization? It’s one of the first things Eigen’s FAQ recommends, but we don’t have it turned on anywhere.
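For what it’s worth, Eigen’s SIMD code paths are enabled by whatever instruction sets the compiler is allowed to emit, so the simplest experiment is probably a make/local addition along these lines (flag choice is only a suggestion; -march=native ties the binary to the build machine’s CPU):

# let the compiler, and therefore Eigen, use the host CPU's SIMD instructions
CXXFLAGS += -march=native
# or target a specific baseline instead, e.g. CXXFLAGS += -mavx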