which I run on randomly generated data from a linear model with Gaussian noise.
I am actually running the Stan model for 400 iterations, rather than benchmarking the C++.
I don’t know if this will make a difference, but the include order is off in the C++: it includes Eigen/Dense before stan/math.hpp, which may instantiate our template metaprograms incorrectly. I’m not sure if that explains some of the speed difference, but it could.
And here are the plots for the function when I’ve commented out all of the check_finite implementations like so:
(void) function;
(void) name;
(void) y;
return;
// if (!value_of(y).allFinite()) {
//   for (int n = 0; n < y.size(); ++n) {
//     if (!(boost::math::isfinite)(y(n)))
//       domain_error_vec(function, name, y, n, "is ",
//                        ", but must be finite!");
//   }
// }
@seantalts, I have deleted them from the normal_id_glm_lpdf.hpp file.
Are you still comparing with the version where the checks are performed on each matrix element? (I am comparing with the version where the checks are done on the downstream quantity; not the data matrix.) I can try to do a comparison where the checks are on each matrix element.
I’m doing this on develop, whatever is on there. Maybe doing the check downstream gets rid of almost all of it, if that’s not on develop yet. Are there other functions called by normal_id_glm_lpdf that might themselves be calling check_finite? That could add to it too. I think that when trying to ascertain the legitimacy of profiling results, we need to consider all of the places that call check_finite as contributing to its profiling timings.
I’m running another benchmark now for the version with the old check_finite on the matrix. (That’s what’s still in develop.)
In all my comparisons, I have replaced the definition of all the checks with
inline void check(…) { }
so it should get rid of the checks everywhere in the code base.
I have additionally (this wasn’t necessary, but I was paranoid) removed all the checks from the normal_glm file.
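For concreteness, the no-op replacement looks roughly like this (a sketch with a simplified signature; the real checks in stan/math cover more argument shapes than this):

// Sketch of a no-op check (simplified signature). The (void) casts just
// silence unused-parameter warnings.
template <typename T_y>
inline void check_finite(const char* function, const char* name,
                         const T_y& y) {
  (void)function;
  (void)name;
  (void)y;
}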
I haven’t profiled this particular model recently, but I have in the past, and check_finite definitely showed up. I used a very unsophisticated profiling method: I ran under a debugger, paused repeatedly, and kept track of where I landed in the code.
Edit: @seantalts, here’s a graph comparing with the normal glm that is currently in develop. Rplots.pdf (8.3 KB)
I don’t see that the checks slow anything down.
The only thing I can imagine is that Tadej tested this for very large n and k, while I did k = 50 and n = 200–1000.
I can try to redo it for very large n and k, but it will take forever to run.
Also I would recommend getting a real profiler :P Windows has Intel VTune, which is supposed to be among the best.
[edit] One major reason being that with a real profiler you can compile with -O3 (and no debug symbols) and still get results. I believe you might need -fno-omit-frame-pointer on Linux, but Chandler Carruth claims this has very limited impact.
I now ran the model for n = 5000, k = 1250 and saw the following: Rplots.pdf (5.2 KB)
Very minor speed difference.
Looks like Tadej did n = 20 000, k = 5000. I don’t think my computer is fast enough to run that in a reasonable amount of time. I might try overnight. I find it difficult to imagine that the speed difference would suddenly get so big though.
I mean, this silly error check (which I have since removed) is of order n * k, and going from my sizes to Tadej’s multiplies n * k by 16, so I would expect the check to take 16 times as long. That suggests to me that the speed difference can at most get 16 times as large. Currently it’s on the order of 1.5%. So it could get as large as roughly 20% if only the error check ends up taking longer but nothing else does, which is a ridiculous assumption. In reality, the speed difference will be much smaller than 20%.
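Spelling that arithmetic out (my reading of the numbers, treating the 1.5% as the slowdown of develop relative to the no-check version):

$$
\frac{20000 \times 5000}{5000 \times 1250} = 16,
\qquad
16 \times 1.5\% = 24\%\ \text{of the no-check runtime}
\;\approx\; \frac{0.24}{1.24} \approx 20\%\ \text{of the new total}.
$$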
I don’t think computer performance is linear for a variety of reasons :P You can always try running the model end to end with only a few iterations. What’s your goal at this point? I think for me the primacy of profiling has held up so far and we’ve discovered a lot of ways to make functions like the GLMs faster. Looking forward we might ask if the is_finite flags that we’re talking about adding to Eigen types in https://github.com/stan-dev/math/pull/1014 are worth it, and I think for that purpose we’ll likely want to benchmark that specific solution against develop. I don’t think you’re obligated to run further with this checking thread if you don’t want to.
Rplots.pdf (5.2 KB)
n = 12000 and k = 3500.
Not quite @tadej’s n = 20000 and k = 5000, but I seem to run out of memory on my machine if I try any bigger.
@seantalts, yooo… just want to say thanks for the “Use Eigen’s allFinite to cut execution time…” PR! I finally got around to digging into the changes and it’s awesome. Looks like we can rethink some of our error checking, especially when it comes to Eigen.
@tadej, thanks for a clean performance test. I’d really like to see this get into the math repo somehow. @seantalts, do you agree? I don’t know if you saw this comment (no worries if you didn’t). You should remove the #include <Eigen/Dense> line or move it after #include <stan/math.hpp>. Including Eigen/Dense first will instantiate templates in a different order, so how the benchmark runs may not be representative of how Stan actually runs. Also, for readability, it’d help to include stan/math/rev/mat.hpp.
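Something along these lines, I think (a sketch of the ordering only, not the actual contents of the test file):

// Let the Stan Math header come first so its template specializations are
// seen before anything in Eigen gets instantiated.
#include <stan/math.hpp>   // or stan/math/rev/mat.hpp, as suggested above
// #include <Eigen/Dense>  // only if still needed, and only after stan/math.hpp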
Thanks, all, for showing me a benchmark where the check actually shows up in the runtime. We’ve hunted for an example for years and I knew there were cases out there. This one clearly shows up. (Just FYI, based on my timing, the other check_finite() calls don’t make a dent in the runtime on my config.)
@seantalts, at this point my intention is just to verify the finding using end-to-end tests. The reason for my caution is that, for me, the end-to-end tests and the C++ unit performance tests differed so much. It could just be that I was doing the C++ testing wrong while @tadej is doing it right. It drove home the point for me that testing C++ code for performance can be quite subtle because of the optimising compiler. I guess the idea is that the volatile keywords in @tadej’s code make sure the compiler does not optimise things away / reorder lines too much? Do you have any reference that explains why this works? I would just love to understand the general principles of how to benchmark this sort of C++.
@Matthijs We want to profile a particular function. The way I did it was by simply calling std::chrono::steady_clock before and after the function call of interest. We need to prepare the data before the function call and do something with the results after; otherwise the compiler might optimize the call away or even compute the results at compile time. I used volatile on both inputs and outputs to avoid that.
When I tested it with gcc, I got the same results even without volatile. However, I do not know whether that would still hold if I used a real profiler instead of calls to the clock.
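A minimal sketch of that kind of harness, in case it helps the discussion (the function being timed and the numbers are placeholders, not the actual performance-test code):

#include <chrono>
#include <iostream>

// Placeholder for the function under test (e.g. a call into stan::math);
// not the actual code from the performance test.
static double function_of_interest(double x) { return x * x + 1.0; }

int main() {
  // volatile input: the compiler can't treat the value as a compile-time
  // constant, so it can't fold the call away.
  volatile double input = 3.0;

  auto start = std::chrono::steady_clock::now();
  double result = function_of_interest(input);
  auto end = std::chrono::steady_clock::now();

  // volatile output: the result is observably "used", so the call can't be
  // eliminated as dead code.
  volatile double sink = result;
  (void)sink;

  auto elapsed =
      std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
  std::cout << elapsed.count() << " ns\n";
  return 0;
}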
Which finding? I’ll see what happens when I run the full model with large data and profile it just to assuage the fear that profiling could sometimes be inaccurate - I don’t want to return to the era of intuition and cargo cult.
I think it’s a valid concern that the volatile markings in @tadej’s benchmark (or other differences between that file and a full model) change what the compiler is able to optimize.
@seantalts, the finding that the input checks in the normal glm take up about 70% of the execution time, which was established using a C++ benchmark. I just wanted to be sure that end to end tests also find that.
I’m fully with you on wanting to take an empirical approach to writing fast code. However, I also want some verification that our methods of benchmarking C++ are reliable, as my own previous attempts seem to have been unreliable. Ultimately, what we care about are end-to-end Stan runs and how long they take. I just want to convince myself that the C++ benchmarks are indeed strongly correlated with the Stan runtime.
I think we could break down what you wrote about your goals into two discrete things with different action items:
1. Further improve performance of the CPU GLMs.
2. Ascertain best practices for improving performance more generally (specifically benchmarking).
You mention both, but I think you are actually trying to focus on the second one, right? We can focus on finding ways to microbenchmark that seem in some cases to reproduce end-to-end performance, but I think at the end of the day, when it comes time to make a decision, we’re still going to want to see end-to-end results just in case, probably?
Here is my end-to-end graph of the normal_id_glm_lpdf model, develop vs. the version with check_finite cut out as above, with n = 5000, k = 5000 (there’s not really a need for a graph here, but whatever):