Reflection on the v2.17.0 stan-dev/math performance bug

I’ve been wanting to write this down for a while. Just finding time and enough clarity now to do so.

tldr: my bad. We fixed it and released v2.17.1. We now have measures in place to keep an issue like this from slipping through again.

What happened

We’ve had an open issue, stan-dev/math#460, to replace const char* with std::string.

It was assigned to me and I had some time, so I submitted a pull request, stan-dev/math#584, fixing the issue. It passed all tests in math. Just to be clear, this didn’t introduce a bug where behavior was incorrect. It slowed everything down. The pull request was reviewed and merged.
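For anyone wondering how such a small-looking change can slow everything down, here's a made-up sketch (hypothetical names, not the actual stan-dev/math signatures) of the general issue: when a parameter goes from const char* to std::string, every call site that passes a string literal starts constructing a temporary string (copying the characters and possibly allocating), and these calls sit on very hot paths that run on every log density evaluation.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical illustration only -- not the actual stan-dev/math code.

// Before: a string literal decays to const char*; no copy, no allocation.
inline void check_positive_old(const char* function, const char* name, double y) {
  if (!(y > 0))
    throw std::domain_error(std::string(function) + ": " + name + " must be positive");
}

// After: passing a string literal now constructs a temporary std::string at
// every call site, even when the check passes and the strings are never used.
inline void check_positive_new(const std::string& function,
                               const std::string& name, double y) {
  if (!(y > 0))
    throw std::domain_error(function + ": " + name + " must be positive");
}

void caller(double sigma) {
  check_positive_old("normal_lpdf", "sigma", sigma);  // cheap when the check passes
  check_positive_new("normal_lpdf", "sigma", sigma);  // builds two std::strings per call
}
```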

The problem was that I should have checked the performance of this change before submitting the pull request. Unfortunately, our unit tests for math don’t cover performance. I should have noted on the pull request that there was a possible performance hit. I don’t recall whether I thought about that at the time (it was a large number of files changed at once, and I think I was just happy that I got every instance correct; this might have been my 3rd attempt at fixing the issue).

Up to that point, I had been monitoring the performance tests in the stan-dev/stan repo to make sure large performance changes didn’t go unnoticed. Shortly before that pull request, there were changes to the math library that legitimately changed performance, and this coincided with less involvement from me on the continuous integration front. To simplify the process, changes to the math library were also being merged into stan automatically, without a human (me) looking at the output. I should have been paying attention, but it slipped through, and the slowdown was released with v2.17.0.

Thanks to this post by @jjramsey, we identified the cause of the slowdown. @seantalts and @bgoodri stepped up and fixed the merge conflicts in git, and we were able to patch develop and release v2.17.1.

By the way, it’s still a good idea to make this change to keep the code cleaner. Hopefully a static std::string will have the same performance as const char*.
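The hope is something along these lines (again a made-up sketch, not actual stan-dev/math code): if the strings are function-local statics, they’re constructed once on first use, and every later call just passes references around instead of building temporaries.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical sketch of the static std::string idea.
inline void check_positive(const std::string& function,
                           const std::string& name, double y) {
  if (!(y > 0))
    throw std::domain_error(function + ": " + name + " must be positive");
}

void caller(double sigma) {
  static const std::string function("normal_lpdf");  // constructed once, on first call
  static const std::string name("sigma");
  check_positive(function, name, sigma);  // no temporaries on subsequent calls
}
```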

Why it won’t happen (in the same way) again

Jenkins is now set up to notify us when performance in stan-dev/stan changes by a large amount. Thanks, @seantalts, for setting that up! (I had tried for a while but never gotten it to work, which was why I was monitoring it by hand.)

What exactly is this testing? It runs a (poorly coded) logistic regression 100 times with the same seed and records the timing (roughly 1.3 sec per run). It checks that:

  1. the last draw matches hard-coded values
  2. each run produces the exact same last draw

It’s run on the same machine every day and on every stan-dev/stan change (including updates to the math submodule). Because it runs on the same machine with the same compiler, CPU, etc., it tells us that the results stay exactly the same between changes to the code (or they don’t and we find out), that we haven’t introduced randomness that destroys reproducibility, and it gives a rough estimate of performance. It’s not a perfect test, but it helps make sure we haven’t slipped on those things. Like a lot of things in Stan, it started out as a simple way to test timing, then evolved: we tacked on the output checks because we didn’t have an end-to-end test for that.
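Roughly, the check amounts to something like the sketch below. The command line, file names, and reference value are all made up, and the real harness in stan-dev/stan looks different, but the logic is the same: time each run, compare the last draw across runs, and compare it against a hard-coded reference.

```cpp
#include <chrono>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

// Return the last non-comment line of a Stan CSV output file, i.e. the last draw.
std::string last_draw(const std::string& csv_path) {
  std::ifstream in(csv_path);
  std::string line, last;
  while (std::getline(in, line))
    if (!line.empty() && line[0] != '#')
      last = line;
  return last;
}

int main() {
  // Hypothetical model binary, seed, and reference draw; the real values differ.
  const std::string cmd
      = "./logistic sample random seed=1234 output file=output.csv";
  const std::string expected = "<hard-coded reference draw>";  // placeholder
  const int runs = 100;

  std::string first;
  double total_seconds = 0;

  for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::steady_clock::now();
    if (std::system(cmd.c_str()) != 0)
      return 1;  // the model failed to run at all
    total_seconds += std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::string draw = last_draw("output.csv");
    if (i == 0)
      first = draw;
    else if (draw != first)
      return 2;  // same seed, different result: reproducibility is broken
  }

  if (first != expected)
    return 3;  // output drifted from the hard-coded reference

  std::cout << "mean seconds per run: " << total_seconds / runs << "\n";
  return 0;
}
```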

Things to think about

Here are some thoughts; perhaps others can chime in and contribute more tests:

  • we don’t have any automated performance tests at the stan-dev/math level; we should add some
  • the current tests are hard to update; it’d be nice if that were easier
  • perhaps we should bring back the BUGS tests; for a long time, we ran all the BUGS examples and used their estimated means to verify that our code didn’t break. We also used them as a crude performance test. When it was proposed to drop them (due to false positives), the current performance test was created so that we didn’t completely lose end-to-end testing
  • we had a thread about gold-master testing, but I think the conclusion was that that isn’t quite what we’re after. It’d be nice to introduce more automated end-to-end tests to prevent performance bugs like this one

Anyway, I just wanted to recap. I created the pull request that slowed everything down. I really should have mentioned the potential performance issue (even without tests), and I should have tracked the change through the whole process. But thankfully, we fixed it, and the fix has been released. More importantly, a performance regression like this one won’t sit in develop for long now, thanks to Jenkins.


Thanks for this post. I’m just catching up on all the threads I missed.

And indeed, thanks to @seantalts. Automating all this stuff is the key to scaling.

I’d very much like to do this in an automated way with a broader set of performance regression tests. I have no idea how to do it, but I think it’s important, like the autodiff tests.

Rather than the BUGS tests, which have bad parameterizations and are thus probably both biased and unpredictable, how about the verified models Michael’s defined in the test framework repo?

Just a small thought: I don’t think it’s necessary to tie performance and correctness tests together (as your posts seem to imply). You could just run a set of models with given data (possibly multiple datasets per model) with fixed seeds and measure wall-clock time (or n_eff/time) - this should be fairly easy to set up and give you good info. You could even create some nice performance graphs over time, so you can see how you’re doing - something like this: https://perf.rust-lang.org/index.html?start=&end=&absolute=true&stat=wall-time

They don’t need to be linked in the code.

They are tied methodologically because we are only interested in performance conditioned on correctness.

We have these: http://d1m1s1b1.stat.columbia.edu:8080/job/Daily%20Stan%20Performance/lastSuccessfulBuild/artifact/test/performance/performance.png

Not that pretty, but it’s enough to get a sense of what’s going on.