Possible performance regression

Hey all,

I’ve been working with Nic (dev ops contractor) to improve the performance regression testing for develop and PRs and so that system has been a little unstable over the past week or two. But part of that instability seems to have been a legitimate performance regression of about 14% on at least this model https://github.com/stan-dev/example-models/blob/master/bugs_examples/vol2/schools/schools-4.stan that happened sometime after cmdstan commit eda879ea7fda18 (as in, that was a commit where performance was good and d09bd7db87dc83b838d51ee5f6c79434a3923fe9 is a commit where performance is 14% worse for that model and some of the other models in that directory.

Here is how you can try to reproduce this locally:

git clone --recursive https://github.com/stan-dev/performance-tests-cmdstan
cd performance-tests-cmdstan
./compare-git-hashes.sh example-models/bugs_examples/vol2/schools/ develop eda879ea7fda18 false false

Does anyone have time to dig into this? Help would be appreciated in any or all of: reproducing the finding, finding out which commit caused this, figuring out why that commit caused it, and fixing it :D


Is there an easy way to isolate the PR where this landed to be able to review what changed? Or do we have to backtrack through commits for some reason?

It seems likely it was the one right after, but we were messing with the job so much that I don’t totally trust it. The one right after was https://github.com/stan-dev/performance-tests-cmdstan/commit/e424b29baa23d94ae6ff862ae3683f5df58b004c which looks like the recent ODE PR from @wds15. One thing to do would be to run the last known good cmdstan hash, eda879ea7fda18, against this hash, e424b29baa23d94ae6ff862ae3683f5df58b004c, with this command:

git clone --recursive https://github.com/stan-dev/performance-tests-cmdstan
cd performance-tests-cmdstan
./compare-git-hashes.sh example-models/bugs_examples/vol2/schools/ eda879ea7fda18 e424b29baa23d94ae6ff862ae3683f5df58b004c false false

But I won’t be able to get to it until Friday at the earliest.

I was able to reproduce it on the schools4.stan example:

 Elapsed Time: 319.341 seconds (Warm-up)
               197.179 seconds (Sampling)
               516.521 seconds (Total)
 Elapsed Time: 284.337 seconds (Warm-up)
               177.94 seconds (Sampling)
               462.277 seconds (Total)

Thanks for catching this…sounds like I should investigate…

Looking at the changeset of The Pr it only touched ode files…so it’s a bit weird to see this impact…need to check further…

@rok_cesnovar thanks for reproducing, seems like we definitely have to look into this before the next release.

@wds15 It may not be that PR, that just happened to be the next one when it started failing, but like I mentioned we were messing with this Jenkins job a lot so I don’t totally trust it. We should be able to sort of git bisect our way between develop and eda879ea (via merge commits only) and figure out which PR caused it…

I left cmdstan on develop and just moving stan-math a few commits back.

With Stan Math on d3254859c9cb13663ea08e1d842319efd2de6cc7:

 Elapsed Time: 286.831 seconds (Warm-up)
               179.54 seconds (Sampling)
               466.371 seconds (Total)

Its only one run but I am assuming this is before the regression. There werent a lot of merges after this one.
EDIT: werent

After the TLSv5 merge with 5e76d67f378e29a5aa41c6f921b7af3185696793 we get the regressed performance:

 Elapsed Time: 327.117 seconds (Warm-up)
               199.453 seconds (Sampling)
               526.57 seconds (Total)

So its either 1212, 1072 or 1066 merge. Going to run it on the commit before the TLSv5 merge tomorrow morning if that would help.

1 Like

I confirmed the suspicion:

Stan Math on ec52b8bec098013eb907f947df821b6a62950b5a (merge of #1072 - ode time varying)
Elapsed Time: 285.212 seconds (Warm-up)
               178.563 seconds (Sampling)
               463.775 seconds (Total)

Stan Math on 5e76d67f378e29a5aa41c6f921b7af3185696793 (merge of #1212 - TLSv5)
 Elapsed Time: 317.544 seconds (Warm-up)
               196.51 seconds (Sampling)
               514.054 seconds (Total)

EDIT: Both were run with

example-models/bugs_examples/vol2/schools/schools-4 method=sample data file=example-models/bugs_examples/vol2/schools/schools.data.R random seed=1234
1 Like

That’s not nice. What OS and compiler is this on?

As it seems that the TLSv5 is the culprit I can look into things - so thanks for spotting this. There was a last minute change and a deliberate design decision which come to my mind which I can follow up on.

@seantalts Do you know if this is the only model or are there more models affected by this? If possible I would like to understand what models are affected and was hoping there is an overview on model-performance.

I am running this on Ubuntu with g++ 7.3.0. Re-running now with clang. If you have an idea on what could be the cause and its easy to change let me know. I have the performance tests set up.

There are two things which come to my mind:

  1. I reverted just before merging the v5 the access to the AD tree instance from using the raw pointer towards using a method. You can revert this by undoing two commits which are hopefully easy to find. Doing so will avoid having to rely on compiler optimisation for fast AD instance access.

  2. The AD instance is held as a pointer to a global object. For the non-threaded case you could eliminate the need for a pointer. This means a few more ifdefs in the autodiffstackstorage.hpp. Before this PR we did not use pointers in the non-threaded case.

I hope that makes sense.

Its the same story with Clang. Tried reverting those commits but no luck. Will try a more systematic approach later.

1 Like

The typical route here is to just revert the whole PR atomically so that develop stays nice while we figure out how to get the PR in. I’ll submit a PR to revert 5e76d67f378e29a5aa41c6f921b7af3185696793 now and we can have another PR with that revert reverted along with whatever fixes it needs.

From our ragged collection of models that we run with the performance test suite, it’s all but one of the models in this directory: example-models/bugs_examples/vol2/schools. The schools-2 model in that directory seems unaffected. There may be other ones but as I mentioned we were messing around with the performance tests during that time. The revert PR will likely highlight others.

revert PR: https://github.com/stan-dev/math/pull/1244

@rok_cesnovar thank you very much for isolating the commit hash that caused this performance regression! Hopefully we can figure out what caused that and then figure out how to deal with it somehow so we still get threading on Windows. We may discover that we just don’t care about whatever pattern those models are using, haha.

Your instincts were correct. More on #1244

1 Like

For the record: On my macOS system with a clang 6.0 compiler I get


 Elapsed Time: 242.328 seconds (Warm-up)
               153.76 seconds (Sampling)
               396.088 seconds (Total)

and for 5e76d67f378e29a5aa41c6f921b7af3185696793 it is:

 Elapsed Time: 253.73 seconds (Warm-up)
               155.496 seconds (Sampling)
               409.226 seconds (Total)

At least on my system I cannot really confirm such a bad impact. Reverting kicks threading for Windows out of the boat… we should have stronger evidence against the TLSv5 PR, I think, before we actually revert.

and when I revert the commits 49d3583 and ace8424, then I endup with

 Elapsed Time: 231.003 seconds (Warm-up)
               145.117 seconds (Sampling)
               376.12 seconds (Total)

So this version holds the AD tape in a pointer, but we are not using a method to access it.