I’ve been working with Nic (a DevOps contractor) to improve the performance regression testing for develop and PRs, so that system has been a little unstable over the past week or two. But part of that instability seems to have been a legitimate performance regression of about 14% on at least this model: https://github.com/stan-dev/example-models/blob/master/bugs_examples/vol2/schools/schools-4.stan. It happened sometime after cmdstan commit eda879ea7fda18 (that is, eda879ea7fda18 is a commit where performance was good, and d09bd7db87dc83b838d51ee5f6c79434a3923fe9 is a commit where performance is 14% worse for that model and for some of the other models in that directory).
Here is how you can try to reproduce this locally:
Does anyone have time to dig into this? Help would be appreciated in any or all of: reproducing the finding, finding out which commit caused this, figuring out why that commit caused it, and fixing it :D
Is there an easy way to isolate the PR where this landed to be able to review what changed? Or do we have to backtrack through commits for some reason?
It seems likely it was the commit right after, but we were messing with the job so much that I don’t totally trust that. The next commit was https://github.com/stan-dev/performance-tests-cmdstan/commit/e424b29baa23d94ae6ff862ae3683f5df58b004c, which looks like the merge of the recent ODE PR from @wds15. One thing to do would be to run the last known good cmdstan hash, eda879ea7fda18, against this hash, e424b29baa23d94ae6ff862ae3683f5df58b004c, with this command:
@rok_cesnovar thanks for reproducing, seems like we definitely have to look into this before the next release.
@wds15 It may not be that PR; it just happened to be the next one in line when the job started failing, but as I mentioned, we were messing with this Jenkins job a lot, so I don’t totally trust it. We should be able to more or less git bisect our way between develop and eda879ea (testing merge commits only) and figure out which PR caused it…
Stan Math on ec52b8bec098013eb907f947df821b6a62950b5a (merge of #1072, ODE time-varying):

    Elapsed Time: 285.212 seconds (Warm-up)
                  178.563 seconds (Sampling)
                  463.775 seconds (Total)

Stan Math on 5e76d67f378e29a5aa41c6f921b7af3185696793 (merge of #1212, TLSv5):

    Elapsed Time: 317.544 seconds (Warm-up)
                  196.51 seconds (Sampling)
                  514.054 seconds (Total)
EDIT: Both were run with:
example-models/bugs_examples/vol2/schools/schools-4 method=sample data file=example-models/bugs_examples/vol2/schools/schools.data.R random seed=1234
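For reference, the relative slowdown implied by the two Total times quoted above can be computed directly (plain arithmetic on the numbers from this post; the ~14% figure earlier in the thread came from the Jenkins job, which may run on different hardware):

```shell
# Relative slowdown implied by the two Total elapsed times above:
# 463.775 s (pre-TLSv5 merge) vs 514.054 s (TLSv5 merge).
old=463.775
new=514.054
pct=$(awk -v a="$old" -v b="$new" 'BEGIN { printf "%.1f", (b - a) / a * 100 }')
echo "slowdown: ${pct}%"   # -> slowdown: 10.8%
```

So roughly an 11% slowdown on this machine, in the same ballpark as the ~14% reported from the Jenkins job.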
Since it seems that the TLSv5 PR is the culprit, I can look into it - thanks for spotting this. A last-minute change and a deliberate design decision both come to mind, and I can follow up on each.
@seantalts Do you know if this is the only model affected, or are there more? If possible I would like to understand which models are affected; I was hoping there is an overview of per-model performance.
I am running this on Ubuntu with g++ 7.3.0 and am re-running now with clang. If you have an idea of what the cause could be and it’s easy to change, let me know - I have the performance tests set up.
Just before merging the v5, I changed access to the AD tree instance from using the raw pointer to using a method. You can revert this by undoing two commits, which should be easy to find. Doing so avoids having to rely on compiler optimisation for fast AD instance access.
The AD instance is held as a pointer to a global object. For the non-threaded case you could eliminate the need for a pointer; this means a few more ifdefs in autodiffstackstorage.hpp. Before this PR we did not use pointers in the non-threaded case.
The typical route here is to revert the whole PR atomically so that develop stays clean while we figure out how to get the PR back in. I’ll submit a PR now to revert 5e76d67f378e29a5aa41c6f921b7af3185696793, and we can follow up with another PR that reverts the revert along with whatever fixes it needs.
From the ragged collection of models that we run with the performance test suite, it’s all but one of the models in this directory: example-models/bugs_examples/vol2/schools. The schools-2 model in that directory seems unaffected. There may be others, but as I mentioned, we were messing around with the performance tests during that time. The revert PR will likely highlight any more.
@rok_cesnovar thank you very much for isolating the commit hash that caused this performance regression! Hopefully we can figure out what caused that and then figure out how to deal with it somehow so we still get threading on Windows. We may discover that we just don’t care about whatever pattern those models are using, haha.
At least on my system I cannot really confirm such a bad impact. Reverting kicks threading for Windows out of the boat… I think we should have stronger evidence against the TLSv5 PR before we actually revert.