15-20% ish performance regression

Won’t have time to dig into it today, but there’s a new performance regression between cmdstan’s current develop (ca910a0b5cb97bc87f14a5f50d2dff226b024047) and the commit immediately prior, 8f218b6c0584af995ed7e48faa8408d03cb040ee.

Anyone want to take a look? You can verify yourself with

git clone --recursive https://github.com/stan-dev/performance-tests-cmdstan
cd performance-tests-cmdstan
bash compare-git-hashes.sh develop 8f218b6c0584af995ed7e48faa8408d03cb040ee stat_comp_benchmarks/benchmarks/arK --runs 10 -j8


I am afraid that this is due to the threading stuff which went in. Thanks much for this performance test suite. I have to dive into this and use it more regularly.

This is a bit surprising as all what is happening is that a global static variable became a global static variable which is part of a static function. I thought this is easy for a compiler to get around.

So either we back the thread_local thing out / we fix it / we live with it.

Puhh… looks like we can we fix this and possibly even gain 9% performance… but please have a look at this pull. I have tested this under clang and would expect gcc to just work, but travis will tell us.

Is there a way to check these before the merges that cause bad performance?

I think we need to ifdef it out if it’s going to cause a 10% slowdown in the rest of the code.

Is there a way to independently test the two different things;

  • thread local
  • using staticl local variable in a function

The tests I ran were just the static Context() static local instance call, not thread local.

And yeah, it’s possible to run these on PRs. I think that’d take a few days to a week to set up, and I probably won’t have time in the next couple of weeks between the course and recruiting / interviewing. We could make a “new dev friendly” github issue for it though! It just requires a little git fu to deal with the submodules, but someone should be able to do it without knowing the C++ codebase much at all.

I just made a separate daily performance test and script out of it because people seemed to think we couldn’t get models or performance to line up meaningfully. But if you want to go whole hog and put it into testing every PR, I think that’s a fine idea.

The static local variable in a function is the issue right now. With that if-defing does not help - we need to go back to the old way or find other ways.

Should my current pull not work (multiple translation issues), then I am not sure if there is a quick solution. I will look more later today.

While I did run brief benchmarks these were obviously not enough. The framework from Sean is really great and I will use it from now on. I don’t think these test should go into the testing machinery as a default, but we should probably request these for performance critical things. The only issue with the framework is that you can only refer to cmdstan hashes, but not stan-math hashes. This can be managed with some manual workarounds, but the ability to specify stan-math hashes to compare would be even better - but someone needs to find the time to work through this.