The Stan Performance test only passes on a specific compiler + OS combination. Anyone mind if I pin it to the Jenkins node where it passes?
We’ll get an upgraded version up soon.
As with lots of things, a quick implementation that was meant to be very narrow in scope is now being used for more than its original intent. But it has saved us from releasing some bad bugs, and that’s why it’s still there.
I’ve been writing up a wiki for gold master tests; need to just finish it out and push it. Then implement it and replace these tests.
I’m reviving this thread again. It’s coming up here:
@seantalts, I’m thinking the right way to do this is to have a test suite outside of the Stan repo. What do you think? The alternative is to run it under one git hash, then update and run again in a new git hash.
Are you talking about replacing the performance test with gold tests and then having those tests live in a separate repo? The killer feature (at least in the places I’ve worked) of gold tests is that you have them in the same repo so you can see exactly when your API changed and what code caused it.
Has anyone ever followed up on the reported issue with 2.17 being slower than 2.16? @mitzimorris verified the problem wasn’t in the code generation, which hasn’t changed.
Yes. But maybe that’s because I’m not clever enough to reason about this. Here’s what I’m thinking about:
- we need to specify two git hashes. Let’s call them A and B.
- we want to check out the repo at A, run the tests, stash the output of the test, then do the same for B
- while we do that, we want to time the tests
- we then want to compare the output of A and B, ignoring differences in things we know should differ (e.g., timing numbers reported in the output)
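The comparison step could be as simple as filtering out the lines we expect to differ before diffing. A sketch in Python (the ignore patterns here are made up for illustration, not Stan’s actual output format):

```python
import re

# Lines we expect to differ between runs (timings etc.).
# These patterns are illustrative placeholders.
IGNORE = [
    re.compile(r"elapsed time", re.IGNORECASE),
    re.compile(r"seconds"),
]

def normalize(text):
    """Drop lines matching any ignore pattern so diffs only show real changes."""
    return [line for line in text.splitlines()
            if not any(p.search(line) for p in IGNORE)]

def same_output(a, b):
    """True if outputs a and b agree after filtering volatile lines."""
    return normalize(a) == normalize(b)
```

The git-checkout-and-run part on either side of this is just orchestration; the filtering is the piece that decides what counts as a real difference.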
For a particular OS / compiler / compiler options / git hash, we should be able to save the output of the test, but for any other combination, it doesn’t apply. That’s where I thought it made sense to be outside of the repo.
Thoughts? (I have most of this written in a draft wiki page; it’s hard writing about this. Can anyone point me to any resources about this sort of testing?)
I’m not really sure what you’re describing - that doesn’t sound like the gold tests I’ve heard of or used. The ones from my experience are very simple: you run some commands that produce text output, and you check that output into the repo. You need a command for replacing the checked-in output with whatever is currently generated (to be reviewed by humans on each commit and PR), and then the tests just compare the current text output against the checked-in text output.
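For what it’s worth, that whole mechanism fits in a few lines of Python (the directory layout and function names here are made up, not from any existing script):

```python
import subprocess
import sys
from pathlib import Path

GOLD_DIR = Path("gold")  # expected outputs, checked into the repo


def run_case(cmd):
    """Run one command and return its text output."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout


def check(name, cmd, update=False):
    """Compare a command's output against its checked-in gold file.

    With update=True, overwrite the gold file instead; a human then
    reviews the resulting diff on the commit / PR.
    """
    gold = GOLD_DIR / (name + ".gold")
    actual = run_case(cmd)
    if update:
        gold.parent.mkdir(parents=True, exist_ok=True)
        gold.write_text(actual)
        return True
    return actual == gold.read_text()
```

The `update` path is the "regenerate" command; everything else is a plain text comparison.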
That’s why I didn’t think they were gold tests at first.
But… that’s what we need to do if we want to verify that things work on a machine across git hashes, right? There might be a simpler way. The issue is that behavior conditioned on a seed is not guaranteed across platforms. Even if the same RNG is used, floating point arithmetic kills it. (There was a time when we had identical output for Windows and Mac given the same seed. We changed the seed and they drifted. Unit tests still pass.)
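The floating-point point is easy to demonstrate even without an RNG - addition isn’t associative in double precision, so any reordering (e.g., from different compilers or optimization levels) can change results:

```python
# Classic example: the grouping of additions changes the result in doubles.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # 1.0 is absorbed when added to -1e16 first -> 0.0

print(left, right)   # 1.0 0.0
```

Sums over many terms (log densities, gradients) accumulate exactly this kind of reordering sensitivity.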
Huh. I recuse myself from the discussion, haha. I have no knowledge there and it sounds very complicated.
I think you’re talking about the same thing. Rather than all this git hash dancing, you want to store a version of what it should be in the repo, then keep testing against that. That’s the gold test.
Then we’re talking about speed regression tests. There, you have a target of what it means to succeed and timings from previous runs. That’s the regression test.
Ok. We have most of that now. We need a way to record new output easily and then we’re set.
Our gold master tests would be machine / OS / compiler / compiler-option dependent. That was the problem I was attempting to solve, and what I believe I was asked to solve. If that isn’t true, then it’s an easier problem and only minor modifications are needed (still some work, but it’s almost right).
It took me way too long to put this up:
It’s just a draft.
But… based on what I’ve heard recently on this thread, what we have is a decent start for gold master testing. And it needs to live on one machine.
I’m taking a crack at a Python script to do gold tests the way we did them at a previous job, and I think I have a good starting point but I’m having trouble getting the compilers to generate code that does the same thing on different machines. Does anyone know if there are ways to force math to be the same at the expense of speed, perhaps?
In my version I wanted to check the gold files into git so we have a meaningful record of what changes over time with each version. We could check in the version generated by a specific Jenkins machine, but that makes it impossible, or at least very difficult, to run the tests locally in a meaningful way (the same problem exists with our current performance test).
99% sure that’s not a thing. We already do what we can so that things are reproducible within a single machine, but different compiler versions are free to generate different code (among other things).
If I recall correctly, the problem was fixed here: https://github.com/stan-dev/math/issues/667
I certainly can’t find much about it, though people think standardizing on SSE2 and that kind of thing helps a lot. I’m pretty curious how our unit tests test specific numeric outputs but the end-to-end stuff isn’t reproducible… What parts of the full inference pipeline are leading to this divergence?
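For reference, the flags people usually reach for when trying to pin down floating point are along these lines - a gcc/clang sketch, with no guarantee they’re sufficient here (and at some cost to speed):

```make
# Hypothetical make fragment: trade speed for more predictable FP.
# -msse2 -mfpmath=sse : avoid x87 80-bit intermediates (32-bit x86)
# -ffp-contract=off   : no fused multiply-add contraction
# -fno-fast-math      : keep IEEE semantics (usually the default; explicit here)
CXXFLAGS += -msse2 -mfpmath=sse -ffp-contract=off -fno-fast-math
```

None of this constrains what different compiler versions choose to generate, which is the harder part of the problem.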
Our unit tests have tolerances (?)
The unit test tolerances are really pretty tight, whereas with the end-to-end stuff things can be off by many thousands (1e-8 vs 1e4).
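If the gold comparison needed to be tolerance-aware rather than exact-text, one sketch (the number regex and the tolerance are placeholders, not Stan’s actual output format):

```python
import math
import re

# Matches integers, decimals, and scientific notation (illustrative).
NUM = re.compile(r"-?\d+\.?\d*(?:[eE][+-]?\d+)?")


def close_enough(a_line, b_line, rel_tol=1e-8):
    """Compare two output lines, letting numbers differ within rel_tol.

    Non-numeric text must match exactly; numbers are compared with a
    relative tolerance.
    """
    a_nums = [float(x) for x in NUM.findall(a_line)]
    b_nums = [float(x) for x in NUM.findall(b_line)]
    if len(a_nums) != len(b_nums):
        return False
    same_text = NUM.sub("#", a_line) == NUM.sub("#", b_line)
    return same_text and all(math.isclose(x, y, rel_tol=rel_tol)
                             for x, y in zip(a_nums, b_nums))
```

Of course, picking the tolerance is exactly the hard part: tight enough to catch the 1e4-sized regressions, loose enough to absorb platform noise.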