The Stan Performance test only passes on a specific compiler + OS combination. Anyone mind if I pin it to the Jenkins node where it passes?
We’ll get an upgraded version up soon.
As with lots of things, a quick implementation that was meant to be very narrow in scope is now being used for more than its original intent. But it has saved us from releasing some bad bugs, and that’s why it’s still there.
I’ve been writing up a wiki for gold master tests; need to just finish it out and push it. Then implement it and replace these tests.
I’m reviving this thread again. It’s coming up here:
@seantalts, I’m thinking the right way to do this is to have a test suite outside of the Stan repo. What do you think? The alternative is to run it under one git hash, then update and run again in a new git hash.
Are you talking about replacing the performance test with gold tests and then having those tests live in a separate repo? The killer feature (at least in the places I’ve worked) of gold tests is that you have them in the same repo so you can see exactly when your API changed and what code caused it.
Has anyone ever followed up on the reported issue with 2.17 being slower than 2.16? @mitzimorris verified the problem wasn’t in the code generation, which hasn’t changed.
Yes. But maybe that’s because I’m not clever enough to reason about this. Here’s what I’m thinking about:
- we need to specify two git hashes. Let’s call them A and B.
- we want to check out the repo at A, run the tests, stash the output of the test, then do the same for B
- while we do that, we want to time the tests
- we then want to compare the output of A and B, ignoring differences in things we know should differ (e.g., timing numbers reported in the output)
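The comparison step could be as simple as filtering out the lines we expect to differ before diffing. A sketch in Python (the ignore patterns here are made up for illustration, not Stan’s actual output format):

```python
import re

# Lines we expect to differ between runs (timings etc.).
# These patterns are illustrative placeholders.
IGNORE = [
    re.compile(r"elapsed time", re.IGNORECASE),
    re.compile(r"seconds"),
]

def normalize(text):
    """Drop lines matching any ignore pattern so diffs only show real changes."""
    return [line for line in text.splitlines()
            if not any(p.search(line) for p in IGNORE)]

def same_output(a, b):
    """True if outputs a and b agree after filtering volatile lines."""
    return normalize(a) == normalize(b)
```

The git-checkout-and-run part on either side of this is just orchestration; the filtering is the piece that decides what counts as a real difference.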
For a particular OS / compiler / compiler options / git hash, we should be able to save the output of the test, but for any other combination, it doesn’t apply. That’s where I thought it made sense to be outside of the repo.
Thoughts? (I have most of this written in a draft wiki page; it’s hard writing about this. Can anyone point me to any resources about this sort of testing?)
I’m not really sure what you’re describing - that doesn’t sound like the gold tests I’ve heard of or used. The ones from my experience are very simple: you run some commands that produce text output, and you check that output into the repo. You need a command for replacing the checked-in output with whatever is currently generated (to be reviewed by humans on each commit and PR), and then the tests just compare the current text output against the checked-in text output.
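For what it’s worth, that whole mechanism fits in a few lines of Python (the directory layout and function names here are made up, not from any existing script):

```python
import subprocess
import sys
from pathlib import Path

GOLD_DIR = Path("gold")  # expected outputs, checked into the repo


def run_case(cmd):
    """Run one command and return its text output."""
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout


def check(name, cmd, update=False):
    """Compare a command's output against its checked-in gold file.

    With update=True, overwrite the gold file instead; a human then
    reviews the resulting diff on the commit / PR.
    """
    gold = GOLD_DIR / (name + ".gold")
    actual = run_case(cmd)
    if update:
        gold.parent.mkdir(parents=True, exist_ok=True)
        gold.write_text(actual)
        return True
    return actual == gold.read_text()
```

The `update` path is the "regenerate" command; everything else is a plain text comparison.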
That’s why I didn’t think they were gold tests at first.
But… that’s what we need to do if we want to verify that things work on a machine across git hashes, right? There might be a simpler way. The issue is that behavior conditioned on a seed is not guaranteed across platforms. Even if the same RNG is used, floating point arithmetic kills it. (There was a time when we had identical output for Windows and Mac given the same seed. We changed the seed and they drifted. Unit tests still pass.)
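The floating-point point is easy to demonstrate even without an RNG - addition isn’t associative in double precision, so any reordering (e.g., from different compilers or optimization levels) can change results:

```python
# Classic example: the grouping of additions changes the result in doubles.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # 0.0 + 1.0 -> 1.0
right = a + (b + c)  # 1.0 is absorbed when added to -1e16 first -> 0.0

print(left, right)   # 1.0 0.0
```

Sums over many terms (log densities, gradients) accumulate exactly this kind of reordering sensitivity.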
Huh. I recuse myself from the discussion, haha. I have no knowledge there and it sounds very complicated.
I think you’re talking about the same thing. Rather than all this git hash dancing, you want to store a version of what it should be in the repo, then keep testing against that. That’s the gold test.
Then we’re talking about speed regression tests. There, you have a target of what it means to succeed and timings from previous runs. That’s the regression test.
Ok. We have most of that now. We need a way to record new output easily and then we’re set.
Our gold master tests would be machine / OS / compiler / compiler-option dependent. That was the problem I was attempting to solve, and what I believe I was asked to solve. If that isn’t true, then it’s an easier problem and only minor modifications are needed (still some work, but it’s almost right).
It took me way too long to put this up:
It’s just a draft.
But… based on what I’ve heard recently on this thread, what we have is a decent start for gold master testing. And it needs to live on one machine.
I’m taking a crack at a Python script to do gold tests the way we did them at a previous job, and I think I have a good starting point but I’m having trouble getting the compilers to generate code that does the same thing on different machines. Does anyone know if there are ways to force math to be the same at the expense of speed, perhaps?
In my version I wanted to check the gold files into git so we have a meaningful record of what changes over time with each version. We could check in the version generated by a specific Jenkins machine, but that makes it impossible, or at least very difficult, to run the tests locally in a meaningful way (the same problem exists with our current performance test).
99% sure that’s not a thing. We already do what we can so that things are reproducible within a single machine, but different compiler versions are free to generate different code (among other things).
If I recall correctly, the problem was fixed here: https://github.com/stan-dev/math/issues/667
I certainly can’t find much about it, though people think standardizing on SSE2 and that kind of thing helps a lot. I’m pretty curious how our unit tests test specific numeric outputs but the end-to-end stuff isn’t reproducible… What parts of the full inference pipeline are leading to this divergence?
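For reference, the flags people usually reach for when trying to pin down floating point are along these lines - a gcc/clang sketch, with no guarantee they’re sufficient here (and at some cost to speed):

```make
# Hypothetical make fragment: trade speed for more predictable FP.
# -msse2 -mfpmath=sse : avoid x87 80-bit intermediates (32-bit x86)
# -ffp-contract=off   : no fused multiply-add contraction
# -fno-fast-math      : keep IEEE semantics (usually the default; explicit here)
CXXFLAGS += -msse2 -mfpmath=sse -ffp-contract=off -fno-fast-math
```

None of this constrains what different compiler versions choose to generate, which is the harder part of the problem.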
Our unit tests have tolerances (?)
The unit test tolerances are really pretty tight, whereas with the end-to-end stuff things can be off by many thousands (1e-8 vs 1e4).
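If the gold comparison needed to be tolerance-aware rather than exact-text, one sketch (the number regex and the tolerance are placeholders, not Stan’s actual output format):

```python
import math
import re

# Matches integers, decimals, and scientific notation (illustrative).
NUM = re.compile(r"-?\d+\.?\d*(?:[eE][+-]?\d+)?")


def close_enough(a_line, b_line, rel_tol=1e-8):
    """Compare two output lines, letting numbers differ within rel_tol.

    Non-numeric text must match exactly; numbers are compared with a
    relative tolerance.
    """
    a_nums = [float(x) for x in NUM.findall(a_line)]
    b_nums = [float(x) for x in NUM.findall(b_line)]
    if len(a_nums) != len(b_nums):
        return False
    same_text = NUM.sub("#", a_line) == NUM.sub("#", b_line)
    return same_text and all(math.isclose(x, y, rel_tol=rel_tol)
                             for x, y in zip(a_nums, b_nums))
```

Of course, picking the tolerance is exactly the hard part: tight enough to catch the 1e4-sized regressions, loose enough to absorb platform noise.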