@seantalts, this is for you.
A long time ago, we ran all the BUGS examples to test for a few things:
- test for compilation (we don’t need this now)
- check that we didn’t break things between commits (end-to-end checks caught a lot of regressions that unit tests missed)
- get an overall sense of speed (we ran this on Jenkins and just tracked overall runtime; small slips in the gradients that passed unit tests would often get caught here)
We eventually removed them because there were too many false positives, and @betanalpha had a framework for testing correctness in a different way. They still provided some value, though, since we don’t really have end-to-end tests now.
Here’s an example test:
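(The original example didn’t survive in this thread, but a minimal sketch of such an end-to-end test might look like the following: parse CmdStan-style CSV output, where comment lines start with `#`, and check a posterior mean against a known value. The `fake_output` string and `summarize_draws` helper are illustrative assumptions, not the actual test.)

```python
import csv
import io
import statistics

def summarize_draws(csv_text, param):
    """Parse CmdStan-style CSV output (comment lines start with '#')
    and return the posterior mean of one parameter column."""
    lines = [ln for ln in csv_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    draws = [float(row[param]) for row in reader]
    return statistics.mean(draws)

# Hypothetical end-to-end check: after running a model, assert the
# posterior mean of a parameter is near its known value.
fake_output = "# stan_version=...\nlp__,theta\n-1.0,0.5\n-1.1,0.7\n-0.9,0.6\n"
assert abs(summarize_draws(fake_output, "theta") - 0.6) < 1e-9
```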
Maybe we should start testing something like this and tracking performance?
Cool! I have my Python script building and running these models on their data and recording summarized output in a diffable format.
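(One way to make summarized output diffable is to round values to a few significant figures so that text diffs ignore run-to-run floating-point noise; this is a sketch of that idea, not necessarily the format the script actually writes.)

```python
def to_diffable(summary, sig_figs=3):
    """Format parameter summaries with limited precision and a stable
    key order so that plain text diffs are insensitive to tiny
    floating-point differences between runs."""
    lines = []
    for name in sorted(summary):
        lines.append(f"{name} {summary[name]:.{sig_figs}g}")
    return "\n".join(lines) + "\n"

print(to_diffable({"theta": 0.6012345, "lp__": -7.31182}))
```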
My branch with runPerformanceTests.py is here: https://github.com/stan-dev/cmdstan/tree/perf
In this branch, I add the example-models repo as a submodule, use some slight trickery to find model / data file pairs, then compile and run each pair, recording the output to tests/golds (which I was hoping to check in) and writing a timing file to times.csv (not checked in, but always generated by Jenkins on a specific machine).
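(The model/data pairing trickery could look something like this sketch: walk the example-models tree and pair each `.stan` file with a same-named `.data.R` file. The exact rules the script uses may differ; this is just the obvious convention.)

```python
import os

def find_model_data_pairs(root):
    """Walk a directory tree and pair each .stan file with a
    same-named .data.R file sitting in the same directory."""
    pairs = []
    for dirpath, _, filenames in os.walk(root):
        for fn in filenames:
            if fn.endswith(".stan"):
                stem = fn[: -len(".stan")]
                data = stem + ".data.R"
                if data in filenames:
                    pairs.append((os.path.join(dirpath, fn),
                                  os.path.join(dirpath, data)))
    return sorted(pairs)
```

Models without a matching data file (or vice versa) simply get skipped, which matches the "best effort" spirit of running whatever pairs can be found.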
I’m not sure how useful the golds will be if we can’t figure out some way to get reproducibility, at least for a reasonable pair like clang + OS X.
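(If bit-level reproducibility across compilers and OSes turns out to be unachievable, one fallback is to compare summary values to golds within a relative tolerance instead of diffing text exactly. This is an assumption about a possible workaround, not what the branch currently does; the helper name and tolerance are made up.)

```python
def matches_gold(new, gold, rel_tol=1e-4):
    """Return True if every value in `new` agrees with the stored gold
    value to within a relative tolerance, and the parameter sets match."""
    if set(new) != set(gold):
        return False
    return all(
        abs(new[k] - gold[k]) <= rel_tol * max(abs(gold[k]), 1e-8)
        for k in gold
    )

assert matches_gold({"theta": 0.60003}, {"theta": 0.6})
assert not matches_gold({"theta": 0.7}, {"theta": 0.6})
```

The tradeoff is that a tolerance loose enough to absorb compiler differences may also hide small real regressions, which is exactly the false-positive/false-negative tension that got the old BUGS runs removed.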
PS What’s the other framework @betanalpha had? Curious about how to do any kind of portable regression testing here…