Old end-to-end model tests

@seantalts, this is for you.

A long time ago, we used to run all the BUGS examples to test for a few things:

  1. test for compilation (we don’t need this now)
  2. check that we didn’t break things between commits (we found a lot of things by checking end-to-end that we missed with unit tests)
  3. get an overall sense of speed (we ran this on Jenkins and just tracked the overall runtime; little slips in gradients that passed tests would often get caught here)

We removed them because they produced too many false positives, and @betanalpha had a framework for testing correctness in a different way. They still provided some value, though, since we don’t really have end-to-end tests now.

Here’s an example test:

Maybe we should start testing something like this and tracking performance?

Cool! I have my Python script building and running these models on their data and recording summarized output in a diffable format.

My branch with runPerformanceTests.py is here: https://github.com/stan-dev/cmdstan/tree/perf

In this branch, I added the example-models repo as a submodule. The script does some slight trickery to find model / data file pairs, then compiles and runs them, recording the output to tests/golds (which I was hoping to check in) and writing a timing file to times.csv (not checked in, but always tested by Jenkins on a specific machine).
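
For anyone who wants the rough shape without reading the branch, here is a minimal sketch of the pairing and timing steps. It is not the code in runPerformanceTests.py. It assumes a model `foo.stan` sits next to its data as `foo.data.R`, and `find_model_data_pairs`, `time_models`, and the `run_model` callback are hypothetical names made up for illustration. Compiling and running through CmdStan is left to whatever you pass in as `run_model`.

```python
import csv
import time
from pathlib import Path


def find_model_data_pairs(root):
    """Return (model, data) pairs, assuming foo.stan sits next to foo.data.R."""
    pairs = []
    for model in sorted(Path(root).rglob("*.stan")):
        # Assumed naming convention: same directory, same stem, ".data.R" suffix.
        data = model.with_name(model.stem + ".data.R")
        if data.exists():
            pairs.append((model, data))
    return pairs


def time_models(pairs, run_model, times_path):
    """Call run_model(model, data) per pair and write wall-clock seconds to a CSV."""
    with open(times_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "seconds"])
        for model, data in pairs:
            start = time.perf_counter()
            run_model(model, data)  # hypothetical callback: compile + sample
            elapsed = time.perf_counter() - start
            writer.writerow([str(model), f"{elapsed:.3f}"])
```

Models without a matching data file are skipped here, which is one possible choice. A real script would probably want to run those too, since some models take no data.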

I’m not sure how useful the golds will be if we can’t figure out any way to get reproducibility at least for like, clang + OS X or some reasonable pair.

PS What’s the other framework @betanalpha had? Curious about how to do any kind of portable regression testing here…