We have circled around this topic a few times already.
We have a great system set up for performance testing on the math/stan/cmdstan repos, but I believe we should revisit which models are used, or at least the sizes of their input data.
We currently use the models from here: https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks
The input data used is also from the same repository.
An example of the output is: https://github.com/stan-dev/math/pull/1844#issuecomment-616205640
Some models have execution times that are too short, which means a lot of noise and dependence on IO. I have seen things like 5% improvements/regressions on typo fixes. A model with an execution time of 0.2 seconds will always be hard to benchmark reliably, so we should upgrade those with larger input sizes.
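To illustrate why short runtimes are problematic, here is a minimal sketch (not part of the current test harness) that simulates timing measurements with a fixed absolute overhead noise, e.g. from IO and process startup. The `0.01` second noise level is an assumption for illustration; the point is that the same absolute noise is a much larger fraction of a 0.2 s run than of a 20 s run.

```python
import random
import statistics

def relative_noise(runtime_s, overhead_sd_s=0.01, n=100):
    """Simulate n timing measurements of a run with the given true
    runtime plus Gaussian absolute noise, and return the standard
    deviation as a fraction of the mean (relative noise)."""
    random.seed(0)
    samples = [runtime_s + random.gauss(0, overhead_sd_s) for _ in range(n)]
    return statistics.stdev(samples) / statistics.mean(samples)

# Relative noise for a 0.2 s model vs. a 20 s model with the same
# absolute measurement noise:
short_run = relative_noise(0.2)
long_run = relative_noise(20.0)
```

With these assumed numbers the 0.2 s model shows roughly 5% relative noise, in the same ballpark as the spurious improvements/regressions mentioned above, while the 20 s model is two orders of magnitude more stable.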
It might also be nice to add some more representative models to this test suite: a common model used in rstanarm or brms, a model using the algebra solver, one using reduce_sum, etc.
A more comprehensive compilation-time test is also needed. See Compilation time evolution in cmdstan.
Some models could be added to the PR test suite, others to the daily cmdstan performance test. If anyone feels that any of the models currently used are not representative, that is also a welcome comment.
So if anyone has model/data combinations that they think are good candidates for the performance test suite, please link them here.