Upgrading the models used for performance testing

Just to provide some context: the initial stat comp benchmark models were intentionally designed to be simple, low-dimensional models that provided an extraordinarily low bar for new algorithms. In particular, they all broke the ADVI implementation that is still in Stan.

They are not representative of the diversity of user models and are way too simple for performance benchmarks. In my opinion, whether a small model runs in 1 second or 2 seconds is irrelevant; changes start to matter only when models take many minutes, if not hours, to run. When transitioning to Stan3, @seantalts just needed some suite of models to benchmark and regression test against, and these were conveniently available, so he just grabbed them. I do not believe that they were ever intended to be permanent.

I don’t think that performance tests aimed at capturing the effects of changes in the math library need to span any particular set of user models; rather, they should be designed to exercise specific parts of the math library: a matrix-heavy model, a large ODE model, etc. Given the existing testing burden, tuning the models to run in 1-5 minutes wouldn’t add much to it but would lead to much more accurate timings, as @rok_cesnovar notes.
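
To make the timing argument concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration, not anything from the benchmark suite: the function names, the 0.5 s jitter figure, and the cubic cost scaling for a matrix-heavy model are all assumptions.

```python
def relative_timing_noise(runtime_s: float, jitter_s: float) -> float:
    """Fraction of a measured runtime that a fixed amount of jitter represents.

    jitter_s is an assumed run-to-run variation (scheduling, I/O, warm-up),
    treated here as independent of how long the model runs.
    """
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return jitter_s / runtime_s


def size_for_target(n0: int, t0_s: float, target_s: float, exponent: float) -> int:
    """Problem size expected to hit target_s, assuming cost grows as n**exponent.

    n0 and t0_s are a small pilot run; exponent=3.0 is a guess for dense
    matrix operations and would need checking against real timings.
    """
    if n0 <= 0 or t0_s <= 0 or target_s <= 0 or exponent <= 0:
        raise ValueError("all arguments must be positive")
    return round(n0 * (target_s / t0_s) ** (1.0 / exponent))


if __name__ == "__main__":
    # With an assumed 0.5 s of jitter, a 2 s run is far noisier than a 120 s run.
    for runtime in (2.0, 120.0):
        print(runtime, relative_timing_noise(runtime, jitter_s=0.5))
    # Scale a hypothetical pilot (n=100 takes 1 s) up to an 8 s run.
    print(size_for_target(100, 1.0, 8.0, exponent=3.0))
```

Under these assumptions the same absolute jitter is a quarter of a 2-second run but well under one percent of a 2-minute run, which is the sense in which longer-running models give more trustworthy timings.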
