Upgrading the models used for performance testing

We have circled around this topic a few times already.

We have a great system set up for performance testing on the math/stan/cmdstan repos, but I believe we should revisit which models are used, or at least the sizes of the input data.

Status quo

We currently use the models from here: https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks
The input data used is also from the same repository.

An example of the output is: https://github.com/stan-dev/math/pull/1844#issuecomment-616205640


Don't get me wrong, this is great and really nice work from @seantalts and @serban-nicusor, but in my view it could use a few upgrades:

  • some models have execution times that are too short, which means a lot of noise and dependence on IO. I have seen things like 5% improvements/regressions on typo fixes. A model with an execution time of 0.2 seconds will always be hard to test with, so we should upgrade those with larger input sizes.

  • it might be nice to add some more representative models to this test suite: a common model used in rstanarm or brms, a model with the algebra solver, reduce_sum, etc.

  • a more comprehensive compilation time test is needed. See Compilation time evolution in cmdstan.
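To make the first point concrete, here is a small illustrative calculation of why short runtimes are so noisy: a fixed per-run overhead (IO, process startup) becomes a large fraction of a 0.2 second benchmark but is negligible for a minute-long one. The 0.05 s overhead and the runtimes below are assumed numbers for illustration, not measured Stan values.

```python
# Illustrative only: why fixed per-run overhead (IO, process startup)
# dominates short benchmarks. The 0.05 s overhead and the runtimes
# below are assumptions, not measurements from the Stan test suite.
overhead = 0.05  # seconds of fixed noise per run (assumed)

for runtime in (0.2, 5.0, 60.0):
    noise_pct = 100.0 * overhead / runtime
    print(f"{runtime:6.1f} s model: overhead is {noise_pct:.2f}% of runtime")
```

With these assumed numbers, the overhead is 25% of a 0.2 s run but well under 1% of a 60 s run, which is why a tiny change to a short-running model can masquerade as a "5% regression".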

Some models could be added to the PR test suite, others to the daily performance test of cmdstan. If anyone feels that some of the models used currently are not representative, that is also a welcome comment.

So if anyone has any model/data combination that they think are good candidates for the performance test suite, please link them here.


Thanks for posting this. I’m also interested in benchmarks involving models with longer execution times.

I’m currently developing a set of benchmarks that will stress-test serialization speed (e.g., to CSV). These benchmarks need to test models that serialize different numbers of parameters each draw, which is another reason longer-running models interest me.

My short-term plan is really simple: I’m going to make a model called 80,000 schools, which expands 8 schools to 80,000 schools using random data.
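A minimal sketch of how such a data file could be generated, assuming a standard 8-schools-style data layout (`J`, `y`, `sigma`). The effect and noise scales below are made-up assumptions, and the output filename is hypothetical:

```python
# Hypothetical sketch: expand the classic 8-schools data to 80,000
# "schools" with random data, to stress-test serialization speed.
# The effect sizes and noise scales here are invented for illustration.
import json
import random

random.seed(1234)

J = 80_000  # 8 schools -> 80,000 schools
# Simulated observed treatment effects y with known standard errors sigma.
sigma = [random.uniform(9.0, 18.0) for _ in range(J)]
theta = [random.gauss(8.0, 6.0) for _ in range(J)]      # assumed true effects
y = [random.gauss(t, s) for t, s in zip(theta, sigma)]  # noisy observations

data = {"J": J, "y": y, "sigma": sigma}
with open("eighty_thousand_schools.data.json", "w") as f:
    json.dump(data, f)
```

The resulting JSON can be fed to any 8-schools-shaped Stan program, so the model code itself needs no changes beyond reading a larger `J`.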

A better solution might be to develop some performance tests of various sizes. I rather like Bazel’s way of classifying test sizes:

| Size | RAM (in MB) | CPU (in CPU cores) | Default timeout |
|------|-------------|--------------------|-----------------|
| small | 20 | 1 | short (1 minute) |
| medium | 100 | 1 | moderate (5 minutes) |
| large | 300 | 1 | long (15 minutes) |
| enormous | 800 | 1 | eternal (60 minutes) |

From: https://docs.bazel.build/versions/master/be/common-definitions.html#common-attributes-tests

For our purposes, RAM isn’t too important; the implied length of the test is.


I recommend using posteriordb, as it makes it easy to use different sets of posteriors (a posterior is a model plus data), and this is one of the use cases we are building it for. You can make a PR for new posteriors, or we can help add them.


Oh yeah, that is a great idea. We could also switch to using posteriordb for the evaluation of all PRs, both in terms of numerical precision and to measure performance.

It already has everything https://github.com/stan-dev/stat_comp_benchmarks has and much more, and that repository has not been updated for three years.

Not sure who decides whether to do it or not :) I am definitely in for making our performance tests more informative. It would also extend the current precision testing. We can discuss that at next week’s meeting; it’s a bit late to add it to this one.

I think this is a great idea, but I don’t think the weekly online meetings are a good forum for discussion, as it favors those devs who have the time to be there.* I propose we keep technical discussions in the open on Discourse, and substantive proposals should be filed via the DesignDocs repo, which is a good way to hash out the details.

*I said this ( and more) in last week’s weekly meeting.


@s.maskell - alerting you to this discussion - not sure if your test models overlap with this set.

I agree. I will make a design doc explaining what would have to be done and where we could use it.


Just to provide some context: the initial stat_comp_benchmarks models were intentionally designed to be simple, low-dimensional models that provided an extraordinarily low bar for new algorithms. In particular, they all broke the ADVI implementation that is still in Stan.

They are not representative of the diversity of user models and are far too simple for performance benchmarks. In my opinion, whether a small model runs in 1 second or 2 seconds is irrelevant; changes start to matter only when models take many minutes, if not hours, to run. When transitioning to Stan3, @seantalts just needed some suite of models to benchmark and regression test against, and these were conveniently available, so he grabbed them. I do not believe they were ever intended to be permanent.

I don’t think that performance tests aimed at capturing the effects of changes in the math library need to span any particular set of user models; rather, they should be designed to exercise certain parts of the math library: a matrix-heavy model, a large ODE model, etc. Tuning the models to run in 1-5 minutes wouldn’t increase the existing testing burden much but would lead to much more accurate timings, as @rok_cesnovar notes.


Thank you @betanalpha!

Exactly the context I was missing and needed.


For what it’s worth, I think it would be really great if we could collate a set of real-world Stan models that captures the size, complexity, and variety found across the community of Stan users.

My understanding is that the best current guess at such a set is the one here: https://github.com/stan-dev/example-models/. This repo has the advantage of being a long list of models (~450 working models, modulo a pull request or two), but I think it would be fair to say that they are far from either curated or representative. At present, I think the purpose of that repo is simply to check that nothing pathological happens: we are certainly using those models (locally, within my team) to see if our novel algorithms ever degrade performance and/or to understand when the performance enhancements we are working to deliver are maximised.

However, if we add to it and include tags (e.g., to denote models that are: well-coded; set up to use data simulated from the model itself; sampling from a distribution that has an analytic solution; sampling from a distribution that is also in posteriordb (here: https://github.com/MansMeg/posteriordb/); small; enormous; or whatever), then we should be able to get away from trying to have one set of models that we use for all purposes and move towards intersecting sets of models, where each set serves one of the multiple things we want to do with a collection of models.
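A minimal sketch of what such tag-based selection could look like. The model names and tag names below are entirely invented for illustration; a real registry would live alongside the models in the repo:

```python
# Hypothetical sketch of tag-based model selection. The model names
# and tags are invented for illustration, not taken from example-models.
catalog = {
    "eight_schools":   {"well_coded", "analytic_posterior", "small"},
    "radon_pooled":    {"well_coded", "simulated_data", "small"},
    "soil_carbon_ode": {"well_coded", "enormous"},
    "legacy_mixture":  {"small"},
}

def models_with(*tags):
    """Return the models carrying all of the requested tags."""
    wanted = set(tags)
    return sorted(name for name, t in catalog.items() if wanted <= t)

# Intersecting sets: one query per use case, drawn from one catalog.
print(models_with("well_coded", "small"))
```

Each use case (PR regression tests, daily performance runs, precision checks) would then just be a different tag query over the same catalog, rather than a separately maintained list of models.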


This is why we are building posteriordb. See the use cases at https://github.com/MansMeg/posteriordb/blob/master/doc/use_cases.md and make a pull request if you want to add something.


@avehtari: good plan. We will do as you suggest.

There’s nothing in Stan that outputs different numbers of parameters per draw. It’d break all of our downstream analysis processes.

Whoever wants to put together the PR.


I think this is where the misunderstanding’s coming from. That repo was just a completely ad hoc place to dump models we wanted to share: models used in the doc, models people sent us to test, models we wrote into case studies, etc. It was never intended to be either robust to running or comprehensive in terms of types of models.

That repo contains pathological models that Stan can’t fit as coded using its current samplers. So I’ve never understood how this set of models would be useful for testing.


@Bob_Carpenter: Thanks for explaining the genesis of the example-models repo.

I should probably clarify that, as people developing new algorithms and wanting to gain confidence that they work well by testing them in the context of lots of specific models, we see the posteriors in example-models as providing a useful set of regression tests. So, I’m thinking of using the models to test that any changes don’t introduce new pathological issues.

I’d be happy if our new algorithms fall over in the same way as Stan does but worried if they fell over more often and/or if our algorithms experience different pathological behaviours that we didn’t see with Stan. Does that make more sense? Of course, it would be preferable to have a large database of models that are representative of Stan’s use in the real world (and I’m keen to contribute to posteriordb such that we have such a database), but the example-models repo has the key advantage that it exists today.

I understand that pragmatism is what’s leading people to want to use a set of pre-existing models for regression testing despite the fact that some of them don’t work.
