Upgrading the models used for performance testing

We have circled around this topic a few times already.

We have a great system set up for performance testing on the math/stan/cmdstan repos, but I believe we should revisit which models are used, or at least the sizes of the input data.

Status quo

We currently use the models from here: https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks
The input data used is also from the same repository.

An example of the output is: https://github.com/stan-dev/math/pull/1844#issuecomment-616205640


Don't get me wrong, this is great and really nice work from @seantalts and @serban-nicusor, but in my view it could use a few upgrades:

  • some models have execution times that are too short, which means a lot of noise and dependence on IO. I have seen things like 5% improvements/regressions on typo fixes. A model with an execution time of 0.2 seconds will always be hard to test with, so we should upgrade those with larger input sizes.

  • it might be nice to add some more representative models to this test suite: a common model used in rstanarm or brms, a model with the algebra solver, reduce_sum, etc.

  • a more comprehensive compilation time test is needed. See Compilation time evolution in cmdstan.
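To make the first point concrete, here is a small illustrative calculation of why short runtimes are so noisy: a fixed per-run overhead (IO, process startup) becomes a large fraction of a 0.2 second benchmark but is negligible for a minute-long one. The 0.05 s overhead and the runtimes below are assumed numbers for illustration, not measured Stan values.

```python
# Illustrative only: why fixed per-run overhead (IO, process startup)
# dominates short benchmarks. The 0.05 s overhead and the runtimes
# below are assumptions, not measurements from the Stan test suite.
overhead = 0.05  # seconds of fixed noise per run (assumed)

for runtime in (0.2, 5.0, 60.0):
    noise_pct = 100.0 * overhead / runtime
    print(f"{runtime:6.1f} s model: overhead is {noise_pct:.2f}% of runtime")
```

With these assumed numbers, the overhead is 25% of a 0.2 s run but well under 1% of a 60 s run, which is why a tiny change to a short-running model can masquerade as a "5% regression".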

Some models could be added to the PR test suite, others to the daily performance test of cmdstan. If anyone feels that some of the models used currently are not representative, that is also a welcome comment.

So if anyone has any model/data combination that they think are good candidates for the performance test suite, please link them here.


Thanks for posting this. I’m also interested in benchmarks involving models with longer execution times.

I’m currently developing a set of benchmarks that will stress-test serialization speed (e.g., to CSV). These benchmarks need to test models that serialize different numbers of parameters each draw, which is another reason longer-running models interest me.

My short-term plan is really simple: I’m going to make a model called 80,000 schools, which expands 8 schools to 80,000 schools using random data.
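A minimal sketch of how such a data file could be generated, assuming a standard 8-schools-style data layout (`J`, `y`, `sigma`). The effect and noise scales below are made-up assumptions, and the output filename is hypothetical:

```python
# Hypothetical sketch: expand the classic 8-schools data to 80,000
# "schools" with random data, to stress-test serialization speed.
# The effect sizes and noise scales here are invented for illustration.
import json
import random

random.seed(1234)

J = 80_000  # 8 schools -> 80,000 schools
# Simulated observed treatment effects y with known standard errors sigma.
sigma = [random.uniform(9.0, 18.0) for _ in range(J)]
theta = [random.gauss(8.0, 6.0) for _ in range(J)]      # assumed true effects
y = [random.gauss(t, s) for t, s in zip(theta, sigma)]  # noisy observations

data = {"J": J, "y": y, "sigma": sigma}
with open("eighty_thousand_schools.data.json", "w") as f:
    json.dump(data, f)
```

The resulting JSON can be fed to any 8-schools-shaped Stan program, so the model code itself needs no changes beyond reading a larger `J`.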

A better solution might be to develop some performance tests of various sizes. I rather like Bazel’s way of classifying test sizes:

| Size | RAM (in MB) | CPU (in CPU cores) | Default timeout |
|------|-------------|--------------------|-----------------|
| small | 20 | 1 | short (1 minute) |
| medium | 100 | 1 | moderate (5 minutes) |
| large | 300 | 1 | long (15 minutes) |
| enormous | 800 | 1 | eternal (60 minutes) |

From: https://docs.bazel.build/versions/master/be/common-definitions.html#common-attributes-tests

For our purposes, RAM isn’t too important; the implied length of the test is.


I recommend using posteriordb, as it makes it easy to use different sets of posteriors (a posterior is a model plus data), and this is one of the use cases we are building it for. You can make a PR for new posteriors, or we can help add them.


Oh yeah, that is a great idea. We could also switch to using posteriordb for the evaluation of all PRs, both in terms of numerical precision and to measure performance.

It already has everything https://github.com/stan-dev/stat_comp_benchmarks has and much more, and that repository has not been updated for three years.

Not sure who decides whether to do it or not :) I am definitely in for making our performance tests more informative. It would also extend the current precision testing. We can discuss that at next week’s meeting; it’s a bit late to add it to this one.

I think this is a great idea, but I don’t think the weekly online meetings are a good forum for discussion, as it favors those devs who have the time to be there.* I propose we keep technical discussions in the open on Discourse, and substantive proposals should be filed via the DesignDocs repo, which is a good way to hash out the details.

*I said this ( and more) in last week’s weekly meeting.


@s.maskell - alerting you to this discussion - not sure if your test models overlap with this set.

I agree. I will make a design doc explaining what would have to be done and where we could use it.


Just to provide some context: the initial stat_comp_benchmarks models were intentionally designed to be simple, low-dimensional models that provided an extraordinarily low bar for new algorithms. In particular, they all broke the ADVI implementation that is still in Stan.

They are not representative of the diversity of user models and are far too simple for performance benchmarks. In my opinion, whether a small model runs in 1 second or 2 seconds is irrelevant; changes start to matter only when models take many minutes, if not hours, to run. When transitioning to Stan3, @seantalts just needed some suite of models to benchmark and regression test against, and these were conveniently available, so he grabbed them. I do not believe they were ever intended to be permanent.

I don’t think that performance tests aimed at capturing the effects of changes in the math library need to span any particular set of user models; rather, they should be designed to exercise certain parts of the math library: a matrix-heavy model, a large ODE model, etc. Tuning the models to run in 1-5 minutes wouldn’t increase the existing testing burden much but would lead to much more accurate timings, as @rok_cesnovar notes.


Thank you @betanalpha!

Exactly the context I was missing and needed.


For what it’s worth, I think it would be really great if we could collate a set of real-world Stan models that captures the size, complexity, and variety found across the community of Stan users.

My understanding is that the best current guess at such a set is the one here: https://github.com/stan-dev/example-models/. This repo has the advantage of being a long list of models (~450 working models, modulo a pull request or two), but I think it would be fair to say that they are far from either curated or representative. At present, I think the purpose of that repo is simply to check that nothing pathological happens: we are certainly using those models (locally, within my team) to see if our novel algorithms ever degrade performance and/or to understand when the performance enhancements we are working to deliver are maximised.

However, if we add to it and include tags (e.g., to denote models that are: well-coded; set up to use data simulated from the model itself; sampling from a distribution that has an analytic solution; sampling from a distribution that is also in posteriordb (here: https://github.com/MansMeg/posteriordb/); small; enormous; or whatever), then we should be able to get away from trying to have one set of models that we use for all purposes and move towards intersecting sets of models, where each set serves one of the multiple things we want to do with a collection of models.
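A minimal sketch of what such tag-based selection could look like. The model names and tag names below are entirely invented for illustration; a real registry would live alongside the models in the repo:

```python
# Hypothetical sketch of tag-based model selection. The model names
# and tags are invented for illustration, not taken from example-models.
catalog = {
    "eight_schools":   {"well_coded", "analytic_posterior", "small"},
    "radon_pooled":    {"well_coded", "simulated_data", "small"},
    "soil_carbon_ode": {"well_coded", "enormous"},
    "legacy_mixture":  {"small"},
}

def models_with(*tags):
    """Return the models carrying all of the requested tags."""
    wanted = set(tags)
    return sorted(name for name, t in catalog.items() if wanted <= t)

# Intersecting sets: one query per use case, drawn from one catalog.
print(models_with("well_coded", "small"))
```

Each use case (PR regression tests, daily performance runs, precision checks) would then just be a different tag query over the same catalog, rather than a separately maintained list of models.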


This is why we are building posteriordb. See the use cases at https://github.com/MansMeg/posteriordb/blob/master/doc/use_cases.md and make a pull request if you want to add something.


@avehtari: good plan. We will do as you suggest.

There’s nothing in Stan that outputs different numbers of parameters per draw. It’d break all of our downstream analysis processes.

Whoever wants to put together the PR.


I think this is where the misunderstanding’s coming from. That repo was just a completely ad hoc place to dump models we wanted to share: models used in the doc, models people sent us to test, models we wrote into case studies, etc. It was never intended to be either robust to running or comprehensive in terms of types of models.

That repo contains pathological models that Stan can’t fit as coded using its current samplers. So I’ve never understood how this set of models would be useful for testing.


@Bob_Carpenter: Thanks for explaining the genesis of the example-models repo.

I should probably clarify that, as people developing new algorithms and wanting to gain confidence that they work well by testing them in the context of lots of specific models, we see the posteriors in example-models as providing a useful set of regression tests. So, I’m thinking of using the models to test that any changes don’t introduce new pathological issues.

I’d be happy if our new algorithms fall over in the same way as Stan does but worried if they fell over more often and/or if our algorithms experience different pathological behaviours that we didn’t see with Stan. Does that make more sense? Of course, it would be preferable to have a large database of models that are representative of Stan’s use in the real world (and I’m keen to contribute to posteriordb such that we have such a database), but the example-models repo has the key advantage that it exists today.

I understand that pragmatism is what’s leading people to want to use a set of pre-existing models for regression testing despite the fact that some of them don’t work.
