Performance tests reports

wds15 · June 16, 2019, 6:01pm

Hi!

First, the automatic performance test reports are great. Here are a few things we could improve, along with some questions:

How to interpret the numbers… I mean is >1 better or <1? I am never sure.
Could the posted report include a link to the actual Jenkins run? Or some reference which makes it easy to find the respective logs? I struggle with that a lot.
Would it be possible to output a few more details, such as compiler variant, compiler version, OS, and CPU used?
Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?

The above is ordered in importance according to my taste, of course.

Best,
Sebastian

seantalts · June 16, 2019, 7:02pm

It’s old / new, so higher is better. We can add text to that effect. The next two should also be easy to add (@serban-nicusor, can you add this to your list? Thanks!). Nic is already working on the last one as well, but is having some difficulties running the tests on Windows.

stevebronder · June 16, 2019, 9:20pm

Should we report back 1 - old / new? Then it’s just negative number == bad and positive number == good.

seantalts · June 17, 2019, 5:18pm

Yeah, I think 1 - new / old is probably what you meant - then if old is 120 seconds and new is 125 seconds we get 1-125/120 = -0.04, so 4% slower. We can also literally add text like “4% slower” and I think we should do that with your formula :)
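The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 1 - new / old formula and the "4% slower" text discussed here; the function names are hypothetical and are not taken from the performance test scripts.

```python
def relative_change(old_seconds: float, new_seconds: float) -> float:
    """Return 1 - new / old: negative means slower, positive means faster."""
    return 1.0 - new_seconds / old_seconds


def describe(old_seconds: float, new_seconds: float) -> str:
    """Format the relative change as text such as '4% slower'."""
    change = relative_change(old_seconds, new_seconds)
    if change == 0:
        return "no change"
    direction = "faster" if change > 0 else "slower"
    return f"{abs(change) * 100:.0f}% {direction}"


# The example from this post: old is 120 seconds, new is 125 seconds.
print(describe(120, 125))
```

With this sign convention a reader does not need to remember which run is in the numerator: the word next to the number carries the direction.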

serban-nicusor · June 17, 2019, 8:32pm

If we are talking about these:

relativeUnstableThresholdPositive: 5

relativeFailedThresholdPositive: 10

It’s based on a threshold. As of now, if one of the tests reports +10%, its computation time has increased by 10% since the last time it ran, which is a slowdown. Our unstable threshold is 5% and our failed threshold is 10%, so the +10% above will fail the build.
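As a rough sketch of how the two thresholds above could be read, the following Python mimics the decision. It is not the Jenkins plugin's implementation; the function name is hypothetical, and whether the plugin triggers when a threshold is reached or only when it is exceeded is an assumption here.

```python
UNSTABLE_THRESHOLD_PERCENT = 5.0   # relativeUnstableThresholdPositive
FAILED_THRESHOLD_PERCENT = 10.0    # relativeFailedThresholdPositive


def classify(previous_seconds: float, current_seconds: float) -> str:
    """Classify a run by its percent increase in time over the previous run."""
    increase = (current_seconds - previous_seconds) / previous_seconds * 100.0
    # Assumption: reaching a threshold is enough to trigger it.
    if increase >= FAILED_THRESHOLD_PERCENT:
        return "failed"
    if increase >= UNSTABLE_THRESHOLD_PERCENT:
        return "unstable"
    return "ok"


# Hypothetical timings in seconds, for illustration only.
print(classify(100.0, 112.0))
print(classify(100.0, 107.0))
print(classify(100.0, 102.0))
```

Note that here a larger number means a slowdown, which is the opposite reading from the old / new ratio in the PR reports.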

wds15 · June 17, 2019, 9:03pm

And you have defined the sign conventions correctly? I would easily mess this up given these definitions, since greater numbers in the reports are better, but you suggest that we stop things if the numbers are largish.

serban-nicusor · June 17, 2019, 9:43pm

That’s how it’s supposed to be working.
See: https://jenkins.io/doc/pipeline/steps/performance/

seantalts · June 21, 2019, 3:11am

We’re talking about a couple of different things here. Nic is talking about the build we run every time we merge to CmdStan master, which tests that results haven’t changed much. Sebastian is talking about the relative performance tests that run and comment on GitHub PRs. The results from those are not obviously interpretable in the current format, mostly because it’s not clear how they’re computed. We can do the things in Sebastian’s original post and use Steve’s 1 - new / old suggestion, with the text “% slower/faster” or similar next to it, and probably print the formula for good measure, too. That will then be included on the pull request and be easier for newcomers to interpret.

serban-nicusor · June 27, 2019, 8:56pm

Yes, wip.

Yes, it would be handy to have a list of which details and commands should appear in the comments.

Regarding “Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?”:

See the custom branch of stan-dev/performance-tests-cmdstan on GitHub (scroll down for quick info).

@wds15

wds15 · July 3, 2019, 5:46pm

This custom build thing looks cool… I am trying it… let’s see where I get the results (and I am really curious to see the performance comparison on Windows for the TBB branch).

wds15 · July 3, 2019, 8:10pm

Hmm… I started a test build and got errors with respect to gold comparisons… are estimates compared to gold? Does that make sense?

stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse method=sample num_samples=1000 num_warmup=1000 data file=stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.data.R random seed=1234 output file=golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold.tmp
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param sigma.1 |0.953794217 - 1.127675695| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param sigma.2 |1.091161228 - 0.929372467| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param mu.1 |-0.756782903386 - 0.302696568708| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param mu.2 |0.21614638705 - -0.892994255085| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param theta |0.4531778815 - 0.585157671| not within 2e-08

I would like to compare performance and not gold results.

(but this looks super cool)
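For readers puzzling over the FAIL lines, here is a minimal sketch of the kind of absolute-tolerance check those messages suggest. It is an assumption about the check, not the test harness code; the function name is hypothetical, and the log does not say which of the two numbers is the gold value.

```python
TOLERANCE = 2e-08  # taken from the "not within 2e-08" messages


def within_tolerance(first: float, second: float, tol: float = TOLERANCE) -> bool:
    """Return True if the two values differ by at most tol in absolute terms."""
    return abs(first - second) <= tol


# (parameter, first value, second value) copied from the FAIL lines above.
failures = [
    ("sigma.1", 0.953794217, 1.127675695),
    ("sigma.2", 1.091161228, 0.929372467),
    ("mu.1", -0.756782903386, 0.302696568708),
    ("mu.2", 0.21614638705, -0.892994255085),
    ("theta", 0.4531778815, 0.585157671),
]

for name, first, second in failures:
    difference = abs(first - second)
    print(f"{name}: difference {difference:.3g}, ok={within_tolerance(first, second)}")
```

A tolerance this tight only passes when the two builds produce essentially identical output, so any change that alters the results at all will show up as a gold failure.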

serban-nicusor · August 19, 2019, 7:19pm

@wds15

You can see the PR here: https://github.com/stan-dev/performance-tests-cmdstan/pull/24
An example of the output here: https://github.com/stan-dev/stan/pull/2761

Please take a look and let me know if there should be any changes before merge.

wds15 · August 19, 2019, 8:03pm

Just looked at it… and it is a lot better than what we had before.

The links do not work (Blue Ocean & Jenkins Console Log).
What is confusing to me is that I see clang and gcc info (great)… but what did we use to run the performance tests?
Could we add back the total speedup / slowdown (excluding the compilation)?

serban-nicusor · August 19, 2019, 8:17pm

The links are broken because of the job number in the url, which points to a separate job I made to test this.

An example of Jenkins Console Log would look like this: https://jenkins.mc-stan.org/job/stan/view/change-requests/job/PR-2761/12/consoleFull

Changed 95 to 12.

Blue Ocean:
https://jenkins.mc-stan.org/blue/organizations/jenkins/Stan/detail/PR-2761/12/pipeline

Changed 95 to 12 and changed stan to Stan, since Blue Ocean is case sensitive. Will push a fix in 1 min.

Hmm, what do you mean by that? Can you please provide an example? Thanks.

Ratio keeps the result of old/new.
See: https://github.com/stan-dev/performance-tests-cmdstan/blob/e44eb5f1c1c2aad91f1a5dfdc20aae727d7f7f71/comparePerformance.py#L31

I hope this is what you mean by speedup/slowdown
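One way to read "total speedup / slowdown (excluding the compilation)" is a single old / new ratio over the summed run times, leaving the compilation entry out. The sketch below is not the code at the link above; the function name, the dictionary layout, and the "compilation" key are all hypothetical.

```python
def total_ratio(old_times, new_times, exclude=("compilation",)):
    """Return sum(old) / sum(new) over shared entries, skipping excluded ones."""
    names = [n for n in old_times if n in new_times and n not in exclude]
    old_total = sum(old_times[n] for n in names)
    new_total = sum(new_times[n] for n in names)
    return old_total / new_total


# Hypothetical timings in seconds, for illustration only.
old = {"compilation": 300.0, "model_a": 10.0, "model_b": 20.0}
new = {"compilation": 310.0, "model_a": 8.0, "model_b": 22.0}

# A value above 1 means the new code is faster overall.
print(f"{total_ratio(old, new):.2f}")
```

Summing before dividing weights long-running models more heavily; averaging the per-model ratios instead would weight every model equally, so the two totals can disagree.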

stevebronder · August 20, 2019, 10:31am

Love it! How long would it take to run these models like 6-10 times? I’m looking at the eight schools example here which is 8% faster on your PR (nice speedup :-P). I think the results would be a bit more consistent if we took the average run time from a few runs.
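The idea of averaging over several runs could look roughly like this. It is a sketch only: the helper names are hypothetical, and the placeholder workload stands in for running a compiled model, which is not shown here.

```python
import statistics
import time


def time_runs(run, repeats=6):
    """Call run() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the mean and sample standard deviation of the durations."""
    return statistics.mean(durations), statistics.stdev(durations)


def placeholder_model():
    # Stand-in workload; a real test would run a model here.
    sum(i * i for i in range(100_000))


durations = time_runs(placeholder_model, repeats=6)
mean_seconds, spread = summarize(durations)
print(f"mean {mean_seconds:.4f}s, standard deviation {spread:.4f}s over {len(durations)} runs")
```

Reporting the spread alongside the mean helps judge whether a difference such as 8% is larger than the run-to-run noise.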

stevebronder · August 20, 2019, 10:34am

And yes, I do see we take the average below, but there are times when individual model results can be interesting. One example: I was changing some things in the memory model a while ago and noticed that some performance tests were faster while some were slower. It turns out the change was very good for models that used vectors but not for those that looped over an array.
