Performance tests reports

Hi!

First, the automatic performance test reports are great. Here are a few things we could improve, along with some questions I have:

  • How to interpret the numbers… I mean is >1 better or <1? I am never sure.
  • Could the posted report include a link to the actual Jenkins run? Or some reference which makes it easy to find the respective logs? I struggle with that a lot.
  • Would it be possible to output a few more details like compiler variant, compiler version, os & CPU used?
  • Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?

The above is ordered in importance according to my taste, of course.

Tagging @seantalts & @serban-nicusor .

Best,
Sebastian

It’s old / new, so higher is better. We can add text to that effect. The next two should also be easy to add (@serban-nicusor, can you add this to your list? Thanks!). Nic is already working on the last one as well, but is having some difficulties running the tests on Windows.

Should we report back 1 - old / new? Then it’s just negative number == bad and positive number == good.

Yeah, I think 1 - new / old is probably what you meant - then if old is 120 seconds and new is 125 seconds we get 1-125/120 = -0.04, so 4% slower. We can also literally add text like “4% slower” and I think we should do that with your formula :)
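
The 1 - new / old convention with the “% slower/faster” text could be sketched roughly as below. This is only an illustration: the function names `relative_change` and `describe` are hypothetical and not taken from the performance test scripts.

```python
def relative_change(old_seconds: float, new_seconds: float) -> float:
    """Return 1 - new / old: positive means faster, negative means slower."""
    if old_seconds <= 0:
        raise ValueError("old_seconds must be positive")
    return 1.0 - new_seconds / old_seconds


def describe(old_seconds: float, new_seconds: float) -> str:
    """Turn the relative change into text such as '4.2% slower'."""
    change = relative_change(old_seconds, new_seconds)
    if change == 0:
        return "no change"
    direction = "faster" if change > 0 else "slower"
    return f"{abs(change) * 100:.1f}% {direction}"


# The example from the post: old is 120 seconds, new is 125 seconds.
print(describe(120.0, 125.0))  # prints "4.2% slower"
```

With this sign convention a negative value always means a slowdown, so the number and the text cannot disagree.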

If we are talking about these:

relativeUnstableThresholdPositive: 5

relativeFailedThresholdPositive: 10

It’s based on a threshold. As of now, if one of the tests reports +10%, its computation time has increased by 10% since the last time it ran, which is a slowdown. With the unstable threshold at 5% and the failed threshold at 10%, that +10% will fail the build.
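
A rough sketch of how two such thresholds could classify a run is below. The two values come from the settings quoted above; the function `classify` is hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption, not something confirmed by the Jenkins plugin documentation.

```python
UNSTABLE_THRESHOLD_PCT = 5.0   # relativeUnstableThresholdPositive
FAILED_THRESHOLD_PCT = 10.0    # relativeFailedThresholdPositive


def classify(previous_seconds: float, current_seconds: float) -> str:
    """Classify a run by its percentage increase in run time over the previous run.

    Here a positive percentage means the run got slower. Assumption: the
    thresholds are inclusive; the real plugin may compare differently.
    """
    increase_pct = (current_seconds - previous_seconds) / previous_seconds * 100.0
    if increase_pct >= FAILED_THRESHOLD_PCT:
        return "FAILED"
    if increase_pct >= UNSTABLE_THRESHOLD_PCT:
        return "UNSTABLE"
    return "OK"


for previous, current in [(100.0, 101.0), (100.0, 107.0), (100.0, 112.0)]:
    print(previous, current, classify(previous, current))
```

Note that the sign here is the opposite of the old / new ratio in the PR reports: in this check a larger number means slower.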

And have you defined the sign conventions correctly? I would easily mess this up given these definitions, since greater numbers in the reports are better, yet you suggest that we stop things if the numbers are largish.

That’s how it’s supposed to work.
See: https://jenkins.io/doc/pipeline/steps/performance/

We’re talking about a couple of different things here. Nic is talking about the build we run every time we merge to CmdStan master, which tests that results haven’t changed much. Sebastian is talking about the relative performance tests that run and comment on GitHub PRs. The results from those are not obviously interpretable in the current format, mostly because it’s not clear how they’re computed. We can do the things listed in Sebastian’s original post and use Steve’s 1 - new / old suggestion with the text “% slower/faster” or similar next to it, and probably print the formula for good measure, too. That will then be included on the pull request and be easier for newcomers to interpret.

Yes, wip.

Yes, it would be handy to have a list of which details and commands should appear in the comments.

  • Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?

See: https://github.com/stan-dev/performance-tests-cmdstan/tree/custom
(Scroll down for quick info.)

@wds15

This custom build thing looks cool… I am trying it… let’s see where I get the results (and I am really curious to see the performance comparison on Windows for the TBB branch).

Hmm… I started a test build and got errors with respect to the gold comparisons… are estimates compared to gold? Does that make sense?

stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse method=sample num_samples=1000 num_warmup=1000 data file=stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.data.R random seed=1234 output file=golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold.tmp
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param sigma.1 |0.953794217 - 1.127675695| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param sigma.2 |1.091161228 - 0.929372467| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param mu.1 |-0.756782903386 - 0.302696568708| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param mu.2 |0.21614638705 - -0.892994255085| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param theta |0.4531778815 - 0.585157671| not within 2e-08

I would like to compare performance and not gold results.

(but this looks super cool)
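
For readers unfamiliar with the FAIL lines above, the check they report can be sketched as an absolute-difference comparison against stored values. This is not the actual test code: the function `compare_to_gold` is hypothetical, and which of the two numbers in each log line is the gold value is an assumption.

```python
TOLERANCE = 2e-08  # the tolerance shown in the FAIL lines


def compare_to_gold(gold: dict, actual: dict, tol: float = TOLERANCE) -> list:
    """Return one failure message per parameter that differs from gold by more than tol."""
    failures = []
    for name, gold_value in gold.items():
        value = actual[name]
        if abs(gold_value - value) > tol:
            failures.append(f"param {name} |{gold_value} - {value}| not within {tol}")
    return failures


# Values copied from the 'theta' line of the log; the gold/actual order is assumed.
for message in compare_to_gold({"theta": 0.4531778815}, {"theta": 0.585157671}):
    print("FAIL:", message)
```

A check this tight only passes when the run reproduces the stored draws almost exactly, which is a different question from how fast the run was.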

@wds15

You can see the PR here: https://github.com/stan-dev/performance-tests-cmdstan/pull/24
An example of the output here: https://github.com/stan-dev/stan/pull/2761

Please take a look and let me know if there should be any changes before merge.

Just looked at it… and it is a lot better than what we had before.

  • the links do not work (Blue Ocean & Jenkins Console Log)
  • what is confusing to me is that I see clang and gcc info (great)… but what did we use to run the performance tests?
  • could we add again the total speedup / slowdown (excluding the compilation)?

The links are broken because of the job number in the url, which points to a separate job I made to test this.

An example of Jenkins Console Log would look like this: https://jenkins.mc-stan.org/job/stan/view/change-requests/job/PR-2761/12/consoleFull

  • changed 95 to 12

Blue Ocean:
https://jenkins.mc-stan.org/blue/organizations/jenkins/Stan/detail/PR-2761/12/pipeline

  • Changed 95 to 12
  • Changed stan to Stan; Blue Ocean is case sensitive. Will push a fix in 1 min.
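
Building both links from the PR number and the build number could look roughly like this. The URL patterns are copied from the two example links above; the function names are hypothetical and this is not the code in the PR.

```python
JENKINS_BASE = "https://jenkins.mc-stan.org"


def console_log_url(pr_number: int, build_number: int) -> str:
    """Classic Jenkins console log; the job name is lowercase 'stan' here."""
    return (
        f"{JENKINS_BASE}/job/stan/view/change-requests/"
        f"job/PR-{pr_number}/{build_number}/consoleFull"
    )


def blue_ocean_url(pr_number: int, build_number: int) -> str:
    """Blue Ocean pipeline view; the path is case sensitive, so 'Stan' is capitalized."""
    return (
        f"{JENKINS_BASE}/blue/organizations/jenkins/Stan/"
        f"detail/PR-{pr_number}/{build_number}/pipeline"
    )


print(console_log_url(2761, 12))
print(blue_ocean_url(2761, 12))
```

Passing the build number in explicitly avoids picking up the number of a different job, which is what broke the links in the first place.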

Hmm, what do you mean by that? Can you please provide an example? Thanks.

Ratio keeps the result of old/new.
See: https://github.com/stan-dev/performance-tests-cmdstan/blob/e44eb5f1c1c2aad91f1a5dfdc20aae727d7f7f71/comparePerformance.py#L31

I hope this is what you mean by speedup/slowdown

Love it! How long would it take to run these models like 6-10 times? I’m looking at the eight schools example here which is 8% faster on your PR (nice speedup :-P). I think the results would be a bit more consistent if we took the average run time from a few runs.

And yes, I do see we take the average below, but there are times when individual model results can be interesting. For example, I was changing some things in the memory model a while ago and noticed that some performance tests were faster while others were slower. It turns out the change was very good for models that used vectors, but not for those that looped over an array.
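
Repeating each model a few times and reporting per-model statistics could be sketched as follows. This is only an illustration: `time_runs` and `summarize` are hypothetical helpers, and `run_model` stands in for whatever actually launches a model.

```python
import statistics
import time


def time_runs(run_model, repeats: int = 6) -> list:
    """Call run_model() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_model()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: list) -> dict:
    """Mean and spread for one model, so single noisy runs stand out less."""
    return {
        "mean": statistics.mean(durations),
        "stdev": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min": min(durations),
        "max": max(durations),
    }


# Placeholder workload; a real run would execute the compiled model instead.
print(summarize(time_runs(lambda: sum(range(100_000)), repeats=6)))
```

Keeping the summary per model, rather than only one overall average, preserves the case described above where some models get faster and others slower.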