Performance test reports


First off: the automatic performance test reports are great. Here are a few things we could improve, and some questions:

  • How should I interpret the numbers? I mean, is >1 better, or <1? I am never sure.
  • Could the posted report include a link to the actual Jenkins run? Or some reference which makes it easy to find the respective logs? I struggle with that a lot.
  • Would it be possible to output a few more details, like compiler variant, compiler version, OS & CPU used?
  • Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?

The above is ordered in importance according to my taste, of course.

Tagging @seantalts & @serban-nicusor .


It’s old / new, so higher is better. We can add text to that effect. The next two should also be easy to add (@serban-nicusor, can you add this to your list? Thanks!). And Nic is already working on the last one as well, but is having some difficulties running the tests on Windows.

Should we report back 1 - old / new? Then it’s just: negative number == bad, positive number == good.

Yeah, I think 1 - new / old is probably what you meant. Then if old is 120 seconds and new is 125 seconds, we get 1 - 125/120 ≈ -0.04, so about 4% slower. We can also literally add text like “4% slower”, and I think we should do that with your formula :)
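To make the convention concrete, here is a minimal sketch of that computation and the suggested “% slower/faster” text. The helper names are hypothetical, not the actual report code:

```python
def relative_change(old_seconds: float, new_seconds: float) -> float:
    """Return 1 - new/old: positive means the new code is faster."""
    return 1.0 - new_seconds / old_seconds

def describe(old_seconds: float, new_seconds: float) -> str:
    """Render the change as human-readable text, plus the raw formula value."""
    change = relative_change(old_seconds, new_seconds)
    direction = "faster" if change >= 0 else "slower"
    return f"{abs(change) * 100:.1f}% {direction} (1 - new/old = {change:+.3f})"

# The example from above: old = 120 s, new = 125 s -> about 4% slower.
print(describe(120.0, 125.0))
```

Printing both the percentage and the formula value, as suggested, means nobody has to remember which sign convention the report uses.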


If we are talking about these:

relativeUnstableThresholdPositive: 5

relativeFailedThresholdPositive: 10

It’s based on a threshold: as of now, if one of the tests comes in at +10%, it means its computation time increased by 10% since the last time it ran, i.e. a slowdown. Our unstable threshold is 5% and our failed threshold is 10%, so the +10% above will fail the build.

And have you defined the sign conventions correctly? I could easily mess this up given these definitions, since greater numbers in the reports are better, but you are suggesting we stop things when the numbers get largish.

That’s how it’s supposed to be working.

We’re talking about a couple of different things here. Nic is talking about the build we run every time we merge to CmdStan master, which tests that results haven’t changed much. Sebastian is talking about the relative performance tests that run on, and comment on, GitHub PRs. The results from those are not obviously interpretable in the current format, mostly because it’s not clear how they’re computed. But we can do the stuff from Sebastian’s original post, using Steve’s 1 - new / old suggestion with text like “% slower/faster” next to it, and probably just print the formula for good measure, too. Then that will get included on the pull request and be easier for newcomers to interpret.

Yes, wip.

Yes, it would be handy to have a list of which details and commands should appear in the comments.

  • Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?

( Scroll down for quick info )


This custom build thing looks cool… I am trying it… let’s see where I get the results (and I am really curious to see the performance comparison on Windows for the TBB branch).

Hmm… I started a test build and got errors with respect to gold comparisons… are estimates compared against golds? Does that make sense?

stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse method=sample num_samples=1000 num_warmup=1000 data file=stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/ random seed=1234 output file=golds/
FAIL: golds/ param sigma.1 |0.953794217 - 1.127675695| not within 2e-08
FAIL: golds/ param sigma.2 |1.091161228 - 0.929372467| not within 2e-08
FAIL: golds/ param mu.1 |-0.756782903386 - 0.302696568708| not within 2e-08
FAIL: golds/ param mu.2 |0.21614638705 - -0.892994255085| not within 2e-08
FAIL: golds/ param theta |0.4531778815 - 0.585157671| not within 2e-08
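For reference, the gold comparison producing those FAIL lines is essentially an absolute-tolerance check on each parameter estimate; a minimal sketch (the 2e-08 tolerance matches the log above, the function name is hypothetical):

```python
TOLERANCE = 2e-08

def check_param(name: str, gold: float, estimate: float) -> str:
    """Compare one estimate against its stored gold value."""
    diff = abs(gold - estimate)
    if diff > TOLERANCE:
        return f"FAIL: param {name} |{gold} - {estimate}| not within {TOLERANCE:g}"
    return f"OK: param {name}"

# Sampled estimates vary run to run, so a tolerance this tight only makes
# sense with a fixed seed and bitwise-identical binaries -- hence the
# question of whether gold checks belong in a performance comparison.
print(check_param("sigma.1", 0.953794217, 1.127675695))
```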

I would like to compare performance and not gold results.

(but this looks super cool)