First, the automatic performance test reports are great. Here are a few things we could improve, along with some questions I have:
How should I interpret the numbers… I mean, is >1 better, or <1? I am never sure.
Could the posted report include a link to the actual Jenkins run? Or some reference which makes it easy to find the respective logs? I struggle with that a lot.
Would it be possible to output a few more details, like compiler variant, compiler version, OS & CPU used?
Would it also be possible to request performance tests to run on a specific platform (linux/macOS/Windows)?
The above is ordered in importance according to my taste, of course.
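To make the request for compiler, OS and CPU details concrete, here is a rough sketch of how a report script could collect them. It is only an illustration: the function name `describe_environment` and the default compiler `g++` are assumptions, not part of the existing test scripts.

```python
import platform
import shutil
import subprocess


def describe_environment(compiler="g++"):
    """Collect a few machine and compiler details for a report (hypothetical helper)."""
    info = {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "cpu": platform.processor() or "unknown",
        "compiler": compiler,
        "compiler_version": "unknown",
    }
    # Only query the compiler if it is actually on the PATH.
    if shutil.which(compiler) is not None:
        try:
            result = subprocess.run(
                [compiler, "--version"],
                capture_output=True, text=True, timeout=10,
            )
            lines = result.stdout.splitlines()
            if lines:
                # The first line of --version output usually names the version.
                info["compiler_version"] = lines[0]
        except (OSError, subprocess.SubprocessError):
            pass  # leave the version as "unknown"
    return info
```

Printing the returned dictionary at the top of each posted report would be enough to tell runs on different machines apart.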
It’s old / new, so higher is better. We can add text to that effect. The next two should also be easy to add (@serban-nicusor, can you add this to your list? Thanks!). Nic is already working on the last one as well, but is having some difficulties running the tests on Windows.
Yeah, I think 1 - new / old is probably what you meant - then if old is 120 seconds and new is 125 seconds we get 1 - 125/120 ≈ -0.04, so about 4% slower. We can also literally add text like “4% slower”, and I think we should do that with your formula :)
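As a sketch of what that text could look like, the snippet below computes 1 - new / old and turns it into a “% slower/faster” label. The function names are made up for illustration and are not taken from the actual report script.

```python
def relative_change(old_seconds, new_seconds):
    """Return 1 - new / old: positive means faster, negative means slower."""
    if old_seconds <= 0:
        raise ValueError("old_seconds must be positive")
    return 1.0 - new_seconds / old_seconds


def describe_change(old_seconds, new_seconds):
    """Format the relative change as human-readable text."""
    change = relative_change(old_seconds, new_seconds)
    if change == 0:
        return "no change"
    label = "faster" if change > 0 else "slower"
    return f"{abs(change) * 100.0:.1f}% {label}"


# The example from the post: old is 120 seconds, new is 125 seconds.
print(describe_change(120, 125))  # 4.2% slower
```

With one decimal place the 120 vs 125 second example prints as 4.2% slower, which is the “4% slower” above before rounding.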
It’s based on a threshold, so as of now, if one of the tests reports +10%, it means that its computation time has increased by 10% since the last time it ran, i.e. a slowdown. Our threshold is 5%, so the 10% in this example would fail the build.
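In code, that check might look roughly like the following. This is a guess at the logic, not the actual Jenkins job: it assumes the threshold is one-sided (only slowdowns fail the build), and the names are hypothetical.

```python
THRESHOLD = 0.05  # the 5% mentioned above


def time_increase(previous_seconds, current_seconds):
    """Relative increase in run time since the previous run (+0.10 means 10% slower)."""
    return current_seconds / previous_seconds - 1.0


def exceeds_threshold(previous_seconds, current_seconds, threshold=THRESHOLD):
    """True if the slowdown is larger than the allowed threshold."""
    # Assumption: speedups never fail the build, only slowdowns do.
    return time_increase(previous_seconds, current_seconds) > threshold
```

Note the sign convention here is the opposite of 1 - new / old: a positive number means slower.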
And you have defined the sign conventions correctly? I would easily mess this up given these definitions, as greater numbers in the reports are better, but you suggest that we stop things if numbers are largish.
We’re talking about a couple of different things here - Nic is talking about the build we run every time we merge to CmdStan master, which tests that results haven’t changed much. Sebastian is talking about the relative performance tests that run and comment on GitHub PRs. The results from that are not obviously interpretable in the current format, mostly because it’s not clear how they’re computed, but we can do the stuff above in Sebastian’s original post and use Steve’s 1 - new / old suggestion with the text “% slower/faster” or whatever next to it to try to help - and probably just print the formula for good measure, too. Then that will get included on the pull request and be easier to interpret for newcomers.
This custom build thing looks cool… I am trying it… let’s see where I get the results (and I am really curious to see the performance comparison on Windows for the TBB branch).
Hmm… I started a test build and got errors with respect to gold comparisons… are estimates compared to gold? Does that make sense?
stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse method=sample num_samples=1000 num_warmup=1000 data file=stat_comp_benchmarks/benchmarks/low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.data.R random seed=1234 output file=golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold.tmp
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param sigma.1 |0.953794217 - 1.127675695| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param sigma.2 |1.091161228 - 0.929372467| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param mu.1 |-0.756782903386 - 0.302696568708| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param mu.2 |0.21614638705 - -0.892994255085| not within 2e-08
FAIL: golds/stat_comp_benchmarks_benchmarks_low_dim_gauss_mix_collapse_low_dim_gauss_mix_collapse.gold param theta |0.4531778815 - 0.585157671| not within 2e-08
I would like to compare performance and not gold results.
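For context, the FAIL lines above read like an absolute-difference check of each parameter estimate against a stored gold value. A minimal sketch of that kind of check is below; it is inferred from the log format only, so the function name, the dictionary layout, and which of the two printed numbers is the gold value are all assumptions.

```python
TOLERANCE = 2e-08  # the tolerance printed in the log above


def compare_to_gold(gold, current, tolerance=TOLERANCE):
    """Return a failure message for every parameter that differs from gold."""
    failures = []
    for name, gold_value in gold.items():
        value = current[name]  # assumes both runs report the same parameters
        if abs(gold_value - value) > tolerance:
            failures.append(
                f"FAIL: param {name} |{gold_value} - {value}| not within {tolerance}"
            )
    return failures


# Two of the parameters from the log, treating the first number as gold.
gold = {"sigma.1": 0.953794217, "theta": 0.4531778815}
current = {"sigma.1": 1.127675695, "theta": 0.585157671}
for line in compare_to_gold(gold, current):
    print(line)
```

A check this tight can only pass when the sampler reproduces the gold run almost exactly, so any change that alters the draws would trip it regardless of performance.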
Love it! How long would it take to run these models like 6-10 times? I’m looking at the eight schools example here which is 8% faster on your PR (nice speedup :-P). I think the results would be a bit more consistent if we took the average run time from a few runs.
And yes, I do see we take the average below, but there are times where individual model results can be interesting. One example: I was changing some things in the memory model a while ago and noticed that some performance tests were faster while some were slower. It turns out the change was very good for models that used vectors, but not for ones that looped over an array.
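Something like the sketch below would cover both points: average each model's run time over several runs, then report old / new per model rather than only one overall number. The function names and the timings in the example are invented for illustration and are not real measurements.

```python
from statistics import mean


def mean_times(runs):
    """Average the repeated run times (in seconds) for each model."""
    return {model: mean(times) for model, times in runs.items()}


def per_model_ratio(old_runs, new_runs):
    """Return old / new mean run time per model, so higher is better."""
    old_means = mean_times(old_runs)
    new_means = mean_times(new_runs)
    return {
        model: old_means[model] / new_means[model]
        for model in old_means
        if model in new_means
    }


# Made-up timings, three runs per model on each branch.
old_runs = {"vector_model": [10.0, 10.4, 9.6], "array_loop_model": [5.0, 5.2, 4.8]}
new_runs = {"vector_model": [8.0, 8.2, 7.8], "array_loop_model": [5.5, 5.4, 5.6]}
for model, ratio in per_model_ratio(old_runs, new_runs).items():
    print(f"{model}: {ratio:.2f}")
```

Keeping the per-model ratios next to the overall average makes it easier to spot a change that helps one kind of model and hurts another.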