This post is a peek behind the curtain of the next major update to Smoketest which I hope to have completed shortly: Performance Case visualisations.
Smoketest has always had two types of test case that you could implement by deriving from two distinct base classes:
TTestCase is the base class for correctness testing.
TPerformanceCase is the base class for performance testing.
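For illustration, a minimal sketch of a project using both kinds of case might look something like this (the unit referenced in the uses clause and the class and method names are assumptions for the example, not taken from the actual Smoketest API):

unit StringCases;

{ Minimal sketch only: the framework unit name in the uses clause and the
  class/method names are assumptions for this example. }

interface

uses
  Deltics.Smoketest;   // assumed framework unit name

type
  // Correctness tests derive from TTestCase...
  TStringTests = class(TTestCase)
    procedure UpperCaseOfEmptyStringIsEmpty;
  end;

  // ...performance tests derive from TPerformanceCase
  TStringPerformance = class(TPerformanceCase)
    procedure ANSIConcatenation;
    procedure WIDEConcatenation;
  end;

implementation

procedure TStringTests.UpperCaseOfEmptyStringIsEmpty;
begin
  // a correctness assertion would go here
end;

procedure TStringPerformance.ANSIConcatenation;
begin
  // ANSI string code to be timed would go here
end;

procedure TStringPerformance.WIDEConcatenation;
begin
  // WIDE string code to be timed would go here
end;

end.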
The performance cases were what I used some years ago to compare the string handling performance of the RTL in different versions of Delphi around the time that the switch was made to Unicode.
I noted at the time that I was compiling the results manually to obtain the comparison data, but the raw data for those comparisons was coming from a Smoketest project that I put together to perform the actual tests.
I had always intended to implement improved data capture and visualisations, and I am now finally getting around to it.
It is still very much a work in progress at this stage, but data capture and comparison are implemented for running the same test project compiled with different Delphi versions and comparing the results in the Smoketest GUI itself:
You can’t run a set of tests compiled with all the different Delphi versions from within the same EXE, of course, so there are some mechanics behind this which will need to be explained later on.
There is also the facility to compare results between two different test cases in the same project:
As may be more noticeable in this example, all test results are currently “normalised” against the worst result in each case. That is, the fact that the bars for the first three WIDE results (blue) in the above test are exactly the same length as each other, and as the green ANSI bars in the remaining five tests, does not mean that the performance is equal.
What it does tell us is that in the first three cases the WIDE test yielded the worst result but in the remaining cases it was the ANSI test that did least well.
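To put made-up numbers on that scaling (these are purely illustrative, not measurements from the charts above): if the WIDE result for a data point is the slowest, it is drawn as the 100% bar and the ANSI result is drawn relative to it.

program NormalisationExample;
{$APPTYPE CONSOLE}
// Made-up timings, purely to illustrate the scaling: the slowest result in
// each data point is always drawn as the 100% bar.
const
  WIDE_MS = 120.0;   // slowest result for this data point
  ANSI_MS = 45.0;
begin
  WriteLn('WIDE bar: ', (WIDE_MS / WIDE_MS) * 100:0:1, '%');   // 100.0%
  WriteLn('ANSI bar: ', (ANSI_MS / WIDE_MS) * 100:0:1, '%');   // 37.5%
end.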
I also have some other comparisons in mind but these may come later. It’s still early days yet and I shall blog about how performance cases and visualisations work in Smoketest in more detail when this work is nearer completion.
Some silly questions:
1. In the upper pic you have 5 bars, but the legend only mentions 4 Delphi versions. Is the 5th bar for the next, future version? 😉
2. You call your pics “Performance Data Visualisation”. Do you really mean performance, or is it the elapsed times that are visualised? I don’t really know which I should assume…
You saw the part where I noted that this was early days of a work in progress? 😉
The legend drawing code is just enough to get something visible, and there is a lot still to do there in particular. For example, it currently assumes 2 columns of labels rather than calculating the number that will fit. The legend itself is actually drawn correctly; the current problem is a simple one in the calculation of the height of the legend container panel, which I have decided not to worry about until I have completed the column calculations.
I also need to separate those calculations from the drawing code, for efficiency. Currently everything is calculated as it is drawn, just to get something drawn. 🙂
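By way of illustration only (none of this is the actual Smoketest drawing code, and the record, function name and padding values are invented), the kind of up-front layout calculation described above might look like this:

unit LegendLayoutSketch;

{ Illustration only - not the actual Smoketest drawing code. }

interface

uses
  Math, Graphics;

type
  TLegendLayout = record
    Columns: Integer;      // number of label columns that fit
    RowHeight: Integer;    // height of one row of labels
    Height: Integer;       // total height needed by the legend container panel
  end;

function CalcLegendLayout(aCanvas: TCanvas;
                          const aLabels: array of string;
                          aAvailableWidth: Integer): TLegendLayout;

implementation

function CalcLegendLayout(aCanvas: TCanvas;
                          const aLabels: array of string;
                          aAvailableWidth: Integer): TLegendLayout;
var
  i, widest: Integer;
begin
  // Measure the widest label so the number of columns can be calculated,
  // rather than assuming a fixed 2 columns
  widest := 0;
  for i := 0 to High(aLabels) do
    widest := Max(widest, aCanvas.TextWidth(aLabels[i]));

  result.Columns   := Max(1, aAvailableWidth div (widest + 24));  // 24px swatch + padding (arbitrary)
  result.RowHeight := aCanvas.TextHeight('Wg') + 4;
  result.Height    := result.RowHeight *
                      Ceil(Length(aLabels) / result.Columns);
end;

end.

The drawing code would then simply consume the pre-calculated layout, and the container panel could be sized from Height before any painting starts.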
I’m not sure what your second point is.
If you mean that there is no time scale, that is deliberate and explained.
The worst-performing result in each data set is the 100% bar; all other results are shown relative to that. Currently. 😉
If absolute measures prove useful then some alternate presentations of the data could be implemented. But most often performance comparisons are relative by nature, answering the question “Does code X perform better or worse than code Y?”
Using an absolute time scale would also potentially pose problems when comparing methods in a case with significantly different performance characteristics. If one method is inherently orders of magnitude slower than the others, then using an absolute scale consistently for all methods in the case would make the relative differences between the faster methods much harder to discern, due to the demands of the longer scale arising from the slowest method.
Avoiding that would require either much greater sophistication in the presentation of the results, or test writers to specifically organise their tests into cases comprising only similarly performing methods, which is an arbitrary system of organisation that will not always fit very well with testing intentions.
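To put some entirely made-up numbers on that problem:

program AbsoluteScaleProblem;
{$APPTYPE CONSOLE}
// Made-up timings: with one inherently slow method dictating an absolute
// axis, the difference between the two faster methods all but vanishes.
const
  SLOW_MS   = 2000.0;
  FAST_A_MS = 18.0;
  FAST_B_MS = 12.0;
begin
  WriteLn('Fast A bar: ', FAST_A_MS / SLOW_MS * 100:0:2, '% of the axis');   // 0.90%
  WriteLn('Fast B bar: ', FAST_B_MS / SLOW_MS * 100:0:2, '% of the axis');   // 0.60%
  // A 50% relative difference (18ms vs 12ms) occupies less than 1% of the chart
end.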
Interesting.
Rather than comparing Delphi versions, is there a way to compare against (automatically gathered) historic results and report only the significant variations? Mostly interested in spotting performance regressions or improvements.
Not as yet, but identifying performance regressions and treating them as a “test failure” is an objective in this area.
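A sketch of one possible shape for that (nothing like this exists in Smoketest yet; the function, the default tolerance and the numbers are all assumptions):

program RegressionCheckSketch;
{$APPTYPE CONSOLE}

// Sketch only: treat a run as a regression if it is more than a given
// percentage slower than a stored historic baseline
function IsRegression(aCurrentMs, aBaselineMs: Double;
                      aTolerancePercent: Double = 10.0): Boolean;
begin
  result := aCurrentMs > aBaselineMs * (1 + aTolerancePercent / 100);
end;

begin
  WriteLn(IsRegression(132, 115));   // TRUE  - approx. 15% slower than baseline
  WriteLn(IsRegression(118, 115));   // FALSE - within the 10% tolerance
end.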
I really like the idea of performance-testing unit tests! Please post again when you have more, I’ll be interested to read it.
Re result normalization: I understand the reasons, but I think it’s a bit confusing. It would be nice to see the timescales, at least, and when tests are comparable show similar timescales (i.e. if everything executes within 5-10 seconds, have each graph use a ten-second scale). It would be good to have the graph scales noted so we can see the actual numbers too.
Comparing to historical performance sounds like a great feature! I’m very, very interested in this. Here’s a use-case I have in mind:
– I have a test suite running a test
– For each test, I want to run a few variations with, say, different $defines or different units included in the project, to compare these against each other
– For each test I’d like to see historical results, to know if overall I’m getting faster or slower, if I’ve improved performance 10x since last week, etc.
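By way of illustration of the $defines idea above (the define name, the class/method and the iteration count are all invented for this sketch), one performance method compiled in two flavours might look like this:

// Compiling the project once with and once without USE_STRINGBUILDER defined
// gives two sets of results for the same method to compare.
// (TStringBuilder lives in SysUtils in Delphi 2009 or later.)
procedure TConcatenationPerformance.BuildLargeString;
var
  i: Integer;
  s: string;
{$IFDEF USE_STRINGBUILDER}
  sb: TStringBuilder;
{$ENDIF}
begin
{$IFDEF USE_STRINGBUILDER}
  sb := TStringBuilder.Create;
  try
    for i := 1 to 100000 do
      sb.Append('x');
    s := sb.ToString;
  finally
    sb.Free;
  end;
{$ELSE}
  s := '';
  for i := 1 to 100000 do
    s := s + 'x';
{$ENDIF}
end;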
Thanks for the input, David. Historical performance testing is the trickiest to pin down in terms of what is going to be broadly useful, so the additional use case you have provided is very helpful to put alongside my own usage. Thanks! 🙂