Performance Benchmark Testing Part II

Performance Benchmarking

My previous article details an approach for creating Performance Benchmarks. I will build on the foundations laid out there and describe how a powerful process was created at a customer site using these principles.

Benchmarking Problem:

I’m going to describe how I took a benchmarking process that took 2+ days, involved lots of manual intervention and was prone to error, and created a solution that ran unaided overnight and allowed problem areas within 100’s of transactions to be analyzed and identified at a glance. The benefits were: increased productivity; increased utilization of valuable time on a scarce performance test environment; decreased cycle time between benchmark runs.

Let’s say you have created a consistent set of isolated benchmark tests. There are 15+ different performance benchmark tests and within them over 100+ individual transaction types being reported against. For example: a search function might be executed from a UK, American and Asian region (so the main transaction consists of 3 transactions of the same type, but from different regions). It’s useful to compare not just individual transaction response times, but also average CPU usage across the entire n-tier architecture. Where previously it was easy to compare by quickly looking at a few graphs, it becomes impractical with this volume of data, particularly if benchmark tests are being executed several times a week. This is the situation I faced at a previous company. We had created a consistent set of benchmark tests producing a lot of consistent numbers – and I then found some of the performance engineers spending the best part of a day attempting to compare and resolve differences.

Comparing 100’s of graphs simply wasn’t going to work; we had to use the summary numbers to tell us which graphs to investigate. I sat back and thought – and the answer was very simple. Averages alone weren’t enough, and min and max values were pointless. The answer lay in the averages, 90th percentiles and transaction rates. A collated spreadsheet was put together: the transaction rate, average transaction response time and 90th percentile for the previous benchmark in the first group of three columns, the same three values for the current benchmark in the second group, and a third group containing the differences – the deltas for transaction rates, averages and percentiles – along with the percentage deltas.
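To make the collation step concrete, here’s a minimal sketch in Python of how two summary exports could be merged into that comparison table. The CSV layout and column names (transaction, tps, avg_rt, pct90_rt) are illustrative assumptions – in practice this was a spreadsheet, not this exact script.

```python
# Sketch: collate two benchmark summaries into a single comparison table.
# Assumes each CSV has columns: transaction, tps, avg_rt, pct90_rt (hypothetical layout).
import pandas as pd

prev = pd.read_csv("benchmark_prev.csv")
curr = pd.read_csv("benchmark_curr.csv")

cmp = prev.merge(curr, on="transaction", suffixes=("_prev", "_curr"))

for metric in ("tps", "avg_rt", "pct90_rt"):
    # Absolute delta between the two runs.
    cmp[f"{metric}_delta"] = cmp[f"{metric}_curr"] - cmp[f"{metric}_prev"]
    # Percentage delta relative to the previous run.
    cmp[f"{metric}_delta_pct"] = 100 * cmp[f"{metric}_delta"] / cmp[f"{metric}_prev"]

cmp.to_csv("benchmark_comparison.csv", index=False)
```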

So, a few points:

Performance Deltas

Why deltas as both percentages and actual differences? Because the transaction response times were extremely small (0.15 seconds in some cases), so a small change in the transaction response time sometimes meant a very large change in the percentage. If 0.15 seconds rose to 0.30 seconds this would show as a 100% increase in response time, but would actually be imperceptible from an end-user perspective. The business could then easily sign these results off as acceptable.
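As a sketch of that reasoning, a change could be flagged only when it is large in both relative and absolute terms – the 25% and 0.2-second thresholds below are purely illustrative, not our actual sign-off criteria:

```python
# Sketch: flag a response-time change only when it is both relatively large
# and big enough in absolute terms to matter to an end user.
# Thresholds are illustrative assumptions.
def needs_review(prev_rt: float, curr_rt: float,
                 pct_threshold: float = 25.0,
                 abs_threshold: float = 0.2) -> bool:
    delta = curr_rt - prev_rt
    delta_pct = 100 * delta / prev_rt if prev_rt else float("inf")
    return abs(delta_pct) >= pct_threshold and abs(delta) >= abs_threshold

# 0.15 s -> 0.30 s is a 100% increase but only 0.15 s in absolute terms: not flagged.
print(needs_review(0.15, 0.30))   # False
# 2.0 s -> 3.0 s is a 50% and 1.0 s increase: flagged.
print(needs_review(2.0, 3.0))     # True
```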

Transactions Per Second (TPS)

A high number of transactions improves the statistical confidence in the reported results. It also allows verification checks to ensure the different benchmarks are ‘in line’ with each other. If these averages deviate too much, I know that particular benchmark test will require inspection – a lot of the transactions are either failing or something else is amiss. In short, this number gives a measure of confidence in the quality of the test – and helps answer “Is there a problem with the Performance Test or the System Under Test?”
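A rough sketch of that ‘in line’ check, with an assumed 10% tolerance:

```python
# Sketch: sanity-check that a transaction's throughput is 'in line' with the
# previous run before trusting its response-time numbers.
# The 10% tolerance is an assumption for illustration.
def tps_in_line(prev_tps: float, curr_tps: float, tolerance_pct: float = 10.0) -> bool:
    if prev_tps == 0:
        return curr_tps == 0
    return abs(curr_tps - prev_tps) / prev_tps * 100 <= tolerance_pct

# A big drop in transaction rate usually points at failures or a broken test,
# not a genuine change in the system under test.
print(tps_in_line(12.4, 12.1))  # True  -> response times are comparable
print(tps_in_line(12.4, 6.0))   # False -> inspect the test before the system
```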

Averages and Nth Percentiles:

How do you reflect the story of a ‘flat line’ within a graph in a single number? An average simply cannot do this – there are many different graph lines that will conspire to give the same average. However, taken together with the 90th percentile it gives a high level of confidence that the graphs for two runs look identical. The percentile will also smooth out any ‘blips’ such as the occasional CPU spike on CPU graphs (too many spikes and it won’t). I found that if both of these numbers deviated radically from a previous run’s numbers then this was an area that needed immediate investigation. If only one of these numbers differed then there ‘probably’ wasn’t an issue with the system (we would still look). More often than not, a change in just one of the values meant that the static test data we were using needed reconfiguring. Now, there are more sophisticated mathematical formulas we could have used to detect deviation from correlated patterns – but that would have taken effort, time and outside processing. It was just much easier to leverage the values that were immediately available and easily understood by other performance engineers and stakeholders.
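The triage rule I’ve just described can be sketched in a few lines – the 20% threshold here is an illustrative assumption, not the figure we actually used:

```python
# Sketch: the informal triage rule described above, expressed as code.
# The 'deviates radically' threshold is an illustrative assumption.
def triage(avg_delta_pct: float, pct90_delta_pct: float, threshold: float = 20.0) -> str:
    avg_moved = abs(avg_delta_pct) >= threshold
    p90_moved = abs(pct90_delta_pct) >= threshold
    if avg_moved and p90_moved:
        return "investigate immediately: the shape of the graph has probably changed"
    if avg_moved or p90_moved:
        return "probably not a system issue: check the static test data first"
    return "in line with the previous run"

print(triage(35.0, 40.0))  # both moved -> investigate
print(triage(30.0, 5.0))   # one moved  -> check test data
print(triage(3.0, 4.0))    # neither    -> in line
```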

Running a series of Performance Benchmarks back to back:

Running 15+ isolated benchmark tests manually simply wasn’t practical, i.e. start a benchmark test, run test 1 for 40 mins, run off the stats, then start test 2 for 40 mins, stop, run off the stats… 15 times over. We needed to combine them into a series of sequential runs executed back to back (i.e. in a single test) and then output the results of each individual run as a separate set of statistics, ready for feeding into the main summary spreadsheet. The warm-up times then needed to be ‘ironed out’ of the statistics for each of the tests within the run. A piece of code was written to interrogate the LoadRunner database after a run and extract the appropriate statistics. This worked fantastically well – all the benchmark tests could be combined into a single test and run back to back overnight (as opposed to someone stopping and starting all 15 of them during the day).
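I can’t reproduce the exact extraction code here, but the idea looks roughly like the sketch below: trim a warm-up window from each test within the overnight run and summarize what’s left. It assumes the raw samples have already been exported to a CSV with columns test_name, elapsed_s, transaction and response_time – that export, the column names and the 5-minute warm-up are all assumptions; the original code queried the LoadRunner database directly.

```python
# Sketch: strip the warm-up period from each test in a back-to-back overnight run
# and emit per-test, per-transaction summary statistics.
# Assumes the raw samples were exported to a CSV (hypothetical layout):
#   test_name, elapsed_s, transaction, response_time
import pandas as pd

WARM_UP_SECONDS = 300  # illustrative warm-up window per test

samples = pd.read_csv("overnight_run_samples.csv")

# Re-base elapsed time within each test so the warm-up cut is per test, not per run.
samples["t_in_test"] = samples["elapsed_s"] - samples.groupby("test_name")["elapsed_s"].transform("min")
steady = samples[samples["t_in_test"] >= WARM_UP_SECONDS]

# Count, average and 90th percentile per transaction, per test.
summary = (
    steady.groupby(["test_name", "transaction"])["response_time"]
    .agg(count="count", avg="mean", pct90=lambda s: s.quantile(0.9))
    .reset_index()
)
summary.to_csv("per_test_summary.csv", index=False)
```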

Performance Transaction Benchmark Comparison:

Excel has a very nice ‘heat map’ feature – this allows your eye to be naturally drawn to the areas that generated the largest deltas. I personally found the spreadsheet beautiful – instead of having to compare 100’s of graphs, it allowed anyone to instantly home in on any potential problem areas with very little analysis. It was very simple but powerful. I think any performance engineer looking at it would think “that’s an obvious and natural way of comparing benchmarks”. But it wasn’t. I’m sure there are plenty of performance engineers who would find this a powerful and useful feature to have in any performance tool. A lot of the tools I’ve used allow you to ‘overlay graphs’ and ‘compare results’… but none that I know of provide a way to say ‘compare this run to that one and tell me which graphs I need to investigate further’.
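We simply used Excel’s built-in conditional formatting, but the same heat-map effect can be scripted – here’s a sketch using openpyxl, where the workbook name, sheet layout and column letters are assumptions for illustration:

```python
# Sketch: apply a colour-scale conditional format to the percentage-delta columns
# so the biggest deltas stand out, mimicking the Excel 'heat map'.
from openpyxl import load_workbook
from openpyxl.formatting.rule import ColorScaleRule

wb = load_workbook("benchmark_comparison.xlsx")  # hypothetical workbook
ws = wb.active

heat = ColorScaleRule(
    start_type="min", start_color="FF63BE7B",   # green: got faster
    mid_type="num", mid_value=0, mid_color="FFFFFFFF",
    end_type="max", end_color="FFF8696B",       # red: got slower
)

# Apply to the hypothetical percentage-delta columns (J:L) covering 100+ transaction rows.
for col_range in ("J2:J150", "K2:K150", "L2:L150"):
    ws.conditional_formatting.add(col_range, heat)

wb.save("benchmark_comparison_heatmap.xlsx")
```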

[Screenshot: CPU performance benchmark comparison]

Standard Deviation / Min / Max

I found min and max to be simply useless. Standard deviation was not immediately as useful as the other values and cluttered the spreadsheet (so it was hidden). It’s all about keeping it clean, uncluttered and easy to understand.

The whole approach was a success:

  • Performance engineers could spend time investigating issues rather than mindlessly running tests and producing test results.
  • Productivity increased, and as a result so did job satisfaction.
  • What previously took the best part of 2+ man-days could be left to run overnight and analyzed within an hour.
  • Valuable performance hardware was freed up during the day and, more importantly, during working hours.
  • Performance test cycle times were radically reduced – great for the ever-increasing drop cycles in an Agile environment.
  • The solution is scalable – it’s easy to go from 100+ transactions to 1000+ transactions.
  • The solution lends itself well not only to CPU and transaction response times, but also to any other ‘flat line’ measures that you wish to compare against.

My favorite part was that we had created a process which increased productivity in the best possible way: we could genuinely do more with a lot less manual effort. How many new processes genuinely give you that?


