Being Proactive about Performance

What’s the best way to improve the performance of an application? In a word: prevention. In my opinion, the best way to achieve this is to proactively monitor and measure how the application is actually being used in live by customers. Most performance testing is a simplified representation of live behaviour: what happens in the testing phase can be very different from what happens in the real world. It’s the difference between theory and practice, and it is particularly applicable when new functionality is released to clients, because the estimates made during testing are subject to change.

So what needs to happen? I recommend having visibility on the application. If a client hasn’t got it, the first thing I recommend post performance testing is to increase visibility on the live system. This can be achieved by the following, in order of priority:

  1. Consolidation of log files:  This means piping all the log files into the same place so someone can see what is happening at a glance.  I worked on a client site where they had over 30 instances of the same service running on different machines, and they hadn’t thought to consolidate the log files into one place.  This meant that they weren’t aware of any issues until customers complained; they then had to log onto different machines (with different passwords for each instance), find the file locations and then look through the log files, usually well after the incident.  It was cumbersome, difficult and painstaking.  Compare that to another client site I worked on, CMC Markets.  They consolidated all log files into a single place that they could view at runtime, and this had an immediate impact.  Having the messages on one screen, filtered for warnings and errors, meant that they were proactive about spotting issues.  They could spot issues developing before clients started phoning in.  It also meant the development team were much more active in removing spurious log messages and correcting issues they would never otherwise have detected. The benefits go on… Consolidating all log files (log4j, SQL, Apache and so on) into a single place is one of the most powerful ways of gaining visibility and proactively preventing issues.  It also enables stakeholders to get a better idea of the underlying causes leading up to an event, and not just see the effects. Doing this is a no-brainer in terms of saving man effort; I can’t recommend it enough. There are plenty of free tools out there as well as commercial ones. See Splunk for an idea of the advantages that can be gained: it’s a powerful commercial tool, though I found their pricing structure to be unacceptable.
  2. Writing tools for reporting: A service is of much more use if you have visibility on how it is actually being used, so write tools that report statistics for important services and components. Example:  a previous company I worked for had a high number of caches on the middle-tier servers, but they had no idea how these were being used in live and no visibility on them.  I tasked a developer with writing a small piece of code to hook into the EHCache statistics using the JMX service.  This meant we could now report on the number of items, average time to live, average size of the elements and average number of elements (the caches were constrained by number of elements and not by size!).  This was loaded into a spreadsheet and we could see immediately which caches were oversized, undersized or underutilized.  Visibility gave understanding and enabled IT to make the system work more efficiently; the system also stopped running out of heap space and crashing.  Running without this kind of reporting is akin to running a computer without Task Manager.
  3. DB monitoring and continuous improvement:  Have someone always inspecting live DB performance, looking for slow-running SQL, stored procedures and other suspect events.  A database is an entire operating system in its own right.  Having someone look at the live DB and apply improvements continuously will reduce hardware and software overheads (unnecessary engineering) and also prevent resources being pulled in to resolve issues.  Good DB admins and techies are worth their weight in gold.
  4. Human nature: If tools are not easily accessible, usable or visible then people are much less likely to use them.  You can have the greatest functionality in the world, but if a tool is hard to engage with it is ineffective. Don’t underestimate the power of a good human-computer interface. I’ve seen tools with great functionality but appalling interfaces, which means they only get used as a last resort.
  5. Live monitors: CPU stats, memory stats and alerts, all on a consolidated real-time dashboard.  It’s much more effective to have these available and easily accessible than to have to use a clunky interface that is difficult to navigate, read and access.  This comes lower down in terms of priority because it looks at effects rather than underlying causes.
  6. Internal application monitoring:  There is a breed of tools that can be used in a live environment relatively unobtrusively. These give visibility on method calls, waits and threading issues, e.g. total time spent calling SQL, the most frequently used method calls and so on.  They tend to be expensive and complex to install, and have consultancy services sold in on the back of them.  For that reason this is last on my list.  DynaTrace is a good starting point if you wish to find out more about these types of tools.
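To make the first point concrete, here is a minimal sketch of the consolidation idea: merge log lines from several service instances into one chronological stream and surface only the warnings and errors. It assumes every line starts with a sortable timestamp (the file names and messages are invented for the demo; real setups would use a log shipper rather than reading files directly).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class LogConsolidator {

    // Merge lines from many services into one chronological stream and
    // keep only the warnings and errors. Assumes each line starts with a
    // sortable timestamp such as "2021-03-01 12:00:05 ERROR ...".
    public static List<String> filterAndMerge(List<String> allLines) {
        return allLines.stream()
                .sorted()
                .filter(l -> l.contains(" WARN ") || l.contains(" ERROR "))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Two hypothetical per-instance log files, written here for the demo.
        Path a = Files.createTempFile("svc-a", ".log");
        Path b = Files.createTempFile("svc-b", ".log");
        Files.write(a, List.of(
                "2021-03-01 12:00:01 INFO svc-a started",
                "2021-03-01 12:00:05 ERROR svc-a timeout calling db"));
        Files.write(b, List.of(
                "2021-03-01 12:00:03 WARN svc-b cache nearly full"));

        List<String> merged = new ArrayList<>(Files.readAllLines(a));
        merged.addAll(Files.readAllLines(b));
        filterAndMerge(merged).forEach(System.out::println);
    }
}
```

Even this toy version shows the payoff: the WARN from one instance and the ERROR from another appear on one screen, in order, without logging onto either machine.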
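The cache-reporting idea from the second point can be sketched with plain JMX: expose cache statistics as an MBean and read them back through the platform MBeanServer. The `CacheStats` interface, attribute names and figures here are invented for illustration; real EHCache publishes its own statistics MBeans with different names.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class CacheStatsReport {

    // Standard MBean convention: the interface must be named <Impl>MBean.
    public interface CacheStatsMBean {
        long getElementCount();
        long getMaxElements();
    }

    public static class CacheStats implements CacheStatsMBean {
        public long getElementCount() { return 9500; }  // demo values; a real
        public long getMaxElements()  { return 10000; } // impl reads the cache
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("demo:type=CacheStats,cache=quotes");
        server.registerMBean(new CacheStats(), name);

        // A reporting tool would loop over every cache ObjectName and dump
        // these figures to a spreadsheet or dashboard on a schedule.
        long count = (Long) server.getAttribute(name, "ElementCount");
        long max   = (Long) server.getAttribute(name, "MaxElements");
        System.out.printf("cache=quotes elements=%d/%d (%.0f%% full)%n",
                count, max, 100.0 * count / max);
    }
}
```

A report built from figures like these is exactly what made the oversized and undersized caches in the story above jump out at a glance.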
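If the commercial tools in the last point are out of budget, a crude home-grown version of one of their headline figures (total time spent calling SQL) is easy to sketch: wrap the call site and accumulate elapsed time. The names and the sleep stand-in below are illustrative only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicLong;

public class CallTimer {
    private final AtomicLong totalNanos = new AtomicLong();
    private final AtomicLong calls = new AtomicLong();

    // Run the work and record how long it took, even if it throws.
    public <T> T time(Callable<T> work) throws Exception {
        long start = System.nanoTime();
        try {
            return work.call();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            calls.incrementAndGet();
        }
    }

    public long callCount()  { return calls.get(); }
    public long totalMillis() { return totalNanos.get() / 1_000_000; }

    public static void main(String[] args) throws Exception {
        CallTimer sqlTimer = new CallTimer();
        for (int i = 0; i < 3; i++) {
            // Stand-in for a real SQL call.
            sqlTimer.time(() -> { Thread.sleep(10); return null; });
        }
        System.out.println("sql calls=" + sqlTimer.callCount()
                + " total=" + sqlTimer.totalMillis() + "ms");
    }
}
```

This gives nothing like the depth of a DynaTrace, but it answers the first question those tools answer ("where is the time going?") for near-zero cost.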

Think of a Formula 1 racing car: the engineers have monitors for absolutely everything.  They need this visibility to quickly diagnose the causes of issues and to improve performance. Application performance is no different.  I’ve been on client sites where issues have occurred in live and it has taken weeks to root out the cause because of insufficient detail and lack of visibility. This also causes considerable drag on key resources.
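For the live-monitor dashboards described above, the JDK already exposes the raw CPU and heap figures through standard MXBeans; a real dashboard would poll these on a schedule and alert on thresholds. This is a minimal sketch, and the 90% alert threshold is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.OperatingSystemMXBean;

public class LiveStats {

    // Fraction of the heap currently in use, guarding against an
    // undefined max (getMax() can return -1 on some configurations).
    public static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        System.out.println("load avg:  " + os.getSystemLoadAverage());
        System.out.printf("heap used: %.1f%%%n", 100 * heapUsedFraction());

        // Alerting hook: page someone before the heap runs out,
        // not after the service has fallen over.
        if (heapUsedFraction() > 0.9) {
            System.out.println("ALERT: heap above 90% - investigate now");
        }
    }
}
```

Polling these two numbers per service and painting them on one consolidated screen is the F1-pit-wall view the paragraph above describes.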

Key Takeaways:

  • Performance work doesn’t stop in the lab; continuous monitoring and improvement should be applied to the live system
  • Don’t underestimate human nature. Ease of use means tools will be used more frequently
  • Visibility is key. If you don’t have easy visibility on what is happening in the system, you can’t be proactive.  If you aren’t being proactive you aren’t being preventative, and prevention is a lot cheaper than cure.
  • Visibility is also key to diagnosing live issues. Live systems are usually quickly restarted in the event of an issue; lots of head scratching ensues and incorrect guesses can be made. Having disparate logging in a single place helps evidence theories and piece together the crime scene.

Follow these guidelines to decrease the number of live performance issues and save countless man hours.

See also:

Why performance can’t be guaranteed 

What to do when live performance issues occur 

One thought on “Being Proactive about Performance”

  1. Logs can be very useful in estimating the performance.

    It depends on the system whether logs can be used or not. For some systems, when they go live, logs must only be used at error level, because at debug level they would take up resources and affect the system’s performance.

    So there’s no single recipe for all systems, and a clear analysis should be made when planning the testing.

    Once the system goes live, testing costs might be very high. So eventually there could be a small period of time dedicated to testing the live system, just after launch.

    Testing on the live system may also be part of maintenance testing, which usually takes place after major upgrades or migrations.
