Guy Warren, CEO at ITRS.
Technological glitches, especially in banking applications, can cause major reputational damage. TSB learnt this the hard way when frustrated customers, unable to access their bank accounts between April 21st and 22nd, took to Twitter to criticise the company.
What started as a botched IT migration spiralled into chaos, with up to 1.9 million customers encountering login problems. Some customers were also able to view information about other customers. The FCA and the Information Commissioner’s Office are investigating the issue and have the authority to fine the bank for the failed system upgrade and alleged data breach. The reputational damage to TSB will likely be even greater.
At least superficially, what customers demand is simple. They want 24/7 access to banking applications, without delay or error. However, from the banks’ perspective, achieving these high levels of application availability and production stability in today’s highly complex application landscape is an increasingly tough challenge. This is the stuff of nightmares for banking CIOs, who stay up at night fearing a production meltdown and crowds of angry customers.
No organisation, however large or small, is immune to production issues. Even a large, experienced organisation that has been running its IT environment for many years can have serious issues, as demonstrated by the TSB incident. The risk is growing, because new platforms and system improvements are being put live at an ever-increasing rate.
More than ever, IT strategy needs to be aligned with providing maximum availability for users. There are four distinct disciplines needed to achieve Production Stability:
- Testing – It may be obvious, but the quality of the testing of a change directly impacts production stability. The testing must cover not only the new functionality that is being put live, but also the existing functionality which is expected to be unchanged (regression testing). Where the software is going to run on a variety of different computers, the testing must cover all versions and all platforms, browsers and operating systems. The testing must also cover so-called ‘non-functional’ aspects: the performance and stability of the software under load, ensuring that the change hasn’t degraded performance or introduced memory leaks or other issues. It should also confirm that the software is supportable, with instrumentation to support diagnostics, and that proper monitoring is possible (see Proactive Monitoring & Alerting below).
- Resilient Architecture – In the modern world, we have to expect that software or hardware will fail, and applications need to be designed to cater for this. Either manually or automatically, the application needs to switch to a new platform to continue delivering the service. The more critical the application, the more automated the failover needs to be. Single points of failure (items which, if they fail, can’t be replaced quickly or easily) need to be understood and monitored very carefully.
- Change Management – This is a skill and competency all of its own. Detailed ‘roll in’ and ‘roll out’ plans, practice and timings, ‘go’ and ‘no go’ decision points, and post-implementation testing and verification are all key to solid change management in the enterprise. Inevitably, the more changes you attempt to tackle in your change windows, the higher the risk of a change not working.
- Proactive Monitoring & Alerting – In a large IT infrastructure with thousands or tens of thousands of computers, it is impossible for humans to watch all the critical metrics on all the boxes. You require monitoring tools that collect the data, check for abnormal or unacceptable conditions, and send alerts or take action directly to rectify the condition. If you require very high availability, then the data needs to be collected in real time and the rules engine needs to evaluate alert conditions continuously, as is the case with ITRS Geneos. Most tools average or aggregate the data, and only run the rules on an interval basis of 1 minute or 5 minutes.
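To make the last point concrete, here is a minimal sketch of why interval averaging can hide a problem that continuous evaluation would catch. It does not represent how ITRS Geneos or any other monitoring product works internally. The threshold, the function names and the sample data are all hypothetical, chosen purely for illustration.

```python
from statistics import mean

THRESHOLD = 90.0  # hypothetical alert threshold, e.g. CPU utilisation in percent


def alerts_on_interval_average(samples, interval):
    """Evaluate the rule once per interval, against the interval's mean."""
    alerts = []
    for start in range(0, len(samples), interval):
        window = samples[start:start + interval]
        if mean(window) > THRESHOLD:
            alerts.append(start)  # index where the breaching interval begins
    return alerts


def alerts_on_every_sample(samples):
    """Evaluate the rule against each individual sample as it arrives."""
    return [i for i, value in enumerate(samples) if value > THRESHOLD]


# Illustrative data: sixty one-second samples, steady at 40%,
# with a five-second spike to 100% in the middle.
samples = [40.0] * 60
samples[30:35] = [100.0] * 5

averaged = alerts_on_interval_average(samples, interval=60)
continuous = alerts_on_every_sample(samples)

print(averaged)    # → []  (the one-minute mean is 45%, so the spike is missed)
print(continuous)  # → [30, 31, 32, 33, 34]
```

In this made-up example the one-minute average never crosses the threshold, so a rule run on the aggregate raises nothing, while a rule run on every sample flags all five seconds of the spike.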
High availability is not easy to achieve. But for traditional banks to remain competitive against a growing field of fintech challengers, they will need to improve continuously in all of these disciplines, whilst remaining responsive to the changes that the business needs.