Recent IT disasters point to gaps in hardware testing and backup system testing, and to inadequate disaster recovery plans.
Although the summer often is the most profitable season for most airlines, this past summer wasn’t great for several of the largest carriers. In July, Southwest Airlines suffered a 12-hour IT outage, triggered by a router malfunction that quickly took down many of its core operating systems like a house of cards. Worse, backup systems failed to deploy. Flight delays and cancellations continued for a few days.
A similar disaster unfolded at Delta in August, when a power control module at the airline’s central data center malfunctioned, creating a surge to the transformer and loss of power. Delta’s systems didn’t switch over to backups as planned, causing scheduling mayhem as agents couldn’t access key systems. For both companies, the outages resulted in tens of millions of dollars in lost revenues.
Sure, both of these outages affected businesses running more traditional IT infrastructure, but even if your company is running in the cloud, your business isn’t immune from a damaging outage. In September, Microsoft Azure experienced two outages lasting several hours each, one in Europe and the other affecting U.S. customers across several key services, including SQL Database, Virtual Machines, Service Bus and Visual Studio Team Services.
As software developers and testers, we often focus narrowly on the application and the server it is running on, but there’s a lot more to consider. Hardware and components, backup systems and third-party providers all come into play when testing applications for stability and reliability. Of course, there’s no way to prevent all bad things from happening, but there are steps you can take to cover the bases more thoroughly.
Look Beyond the Software
If there were three main lessons learned from these recent IT disasters, they would be:
- Assess risk of the whole environment. The quality of your software is only as good as the components it runs on: the hardware, network, power, storage and third-party applications it requires. Understand the full picture and all of the dependencies. Categorize applications based upon business criticality and failure rate to focus attention on the pieces that really matter.
- Backup systems and processes are not “set and forget.” Spend more time documenting them, testing them and training staff. Make sure that the processes are up to date as IT adds new components and vendors to the system or if the company goes through a merger or acquisition.
- Invest according to your risk profile. If your company can’t operate for more than a few minutes without access to certain systems, plan on budgeting for warm or hot failover systems, which enable near 100 percent uptime with always-on, parallel systems that are continually synchronized and ready to go. It’s too expensive to enable this failover capability for your entire environment, so you will need to be picky. Some IT assets, such as an airline’s scheduling and reservation systems, deserve top priority and resources, while other assets, such as the HR system, can do with less. IT directors also need to regularly review whether old, unstable infrastructure technologies need upgrading. Most companies also don’t spend enough time monitoring and testing backup and recovery systems.
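As a rough illustration of the first and third lessons, the sketch below ranks a handful of applications by business criticality and failure rate, then assigns each a failover tier based on how long the business can tolerate it being down. The application names, scores, downtime limits and tier thresholds are all hypothetical; a real assessment would use your own inventory and numbers.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    criticality: int           # 1 (nice to have) to 5 (business stops without it)
    failures_per_year: float   # observed or estimated failure rate
    max_downtime_minutes: int  # how long the business can tolerate an outage


def risk_score(app: App) -> float:
    """Illustrative score: business criticality multiplied by failure rate."""
    return app.criticality * app.failures_per_year


def failover_tier(app: App) -> str:
    """Map tolerable downtime to a failover tier (thresholds are assumptions)."""
    if app.max_downtime_minutes <= 5:
        return "hot"
    if app.max_downtime_minutes <= 240:
        return "warm"
    return "cold"


# Hypothetical inventory, loosely modeled on an airline.
APPS = [
    App("hr-portal", criticality=2, failures_per_year=0.5, max_downtime_minutes=1440),
    App("reservations", criticality=5, failures_per_year=2.0, max_downtime_minutes=5),
    App("crew-scheduling", criticality=5, failures_per_year=1.0, max_downtime_minutes=60),
]

# Highest-risk applications first, so attention goes where it matters most.
ranked = sorted(APPS, key=risk_score, reverse=True)

for app in ranked:
    print(f"{app.name}: score={risk_score(app):.1f}, tier={failover_tier(app)}")
```

The point of separating the score from the tier is that the score says where to look first, while the tier says how much to spend on keeping a given system running.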
To minimize the risk of a major IT outage bringing down your business and alienating customers, the recipe is as follows: Analyze the points of failure, determine what is reparable and what is not easily fixed (for example, installing software usually can be done much more quickly than building a new server environment) and, finally, install and test failover systems and processes that can maintain basic operations with minimal customer disruption.
Delta & Southwest: A Quick Analysis
In both cases, Delta and Southwest had backup systems that did not deploy properly, or at least not in a timely fashion. The first step is to conduct thorough, regular testing of those backup systems and locations, upgrading capabilities if necessary. Second, since no disaster recovery system is foolproof, create manual workarounds for critical system outages. Such workarounds would have helped tremendously in keeping customers from being completely stranded. Third, create a fully redundant second data center that switches over automatically when the primary data center goes down. Naturally, this is the most expensive option, but depending upon the outcome of the risk analysis, it may be worth the money.
With Southwest, a router failure caused a cascading effect that brought down the network altogether. It’s not clear if a secondary network was available for failover, although it appears the answer is no. Again, better, more frequent monitoring could have isolated the router problem and alerted IT to fix or replace the hardware before disaster struck.
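The kind of monitoring described above can be sketched as a simple poller that counts consecutive failed health checks, raises an alert on each one and switches to a secondary path once a threshold is reached. This is a minimal sketch, not a description of either airline’s setup: the `FailoverMonitor` class, the threshold of three and the simulated check results are all assumptions for illustration.

```python
from typing import Callable, List


class FailoverMonitor:
    """Tracks consecutive health-check failures and fails over at a threshold."""

    def __init__(self, check_primary: Callable[[], bool], failure_threshold: int = 3):
        self.check_primary = check_primary
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = "primary"
        self.alerts: List[str] = []

    def poll(self) -> str:
        """Run one health check and return the path that is currently active."""
        if self.check_primary():
            # A healthy check resets the failure counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            self.alerts.append(
                f"primary check failed ({self.consecutive_failures} in a row)"
            )
            if (
                self.consecutive_failures >= self.failure_threshold
                and self.active == "primary"
            ):
                self.active = "secondary"
                self.alerts.append("failing over to secondary")
        # Deliberately no automatic failback: returning to the primary
        # is left as a manual decision in this sketch.
        return self.active


# Simulated health-check results: healthy, three failures, then healthy again.
results = iter([True, False, False, False, True])
monitor = FailoverMonitor(check_primary=lambda: next(results))

states = [monitor.poll() for _ in range(5)]
print(states)
print(monitor.alerts)
```

Requiring several failures in a row avoids failing over on a single dropped check, while the early alerts give IT a chance to inspect the hardware before the threshold is hit.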
Any company with technology from many different vendors should work closely with vendors and partners whenever possible to mitigate possible risks. Vendors often are able to provide insight into how they test their own systems and can even share test cases to assist in validating the end-to-end system. Keep in mind that aging technology is not the only problem; sometimes new technology brings untenable risks. A recent example is the Samsung Galaxy Note 7 and its overheating battery problem and recall. It’s also smart to understand how the local utility handles its outages and what, if any, help it provides to customers in the event of service disruption.
Large companies with complex IT infrastructures should take a holistic view of critical system availability, considering all of the possible ripple effects from a single point of failure. Today’s systems and networks are increasingly interconnected and interdependent. For both airlines, the effect on customers and employees lasted far beyond the actual outage, as managers struggled to get agents and planes reorganized and rescheduled. Take that risk into any spending calculations for the disaster recovery plan. Companies that have gone through mergers and acquisitions have additional complexities from IT integrations—or lack thereof. Failover strategies should ideally be incorporated upfront as part of any acquisition plans.
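One way to reason about ripple effects is to write down which systems depend on which components and then walk that map outward from a failed component. The sketch below does this with a breadth-first search; the dependency map and component names are invented for illustration and do not reflect any real airline’s architecture.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical map: each system and the components it depends on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "check-in": ["reservations"],
    "reservations": ["database", "core-router"],
    "crew-scheduling": ["database"],
    "database": ["core-router", "power-module"],
    "hr-portal": ["hr-db"],
}


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map: component -> systems that depend on it directly.
    dependents: Dict[str, List[str]] = {}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    impacted: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in impacted:
                impacted.add(system)
                queue.append(system)
    return impacted


router_impact = impacted_by("core-router", DEPENDS_ON)
hr_db_impact = impacted_by("hr-db", DEPENDS_ON)

print(sorted(router_impact))
print(sorted(hr_db_impact))
```

In this made-up map, a single router failure reaches four systems while the HR database failure reaches one, which is the kind of difference that should drive where failover money is spent.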
Comprehensive testing strategies are paramount, although sometimes overlooked, in preventing outages. Request information from core vendors, such as the networking equipment provider, on testing processes and ask to see their test cases. Though the vendor should be responsible for ensuring thorough testing of its own products, the customer is always the one to suffer the costs of a downed system. Development and testing teams should be clear on how external systems and providers handle downtime and equipment failure. Strive to align the company’s disaster recovery processes with those of partners.
Finally, for companies running mission-critical systems in the cloud, there are ways to mitigate the effects of an outage. Multi-cloud management tools such as CliQr (now owned by Cisco) allow IT to quickly redeploy workloads from one cloud service to another whenever needed. Containers are also helping in this regard due to the inherent portability benefits of the technology. Companies are using containers to move applications and workloads from region to region within one cloud provider; from cloud to cloud, such as from AWS to Azure; and even from the cloud to an on-premises data center.
There are many causes of IT outages today, and as IT environments become more multifaceted, supporting more technologies, data and applications all the time, those risks grow. We don’t have an answer to the problem, but there is one strategy: leave no stone unturned. Involving testers and developers in outage prevention and disaster recovery planning is a wise move, indeed.
About the Author / Kevin Dunne
Kevin Dunne is vice president of strategy and business development at QASymphony. Prior to joining QASymphony, Dunne was a business technology analyst at Deloitte. He can be found online at Twitter and LinkedIn.