DevOps.com


Is Your Company’s IT a Disaster Waiting to Happen?

By: contributor on September 29, 2016

Recent IT disasters suggest gaps in hardware testing, backup system testing and inadequate disaster recovery plans.


Although summer is often the most profitable season for airlines, this past summer wasn’t great for several of the largest carriers. In July, Southwest Airlines suffered a 12-hour IT outage, triggered by a router malfunction that quickly took down many of its core operating systems like a house of cards. Worse, backup systems failed to deploy. Flight delays and cancellations continued for several days.


A similar disaster unfolded at Delta in August, when a power control module at the airline’s central data center malfunctioned, creating a surge to the transformer and loss of power. Delta’s systems didn’t switch over to backups as planned, causing scheduling mayhem as agents couldn’t access key systems. For both companies, the outages resulted in tens of millions of dollars in lost revenues.

Sure, both of these outages affected businesses running more traditional IT infrastructure, but even if your company runs in the cloud, it isn’t immune to a damaging outage. In September, Microsoft Azure experienced two outages lasting several hours each, one in Europe and the other affecting U.S. customers across several key services, including SQL Database, Virtual Machines, Service Bus and Visual Studio Team Services.

As software developers and testers, we often focus narrowly on the application and the server it is running on—but there’s a lot more to consider. Hardware and components, backup systems and third-party providers all come into play when testing applications for stability and reliability. Of course, there’s no way to prevent all bad things from happening, but there are some steps you can take to cover the bases more thoroughly.

Look Beyond the Software

If there were three main lessons learned from these recent IT disasters, they would be:

  1. Assess the risk of the whole environment. The quality of your software is only as good as the components it depends on: the hardware, network, power, storage and third-party applications that are often required. Understand the full picture and all of the dependencies. Categorize applications based upon business criticality and failure rate to focus attention on the pieces that really matter.
  2. Backup systems and processes are not “set and forget.” Spend more time documenting them, testing them and training staff. Make sure the processes stay up to date as IT adds new components and vendors to the system, or if the company goes through a merger or acquisition.
  3. Invest according to your risk profile. If your company can’t operate for more than a few minutes without access to certain systems, budget for warm or hot failover systems that enable near-100 percent uptime with always-on, parallel systems continually synchronized and ready to go. It’s too expensive to enable this failover capability for your entire environment—you will need to be picky. Some IT assets, such as an airline’s scheduling and reservation systems, deserve top priority and resources, while others, such as the HR system, can do with less. IT directors also need to regularly reassess whether old, unstable infrastructure technologies should be upgraded. Most companies also don’t spend enough time monitoring and testing backup and recovery systems.
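The prioritization in the third point can be sketched as a simple tiering rule. This is a hypothetical model—the `Application` fields, expected-loss thresholds and tier names are illustrative, not from any airline's actual plan:

```python
from dataclasses import dataclass

@dataclass
class Application:
    name: str
    revenue_impact_per_hour: float  # estimated dollars lost per hour of downtime
    annual_failure_rate: float      # expected outages per year

def failover_tier(app: Application) -> str:
    """Assign a disaster-recovery tier from business criticality and failure rate."""
    expected_annual_loss = app.revenue_impact_per_hour * app.annual_failure_rate
    if expected_annual_loss > 1_000_000:
        return "hot"    # always-on parallel systems, continuously synchronized
    if expected_annual_loss > 100_000:
        return "warm"   # standby systems, minutes to switch over
    return "cold"       # restore from backups; hours of downtime acceptable

reservations = Application("reservations", revenue_impact_per_hour=500_000, annual_failure_rate=3)
hr_portal = Application("hr-portal", revenue_impact_per_hour=1_000, annual_failure_rate=2)
print(failover_tier(reservations))  # hot
print(failover_tier(hr_portal))     # cold
```

Even a crude expected-loss estimate like this makes the budget conversation concrete: the reservation system earns a hot standby, the HR portal does not.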

To minimize the risk of a major IT outage bringing down your business and alienating customers, the recipe is as follows: Analyze the points of failure, determine what is reparable and what is not easily fixed (for example, installing software usually can be done much more quickly than building a new server environment) and, finally, install and test failover systems and processes that can maintain basic operations with minimal customer disruption.
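The first step of that recipe—analyzing the points of failure—can be partly automated: given a dependency graph, simulate each component failing and check whether a critical service loses its path. A minimal sketch; the topology and component names are hypothetical:

```python
def reachable(graph, start, removed):
    """Set of nodes reachable from `start` when `removed` is treated as failed."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or node == removed:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen

def single_points_of_failure(graph, service, dependency):
    """Components whose individual failure cuts `service` off from `dependency`."""
    nodes = set(graph) | {n for deps in graph.values() for n in deps}
    spofs = []
    for node in nodes:
        if node in (service, dependency):
            continue
        if dependency not in reachable(graph, service, removed=node):
            spofs.append(node)
    return sorted(spofs)

# Hypothetical topology: the app reaches its database through a single router.
topology = {
    "reservations-app": ["router-1"],
    "router-1": ["database"],
}
print(single_points_of_failure(topology, "reservations-app", "database"))  # ['router-1']
```

A lone router showing up in the output is exactly the Southwest scenario: one box whose failure severs every critical path.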

Delta & Southwest: A Quick Analysis

In both cases, Delta and Southwest had backup systems that did not deploy properly—or, at least, not in a timely fashion. The first point of action would be conducting thorough, regular testing of those backup systems and locations and upgrading capabilities if necessary. Second, since no disaster recovery system is foolproof, create manual workarounds for critical system outages; that would have helped tremendously, so customers weren’t completely stranded. Third, create a fully redundant second data center that switches over automatically when the primary data center goes down. Naturally, this is the most expensive option, but depending upon the outcome of the risk analysis, it may be worth the money.
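That thorough, regular testing of backup systems lends itself to automation. A minimal sketch of a check a monitoring job might run—the thresholds (24-hour backup age, 30-day restore-drill age) are illustrative assumptions, not any airline's actual policy:

```python
import datetime

def check_recovery_readiness(last_backup_utc, last_restore_test_utc, now_utc):
    """Flag a system whose backups are stale or whose restore path is untested.

    Illustrative thresholds: backups older than 24 hours, or restore drills
    older than 30 days, become findings to escalate.
    """
    findings = []
    if now_utc - last_backup_utc > datetime.timedelta(hours=24):
        findings.append("backup is stale")
    if now_utc - last_restore_test_utc > datetime.timedelta(days=30):
        findings.append("restore procedure untested")
    return findings

now = datetime.datetime(2016, 9, 29, 12, 0)
print(check_recovery_readiness(
    last_backup_utc=datetime.datetime(2016, 9, 27),       # two days old
    last_restore_test_utc=datetime.datetime(2016, 8, 1),  # nearly two months old
    now_utc=now,
))  # ['backup is stale', 'restore procedure untested']
```

The point is not the thresholds but the habit: a backup that has never been restored, on a schedule, is an assumption rather than a safeguard.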

With Southwest, a router failure caused a cascading effect that brought down the network altogether. It’s not clear if a secondary network was available for failover, although it appears the answer is no. Again, better, more frequent monitoring could have isolated the router problem and alerted IT to fix or replace the hardware before disaster struck.
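The kind of monitoring that could have caught the failing router can be as simple as alerting on consecutive failed health probes. A sketch, with a hypothetical probe log and an arbitrary threshold of three:

```python
def should_alert(probe_history, threshold=3):
    """Alert only after `threshold` consecutive failed health probes,
    separating a degrading router from a one-off dropped packet."""
    if len(probe_history) < threshold:
        return False
    return not any(probe_history[-threshold:])

# Hypothetical probe log: True = router answered, False = probe failed.
print(should_alert([True, True, False]))          # False — single blip, no page
print(should_alert([True, False, False, False]))  # True — escalate before the cascade
```

Real monitoring stacks add latency and error-rate signals on top of this, but even a consecutive-failure rule gives IT a window to replace hardware before a cascade starts.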

Any company with technology from many different vendors should work closely with those vendors and partners whenever possible to mitigate possible risks. Vendors often are able to provide insight into how they test their own systems and can even share test cases to assist in validating the end-to-end system. Keep in mind that aging technology is not the only problem—sometimes new technology brings untenable risks. A recent example is the Samsung Galaxy Note 7, with its overheating battery problem and recall. As well, it’s smart to understand how the local utility handles its outages and what, if any, help it provides to customers in the event of a service disruption.

Planning Ahead

Large companies with complex IT infrastructures should take a holistic view of critical system availability, considering all of the possible ripple effects from a single point of failure. Today’s systems and networks are increasingly interconnected and interdependent. For both airlines, the effect on customers and employees lasted far beyond the actual outage, as managers struggled to get agents and planes reorganized and rescheduled. Take that risk into any spending calculations for the disaster recovery plan. Companies that have gone through mergers and acquisitions have additional complexities from IT integrations—or lack thereof. Failover strategies should ideally be incorporated upfront as part of any acquisition plans.

Comprehensive testing strategies are paramount, although sometimes overlooked, in preventing outages. Request information from core vendors, such as the networking equipment provider, on their testing processes and ask to see their test cases. Though each vendor should be responsible for thoroughly testing its own products, the customer is always the one who suffers the cost of a downed system. Development and testing teams should get clear on how external systems and providers handle downtime and equipment failure. Strive to align the company’s disaster recovery processes with those of partners.

Finally, for companies running mission-critical systems in the cloud, there are ways to mitigate the effects of an outage. Multi-cloud management tools such as CliQr (now owned by Cisco) allow IT to quickly redeploy workloads from one cloud service to another whenever needed. Containers are also helping in this regard due to the inherent portability benefits of the technology. Companies are using containers to move applications and workloads from region to region within one cloud provider; from cloud to cloud, such as from AWS to Azure; and even from the cloud to an on-premises data center.

There are many causes of IT outages today, and as IT environments become more multifaceted, supporting more technologies, data and applications all the time, those risks grow. We don’t have an answer to the problem, but there is one strategy: leave no stone unturned. Involving testers and developers in outage prevention and disaster recovery planning is a wise move, indeed.

About the Author / Kevin Dunne

Kevin Dunne is vice president of strategy and business development at QASymphony. Prior to joining QASymphony, Dunne was a business technology analyst at Deloitte. He can be found on Twitter and LinkedIn.

Filed Under: Blogs, Enterprise DevOps Tagged With: disaster, disaster recovery, failover, failure, hardware, IT infrastructure, software, systems
