According to a New Relic survey of IT professionals, the median annual cost of IT outages has reached an astronomical $7.75 million. More than one-third of the 1,700 respondents said critical business application outages now cost more than $500,000 per hour. In a business environment where most interactions with customers, suppliers and business partners are conducted digitally, downtime has become not just a problem but an existential threat.
Most IT organizations carefully monitor two key metrics — mean time to detection (MTTD) and mean time to resolution (MTTR) — to track their success in identifying and remediating critical systems issues. The figures represent, respectively, the average time it takes administrators to identify that a problem has occurred and to remediate the underlying issues. They serve as a benchmark for understanding the organization’s level of awareness of the status of its systems and the speed with which problems can be diagnosed and acted upon.
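To make the two metrics concrete, here is a minimal sketch of how they could be computed from incident records. The Incident record, its field names and the sample timestamps are hypothetical. The sketch also assumes MTTR is measured from detection to resolution; some teams measure it from the moment the fault began instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    started: datetime   # when the fault actually began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detection: average of (detected - started)."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to resolution.

    Assumption: measured from detection to resolution.
    """
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


# Made-up sample data: two incidents with different detection and repair times.
incidents = [
    Incident(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 10), datetime(2024, 1, 5, 10, 10)),
    Incident(datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 20), datetime(2024, 2, 1, 14, 50)),
]

print("MTTD:", mttd(incidents))  # MTTD: 0:15:00
print("MTTR:", mttr(incidents))  # MTTR: 0:45:00
```

Whichever start point you choose for MTTR, apply it consistently so the figure remains comparable from one quarter to the next.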
There are many ways to improve performance against these metrics, but three stand out as the most effective, as evidenced by the New Relic survey results.
Monitor Everything
The less you know about a problem, the longer it takes to diagnose and fix. The New Relic survey measured 17 different observability categories, such as network monitoring, alerts, log management, browser monitoring and distributed tracing. Across the board, about two-thirds of the companies that used even one of those monitoring tools reported reduced MTTR, with many saying resolution times had fallen by more than 25%.
The one-third of respondents whose organizations had implemented full-stack observability, which is the ability to see everything in the tech stack that could affect the customer experience, reported the fewest outages, fastest MTTD and MTTR, lowest outage costs and highest median annual return on investment compared to all respondents. For example, those who have full-stack observability experience a median outage cost that is 37% lower than that of those without.
As the complexity of enterprise technology stacks increases, so does the difficulty of pinpointing the root cause of an outage. Any monitoring is better than no monitoring at all, and the more visibility you have into applications and infrastructure, the faster you can identify and resolve incidents. Strive for full-stack observability built around integrated observability data and a comprehensive dashboard.
Have an Incident Management Action Plan
The worst time to figure out how to resolve a problem is when you’re in the midst of a crisis. Organizations with mature observability practices also have extensive and documented response plans.
Your incident plan will be unique to your organization. Small companies and startups tend to have ad hoc response strategies based on the skills within the organization. Large enterprises may use a more formal methodology like IT service management that defines strict procedures and protocols. Companies that have embraced modern DevOps and Lean principles often use a hybrid approach that depends heavily on teamwork and collaborative problem-solving. All of these approaches can work if the team fully understands the chosen strategy.
Document your IT architecture, identifying known vulnerabilities and outlining procedures for responding to the most common types of incidents. Plans should cover obvious disruptions like fires and power outages, but you should also think about “unicorn” scenarios such as a rogue query that ties up a database server or a communication outage at a key supplier. Mature observability teams devote a portion of their resources to constantly testing new and unpredictable scenarios and use chaos engineering to randomly inject problems in a controlled manner to see if systems respond as expected.
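The chaos engineering idea can be sketched in a few lines. This is an illustration under stated assumptions, not a production tool: fetch_inventory, fetch_with_fallback and the "cached inventory" fallback are hypothetical stand-ins for a real dependency and its recovery path. The experiment injects failures at a chosen rate, using a seeded random generator so the run is repeatable, and then checks that the fallback absorbs every one of them.

```python
import random
from typing import Callable


class InjectedFault(ConnectionError):
    """Raised on purpose by the experiment to simulate a lost dependency."""


def with_fault_injection(func: Callable[[], str], failure_rate: float,
                         rng: random.Random) -> Callable[[], str]:
    """Wrap func so that a fraction of calls fail before reaching it."""
    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func()
    return wrapper


def fetch_inventory() -> str:
    # Hypothetical call to a supplier system; always succeeds in this sketch.
    return "live inventory"


def fetch_with_fallback(fetch: Callable[[], str]) -> str:
    """The behavior under test: serve cached data when the dependency is down."""
    try:
        return fetch()
    except ConnectionError:
        return "cached inventory"


def run_experiment(trials: int, failure_rate: float, seed: int = 0) -> dict[str, int]:
    """Run a controlled experiment and count how each request was served."""
    rng = random.Random(seed)  # seeded so the experiment can be repeated
    flaky_fetch = with_fault_injection(fetch_inventory, failure_rate, rng)
    results = [fetch_with_fallback(flaky_fetch) for _ in range(trials)]
    return {
        "live": results.count("live inventory"),
        "cached": results.count("cached inventory"),
    }


print(run_experiment(trials=1000, failure_rate=0.2))
```

If any request ended in an unhandled exception instead of a "live" or "cached" result, the experiment would have exposed a gap in the response plan before a real outage did.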
As you develop incident response procedures, document the processes in runbooks, which tell responders exactly what to do when a specific problem occurs. This not only creates a database of organizational knowledge about response strategies but also helps with onboarding new employees and covers for skills gaps that occur when team members are unavailable or leave the company.
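One way to keep runbooks searchable is to store them as structured data keyed by alert name. The sketch below is hypothetical throughout: the Runbook fields, the "db-rogue-query" key and the listed steps are illustrative assumptions, not a prescribed procedure. It uses the rogue-query scenario mentioned earlier as the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Runbook:
    """Hypothetical runbook record; adapt the fields to your own process."""
    title: str
    symptoms: tuple[str, ...]
    steps: tuple[str, ...]
    escalation: str


# Illustrative content only; real steps depend on your systems.
RUNBOOKS = {
    "db-rogue-query": Runbook(
        title="Rogue query is tying up the database server",
        symptoms=("Database CPU pinned near 100%", "Application requests timing out"),
        steps=(
            "Identify the longest-running query and the account that issued it.",
            "Confirm with the owning team that the query is safe to cancel.",
            "Cancel the query and verify that response times recover.",
            "Record the root cause and the fix in the incident log.",
        ),
        escalation="Page the on-call database administrator after 15 minutes.",
    ),
}


def find_runbook(alert_name: str) -> Optional[Runbook]:
    """Return the runbook for an alert, or None if none is documented yet."""
    return RUNBOOKS.get(alert_name)


def render(runbook: Runbook) -> str:
    """Format a runbook as numbered steps a responder can follow."""
    lines = [runbook.title, "Symptoms: " + "; ".join(runbook.symptoms)]
    lines += [f"{n}. {step}" for n, step in enumerate(runbook.steps, start=1)]
    lines.append("Escalation: " + runbook.escalation)
    return "\n".join(lines)


book = find_runbook("db-rogue-query")
if book is not None:
    print(render(book))
```

An alert with no matching entry returns None, which is itself a useful signal that a runbook still needs to be written.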
Fix for the Future
Technical debt is a long-term obligation that is created when a short-term – and incomplete – solution is applied. You should aim to minimize it in your observability operation.
When a problem occurs, the temptation is to fix it as quickly as possible. Human nature is to move on to the next task without taking the time to go back and understand why a problem occurred. “Quick and dirty” fixes are an invitation to larger problems in the future. Your observability plan should include a forensics stage that documents root causes and the fixes that were applied. Even if it isn’t convenient to implement a long-term solution immediately, the documentation serves as a guide for implementing a permanent fix down the road.
Building a resilient IT architecture is a marathon, not a sprint. The more rigor you apply to identifying and resolving problems when they happen, the better prepared you will be to prevent their recurrence and the more informed your team will be about the resources they manage. Reducing your MTTD and MTTR will not only improve the bottom line of your business but also help ensure that your customers have an amazing digital experience.