There are easier ways to manage incident response without creating war rooms and packing IT staff onto bridge calls
Your phone vibrates at 11 p.m., and you know that can only mean another major incident with one of the business’s critical systems. You get geared up for the war room, dial into the bridge call and start reviewing the major incident report. You do this instinctively because it’s the third major incident in as many weeks, and you wonder if there’s a faster, easier, less stressful way to keep critical systems running, especially during periods of operational uncertainty.
You’re not the only one recognizing that having too many major incidents is a significant issue. In a recent survey on the future of monitoring and AIOps, 94% of respondents stated that issue resolution is critical to their business, but only 28% are satisfied with their handling of major incidents. Too many organizations feel that their only path toward improving their incident handling capabilities involves redesigning their critical business applications, facing down their technical debt or building out an SRE practice.
There are easier ways to manage major incidents without creating war rooms and packing IT staff onto bridge calls. When used well, AIOps techniques can address the pain points in resolving major incidents that drove business leaders and CIOs to establish war rooms and bridge calls in the first place. To understand what AIOps capabilities are needed, it helps to review the history of war rooms and bridge calls and why they are inefficient in solving today’s operational challenges.
Furthermore, IT organizations can take advantage of AIOps platforms much faster than they can re-architect applications, address technical debt or hire more site reliability engineers.
How Bridge Calls Became the Status Quo in Incident Response
No one wants unreliable or poorly performing business systems. It’s why IT teams have become extremely proficient at recovering from incidents, especially the ones that are easy to diagnose and easy to resolve. Problems such as web servers going down, databases running out of storage or services stuck in deadlocks are relatively easy to diagnose with today’s monitoring tools. In fact, over the last few years, many IT Ops groups have used tools to automate the recovery from these common issues.
But more complex issues are harder to solve. Issues such as:
- Problems causing a cascading failure of dependent systems that are all sending frequent alerts.
- Issues right after major application deployments or infrastructure changes.
- Bottlenecks in customer-facing applications that are experiencing unusually heavy loads.
IT’s history of solving these complex incidents isn’t great. The incident management team responds and calls in help from Tier One support. With so many things going wrong, operational teams have no choice but to call in higher levels of support, including developers. By the time someone communicates to the business on the incident’s status, leaders are irate over the lack of communication and the time it is taking to recover from the issue.
CIOs and IT leaders dislike being yelled at and seeing long outages. Their easiest management response is to get all the experts in the room, often called the war room, in the hope that having more people involved is better. War rooms often include bridge calls to allow remote people to attend, and bridge calls are standard practice for major incidents occurring off-hours.
Bridge Calls Don’t Solve Incident Management Problems Better
It surely isn’t better for the operational engineers to be responding to off-hour incidents regularly. It’s not better if a seasoned major incident manager is needed to oversee the bridge call and ensure differing opinions don’t escalate into arguments or finger-pointing. It isn’t better if the recovery times miss business objectives and if root causes are never identified. It isn’t better if the resolution times to recurring problems don’t improve.
It’s also really bad if the number of complex issues is increasing. This is likely to be the case as application architectures based on hybrid cloud or containers and microservices add complexity while frequent releases from DevOps teams add to risk. Also, global events such as COVID-19 create usage uncertainties and network bottlenecks during periods of increased business importance.
Solving for the Root Cause of Inefficiencies
Bridge calls and war rooms are inefficient practices for solving complex problems. They require too many people, take too long and need too many follow-ups to identify the root cause.
IT has built up an arsenal of tools to better manage myriad operational domains. We’re using one set of monitoring tools for the data center and a separate one for the public cloud. Every database, API, microservice, application component and type of IoT device has different tools to monitor performance. In fact, according to the future of monitoring and AIOps survey, about 20% of respondents reported having 25 or more monitoring tools. In addition, there are all the system, network and application log files, which often hold the most critical information when troubleshooting a complex issue.
That’s many tools, considerable data and countless alerts to sift through under the pressure of a major incident. These tools often require different skills and subject matter expert participation, which is another reason why bridge calls and war rooms are so crowded.
Now, AIOps can mean many different things to different people. But if it’s going to improve major incident recoveries without requiring bridge calls or war rooms, then it must:
- Aggregate all the data from these monitoring tools.
- Aggregate data from all of your change and topology tools, so you can find out which changes led to an incident or what’s affected by the incident.
- Sequence the data into a time series showing what issues came first.
- Correlate the hundreds or even thousands of alerts into discrete events that operators should review.
- Present the information in an easy-to-use console for operators to assess during major incidents.
- Enable engineers to automate steps in the recovery.
- Integrate with different workflow, collaboration and communication tools to automate incident response steps.
- Share the event sequence with engineering and development teams.
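To make the sequencing and correlation steps in the list above more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Alert` and `Event` types, the `correlate` and `related` functions, the five-minute window and the dependency map are all hypothetical, and a real AIOps platform would typically rely on machine learning models rather than this simple time-and-topology rule.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # monitoring tool that raised the alert (hypothetical)
    service: str         # affected service, as named in the topology data
    timestamp: datetime
    message: str


@dataclass
class Event:
    alerts: list = field(default_factory=list)

    @property
    def first_alert(self):
        # Alerts are appended in time order, so the first one came earliest.
        return self.alerts[0]

    @property
    def services(self):
        return {a.service for a in self.alerts}


def related(a, b, depends_on):
    """True if two services are the same or directly linked in the topology."""
    return a == b or b in depends_on.get(a, ()) or a in depends_on.get(b, ())


def correlate(alerts, depends_on, window=timedelta(minutes=5)):
    """Sequence alerts by time, then group them into discrete events.

    An alert joins an existing event when it arrives within `window` of that
    event's latest alert and its service is related to one already in the
    event. Otherwise it starts a new event. This is a deliberately naive
    rule, assumed here for illustration only.
    """
    events = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for event in events:
            recent = alert.timestamp - event.alerts[-1].timestamp <= window
            if recent and any(
                related(alert.service, s, depends_on) for s in event.services
            ):
                event.alerts.append(alert)
                break
        else:
            events.append(Event([alert]))
    return events


if __name__ == "__main__":
    # Hypothetical topology: web depends on api, api depends on db.
    topology = {"web": ["api"], "api": ["db"]}
    night = datetime(2020, 6, 1, 23, 0)
    alerts = [
        Alert("apm-tool", "web", night + timedelta(minutes=2), "high latency"),
        Alert("db-monitor", "db", night, "disk full"),
        Alert("cloud-monitor", "api", night + timedelta(minutes=1), "5xx spike"),
        Alert("scheduler", "batch", night + timedelta(minutes=30), "job failed"),
    ]
    for event in correlate(alerts, topology):
        first = event.first_alert
        print(f"{len(event.alerts)} alert(s), started with {first.service}: {first.message}")
```

In this sketch, alerts arriving out of order from three different tools collapse into a single event that starts with the database, while the unrelated batch failure stays separate. That ordered, grouped view is the kind of storyline an operator could review in place of a raw stream of alerts.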
During incidents, the differentiator is that machine learning models have gotten a head start in processing all the data. Instead of dozens of tools requiring lots of people, the start of the incident review can begin by examining information in a single platform. Instead of IT Ops being inundated with too much noisy data and alerts, machine learning has already processed and correlated events into a more easily decipherable storyline. Instead of manual recovery steps, automated steps can be centralized and orchestrated from a single console. Lastly, since the organization’s workflow, collaboration and communication tools are integrated, no one should feel out of the loop on issue status and postmortem steps to make improvements.
IT teams often celebrate when a legacy system is shut down because operations have fully transitioned to a modernized platform. Isn’t it time to do the same with war rooms and bridge calls?