DevOps.com

Leading Effective Incident Response Without Interminable Bridge Calls

By: Isaac Sacolick on December 15, 2020

There are easier ways to manage incident response without creating war rooms and packing IT staff onto bridge calls

Your phone vibrates at 11 p.m., and you know that can only mean another major incident with one of the business’ critical systems. You get geared up for the war room, dial into the bridge call and start reviewing the major incident report. You do this instinctively because it’s the third major incident in as many weeks, and you wonder if there’s a faster, easier, less stressful way to keep critical systems running, especially during periods of operational uncertainties.

You’re not the only one recognizing that having too many major incidents is a significant issue. In a recent survey on the future of monitoring and AIOps, 94% of respondents stated that issue resolution is critical to their business, but only 28% are satisfied with their handling of major incidents. Too many organizations feel that their only path toward improving their incident handling capabilities involves redesigning their critical business applications, facing down their technical debt or building out an SRE practice.

There are easier ways to manage major incidents without creating war rooms and packing IT staff onto bridge calls. When used well, AIOps techniques can address the pain points in resolving major incidents that drove business leaders and CIOs to establish war rooms and bridge calls in the first place. To understand what AIOps capabilities are needed, it helps to review the history of war rooms and bridge calls and why they are inefficient in solving today’s operational challenges.

Furthermore, IT organizations can adopt AIOps platforms much faster than they can re-architect applications, address technical debt, or hire more site reliability engineers.

How Bridge Calls Became the Status Quo in Incident Response

No one wants unreliable or poorly performing business systems. It’s why IT teams have become extremely proficient at recovering from incidents, especially the ones that are easy to diagnose and easy to resolve. Problems such as web servers going down, databases running out of storage or services stuck in deadlocks are relatively easy to diagnose with today’s monitoring tools. In fact, over the last few years, many IT Ops groups have used tools to automate the recovery from these common issues.

But more complex issues are harder to solve, such as:

  • Problems causing a cascading failure of dependent systems that are all sending frequent alerts.
  • Issues right after major application deployments or infrastructure changes.
  • Bottlenecks in customer-facing applications that are experiencing unusually heavy loads.

IT’s history of solving these complex incidents isn’t great. The incident management team responds and calls in help from Tier One support. With so many things going wrong, operational teams have no choice but to call in higher levels of support, including developers. By the time someone communicates to the business on the incident’s status, leaders are irate over the lack of communication and the time it is taking to recover from the issue.

CIOs and IT leaders dislike being yelled at and seeing long outages. Their easiest management response is to get all the experts in the room, often called the war room, in the hope that having more people involved is better. War rooms often include bridge calls to allow remote people to attend, and bridge calls are standard practice for major incidents occurring off-hours.

Bridge Calls Don’t Solve Incident Management Problems Better

It surely isn’t better for the operational engineers to be responding to off-hour incidents regularly. It’s not better if a seasoned major incident manager is needed to oversee the bridge call and ensure differing opinions don’t escalate into arguments or finger-pointing. It isn’t better if the recovery times miss business objectives and if root causes are never identified. It isn’t better if the resolution times to recurring problems don’t improve.

It’s also really bad if the number of complex issues is increasing. This is likely to be the case as application architectures based on hybrid cloud or containers and microservices add complexity while frequent releases from DevOps teams add to risk. Also, global events such as COVID-19 create usage uncertainties and network bottlenecks during periods of increased business importance.

Solving for the Root Cause of Inefficiencies

Bridge calls and war rooms are inefficient practices for solving complex problems. They require too many people, take too long and need too many follow-ups to identify the root cause.

IT has built up an arsenal of tools to better manage myriad operational domains. We’re using one set of monitoring tools for the data center and a separate one for the public cloud. Every database, API, microservice, application component and type of IoT device has different tools to monitor performance. In fact, according to the future of monitoring and AIOps survey, about 20% of respondents reported having 25 or more monitoring tools. In addition, there are all the system, network and application log files, which often hold the most critical information when troubleshooting a complex issue.

That’s many tools, considerable data and countless alerts to sift through under the pressure of a major incident. These tools often require different skills and subject matter expert participation, which is another reason why bridge calls and war rooms are so crowded.

Now, AIOps can mean many different things to different people. But if it’s going to improve major incident recoveries without requiring bridge calls or war rooms, then it must:

  • Aggregate all the data from these monitoring tools.
  • Aggregate data from all of your change and topology tools, so you can find out which changes led to an incident or what’s affected by the incident.
  • Sequence the data into a time series showing what issues came first.
  • Correlate the hundreds or even thousands of alerts into discrete events that operators should review.
  • Present the information in an easy-to-use console for operators to assess during major incidents.
  • Enable engineers to automate steps in the recovery.
  • Integrate with different workflow, collaboration and communication tools to automate incident response steps.
  • Share the event sequence with engineering and development teams.
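
To make the aggregate, sequence and correlate steps above more concrete, here is a minimal Python sketch. It is not how any particular AIOps platform works. The Alert and Event classes, the tool names, the service names, the depends_on topology map and the five-minute window are all hypothetical, and a real platform would likely use machine learning rather than this simple rule: an alert joins an existing event if it arrives soon after that event's last alert and concerns the same service or a directly dependent one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Alert:
    source: str          # hypothetical monitoring tool that raised the alert
    service: str         # component the alert is about
    timestamp: datetime
    message: str


@dataclass
class Event:
    alerts: list = field(default_factory=list)

    @property
    def services(self):
        return {a.service for a in self.alerts}


def _related(service, services, depends_on):
    """True if service is already in the event or directly linked to one of its services."""
    if service in services:
        return True
    return any(
        s in depends_on.get(service, ()) or service in depends_on.get(s, ())
        for s in services
    )


def correlate(alert_feeds, depends_on, window=timedelta(minutes=5)):
    """Merge per-tool alert feeds, order them by time and group related alerts."""
    # Aggregate every feed into one list, then sequence it by timestamp.
    timeline = sorted(
        (a for feed in alert_feeds for a in feed), key=lambda a: a.timestamp
    )
    events = []
    for alert in timeline:
        for event in events:
            recent = alert.timestamp - event.alerts[-1].timestamp <= window
            if recent and _related(alert.service, event.services, depends_on):
                event.alerts.append(alert)
                break
        else:
            # No existing event matched, so this alert starts a new one.
            events.append(Event([alert]))
    return events


# Illustrative data only: three tools report four alerts within a few minutes.
t0 = datetime(2020, 12, 15, 23, 0)
feeds = [
    [Alert("db-monitor", "orders-db", t0, "disk 95% full")],
    [
        Alert("apm", "orders-api", t0 + timedelta(minutes=1), "p99 latency high"),
        Alert("apm", "checkout-web", t0 + timedelta(minutes=2), "5xx rate high"),
    ],
    [Alert("netmon", "vpn-gateway", t0 + timedelta(minutes=3), "packet loss")],
]
depends_on = {"orders-api": {"orders-db"}, "checkout-web": {"orders-api"}}

events = correlate(feeds, depends_on)
for event in events:
    first = event.alerts[0]
    print(f"{first.timestamp:%H:%M} {first.service}: {len(event.alerts)} alert(s)")
```

With this sample data, the three alerts along the orders-db, orders-api and checkout-web dependency chain collapse into one event that starts with the database alert, while the unrelated vpn-gateway alert becomes a second event. An operator then reviews two events instead of four alerts, and the first alert in each event is a reasonable place to start looking for the cause.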

During incidents, the differentiator is that machine learning models have gotten a head start in processing all the data. Instead of dozens of tools requiring lots of people, the start of the incident review can begin by examining information in a single platform. Instead of IT Ops being inundated with too much noisy data and alerts, machine learning has already processed and correlated events into a more easily decipherable storyline. Instead of manual recovery steps, automated steps can be centralized and orchestrated from a single console. Lastly, since the organization’s workflow, collaboration and communication tools are integrated, no one should feel out of the loop on issue status and postmortem steps to make improvements.

IT teams often celebrate when a legacy system is shut down because operations have fully transitioned to a modernized platform. Isn’t it time to do the same with war rooms and bridge calls?

Filed Under: Blogs, DevOps Practice, Doin' DevOps Tagged With: bridge call, crisis management, incident, incident response, war room
