The cliché is that everyone in IT hates incidents, and the natural reaction when assembling incident response metrics is to look for numbers that you can lower over time. Fewer incidents and shorter incident response times must be better, we think. You might already be familiar with the common metrics associated with these goals, including mean time-to-resolution (MTTR) and total incident count.
As it turns out, however, these metrics actually work against your company’s resilience goals. To understand why, let’s look at the industry that birthed most of our modern incident management practices and has learned the most about how to properly apply them: the airline industry.
What the Airlines Can Teach Us
From the 1950s to the 1990s, the airline industry built and refined the incident management practices that the tech industry has since simplified and widely adopted. What’s often termed “modern incident management” is simply a predictable, repeatable approach to handling unexpected disasters, paired with the learning-driven remediation steps that are prized when building a DevOps or SRE practice.
One of the most important findings to come out of that work was the somewhat surprising fact that incidents weren’t being declared often enough. Incident reports were created, and the carefully assembled response and cataloging process was followed, only after catastrophic events, which meant earlier opportunities to catch the underlying problem during a less severe incident were being missed.
So why did an industry with such incredibly high stakes take so long to realize that more incidents were actually a good thing? You guessed it: a misguided desire to decrease the incident count. In chasing this goal of “lowering the number,” pilots and mechanics were often reluctant to file incident reports for anything short of the most critical situations, so the powerful, learning-focused response and remediation process was triggered only rarely.
This is a textbook example of Goodhart’s Law in action. To paraphrase, Charles Goodhart posited that once a metric becomes a target, people optimize for the number itself and it stops being a reliable measure. In other words, if the incentives are aligned to reduce incidents, then no matter how good your response and learning process is, it will only be executed in the most dire and unavoidable circumstances. After discovering this, the industry rapidly shifted its focus toward expanding the use of the incident process. It broadened the spectrum of situations that could be classified as incidents so that the formal response process would be triggered more often, distributed learnings from more incidents of lower relative severity and, yes, ultimately reset the goal to increasing the number of incidents.
This approach effectively turned Goodhart’s Law to the airlines’ advantage: any effort to game the metrics meant more learning, not less. Filing incidents became directly and heavily incentivized, and these metrics were never used as a justification for punitive action. This resulted in the safest decade of commercial aviation on record, with over 12 billion passengers traveling and no fatal accidents.
What IT Can Learn About Successful Incident Response
As technology companies, we have much to learn from this exercise in achieving resilience rather than chasing “lower numbers.” The primary learning is this: incentivizing the use of your incident management process for more non-critical incidents (likely your SEV2s and below), and focusing your metrics and incentives on increasing that incident count, is likely the best way to reduce your overall count of catastrophic failures.
For most companies, this means putting MTTR on the back burner and instead focusing on making the incident process itself more accessible throughout your company and across your organizations. It means leaders should talk about impactful and recent postmortems publicly and that postmortem distribution should be a priority; it should also be something that’s measured. Most importantly, it means opening up the incident management process more widely. Ideally, everyone at your company should have access to systems that allow them to file incidents, read incidents in progress and review postmortems from previous incidents.
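To make “something that’s measured” a little more concrete, here is a minimal Python sketch of what positive incident metrics might look like. Everything in it is an assumption made for illustration: the `Incident` fields, the `positive_incident_metrics` function and the choice of metrics are hypothetical and would need to be adapted to whatever your incident tooling actually records. The point is only that each number goes up as the incident process gets used more widely, so gaming it means more learning rather than less.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """One filed incident. All field names are hypothetical."""
    severity: int            # 1 = SEV1 (most severe); larger = less severe
    reporter: str            # who filed the incident
    postmortem_readers: int  # how many people opened the postmortem


def positive_incident_metrics(incidents, headcount):
    """Summarize metrics that rise as the incident process is used more."""
    if not incidents or headcount <= 0:
        raise ValueError("need at least one incident and a positive headcount")

    by_severity = Counter(i.severity for i in incidents)
    # "SEV2 and below" means a severity number of 2 or higher.
    non_critical = sum(n for sev, n in by_severity.items() if sev >= 2)
    reporters = {i.reporter for i in incidents}
    total_reads = sum(i.postmortem_readers for i in incidents)

    return {
        "total_incidents": len(incidents),
        # Share of incidents that were caught before becoming a SEV1.
        "non_critical_share": non_critical / len(incidents),
        # Share of the company that has filed at least one incident.
        "reporter_share": len(reporters) / headcount,
        # Rough proxy for postmortem distribution.
        "avg_postmortem_readers": total_reads / len(incidents),
    }


if __name__ == "__main__":
    sample = [
        Incident(severity=1, reporter="ana", postmortem_readers=40),
        Incident(severity=2, reporter="ben", postmortem_readers=10),
        Incident(severity=3, reporter="ana", postmortem_readers=6),
        Incident(severity=3, reporter="chi", postmortem_readers=8),
    ]
    for name, value in positive_incident_metrics(sample, headcount=12).items():
        print(f"{name}: {value}")
```

None of these numbers should be tied to punitive action. They are meant to reward reporting, not to rank teams.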
I like to reference the “big red button” philosophy of factory floors, where the decision-making power to declare an incident is in the hands of every employee. An unsafe situation shouldn’t be up for discussion, debate or approval before being acted on, even if the situation is later determined to be less severe than originally thought. The context of why that incident was filed—why that big red button was pressed—may prevent larger, more catastrophic failures in the future.
So if you’re just getting started building out an incident management practice at your company, or you’ve been chasing MTTR and pushing your incident count down as a primary goal, consider taking these learnings to heart. Create a positive incident culture, with positive metrics meant to encourage and reward incident responders and reporters. Your company’s future depends on it.