Most folks working in DevOps or SRE roles are familiar with metrics like mean-time-to-recovery (MTTR). Keeping track of the average time a team takes to respond to incidents is crucial to identifying bottlenecks in the support process. It’s also something executives like to show higher-ups when sharing a snapshot of overall platform performance. However, focusing on a single metric can miss the bigger picture.
For example, how long did it take to discover the incident? How long did it take from discovery until action was taken? What was the timeframe between filing a ticket and updating all clients that had a bug? When can you say you’ve completely resolved an issue? As you can see, there are many potential metrics that could inform software reliability and the platform engineering process. “One number is never going to tell you a complete story,” said Emily Arnott, community relations manager, Blameless.
I recently met with the Blameless team for a closer look into mean-time reliability metrics. Below, we’ll explore the nuances behind five different types of MTTX metrics and consider the business value of keeping tabs on each type.
Mean-time-to-detect is a measurement of how long it takes, on average, to detect that an incident is present. This often happens automatically within a monitoring system. Perhaps a tool sends out an alert when latency meets a certain threshold, for example. But, detection could also come from other sources, such as a customer complaint.
For example, consider Log4j. Perhaps a runtime vulnerability scanning tool noticed one of your components is impacted by a novel CVE and automatically sent a notification to the appropriate team channel.
Now, just because an incident is detected doesn’t mean it’s immediately acknowledged. Mean-time-to-acknowledge, then, is a measurement of how long it takes a human being to recognize the incident and begin to act on it.
In our CVE example, this would be the time between vulnerability detection and initial response: the on-call incident manager receives the vulnerability notification, reads the exposure details, files a Jira ticket and contacts the relevant team members.
Many factors might extend mean-time-to-acknowledge. Perhaps someone isn’t logged into Slack, or a hardware issue stalls the notification. Alert fatigue may also get in the way, or there may be a reluctance to file a ticket and introduce more toil. Depending on the severity of the incident, someone simply might not believe the alert is worth responding to.
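As a rough illustration of the two definitions above, the sketch below averages the gap between incident start and detection, and between detection and acknowledgment. The field names (`started_at`, `detected_at`, `acknowledged_at`) and the timestamps are hypothetical; real incident trackers record these moments differently, if at all.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names and timestamps are illustrative only.
incidents = [
    {"started_at": datetime(2022, 1, 10, 9, 0),
     "detected_at": datetime(2022, 1, 10, 9, 10),
     "acknowledged_at": datetime(2022, 1, 10, 9, 25)},
    {"started_at": datetime(2022, 1, 14, 22, 0),
     "detected_at": datetime(2022, 1, 14, 22, 20),
     "acknowledged_at": datetime(2022, 1, 14, 22, 25)},
    {"started_at": datetime(2022, 1, 20, 3, 0),
     "detected_at": datetime(2022, 1, 20, 3, 30),
     "acknowledged_at": datetime(2022, 1, 20, 4, 10)},
]

def mean_minutes(records, start_key, end_key):
    """Average gap, in minutes, between two timestamps across all records."""
    gaps = [(r[end_key] - r[start_key]).total_seconds() / 60 for r in records]
    return mean(gaps)

# Mean-time-to-detect: incident start until detection.
mttd = mean_minutes(incidents, "started_at", "detected_at")
# Mean-time-to-acknowledge: detection until a human begins to act.
mtta = mean_minutes(incidents, "detected_at", "acknowledged_at")

print(f"MTTD: {mttd:.1f} min, MTTA: {mtta:.1f} min")
```

Note that the third incident, acknowledged 40 minutes after a 3 a.m. alert, accounts for most of the acknowledgment average, which is the kind of gap this metric is meant to surface.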
Now, the following metrics are a little more open to interpretation, but we’ll do our best to define each. Mean-time-to-recover is the average time it takes to introduce a temporary fix after discovering an incident.
For example, if a particular region is experiencing outages, engineers might temporarily divert traffic to a more stable server. This is not a permanent fix, but the system recovers and operations are generally unaffected. This helps maintain the status quo while a more permanent solution is ideated, tested and applied.
Mean-time-to-repair (or restore) is the mean time it takes to issue a permanent repair to a system after discovering an incident. In order for a system to be considered fully repaired, it has to not just be working, but working robustly.
For example, let’s say the incident in question involves performance issues. Perhaps a patch is issued to the core branch to remove bulky code that’s causing slow load times for clients. Mean-time-to-repair introduces a more permanent solution but still might not be the complete resolution to the problem.
Mean-time-to-resolve can be thought of as the average time between when an incident occurs and when the issue is completely resolved. Not only is the core codebase patched, but all clients reliant upon the software have been updated, as well. Lessons are learned and mitigation plans are set to respond to similar incidents or vulnerabilities in the future.
Mean-time-to-resolve is about resolving the incident entirely, said Jake Englund, senior site reliability engineer, Blameless. This includes addressing the underlying fundamental contributors, completing all logs that remain on the back burner and following up with a retrospective.
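To make the difference between recover, repair and resolve concrete, here is a minimal sketch that follows the definitions used in this article: recover and repair are measured from discovery, while resolve is measured from when the incident began. The `Incident` class, its field names and the sample timelines are assumptions for illustration, and other organizations may choose different start and end markers.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    # Hypothetical timeline fields; real trackers name and record these differently.
    started_at: datetime
    detected_at: datetime
    mitigated_at: datetime  # temporary fix in place (recover)
    repaired_at: datetime   # permanent fix shipped (repair)
    resolved_at: datetime   # clients updated, retrospective done (resolve)

def hours(start, end):
    """Elapsed time between two timestamps, in hours."""
    return (end - start).total_seconds() / 3600

def mean_time_metrics(incidents):
    """Average each interval across incidents, using this article's definitions."""
    return {
        "recover": mean([hours(i.detected_at, i.mitigated_at) for i in incidents]),
        "repair": mean([hours(i.detected_at, i.repaired_at) for i in incidents]),
        "resolve": mean([hours(i.started_at, i.resolved_at) for i in incidents]),
    }

# Illustrative data only.
incidents = [
    Incident(datetime(2022, 3, 1, 8, 0), datetime(2022, 3, 1, 9, 0),
             datetime(2022, 3, 1, 10, 0), datetime(2022, 3, 2, 9, 0),
             datetime(2022, 3, 4, 8, 0)),
    Incident(datetime(2022, 3, 10, 12, 0), datetime(2022, 3, 10, 13, 0),
             datetime(2022, 3, 10, 16, 0), datetime(2022, 3, 11, 1, 0),
             datetime(2022, 3, 11, 12, 0)),
]

metrics = mean_time_metrics(incidents)
for name, value in metrics.items():
    print(f"mean-time-to-{name}: {value:.1f} h")
```

With this sample data, the three figures differ by more than an order of magnitude for the same pair of incidents, which is why a bare "MTTR" needs its definition stated alongside it.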
Using Mean-Time Metrics
Mean-time metrics can provide a quantitative picture of incident response performance, which can be valuable for overall business operations. Mean-time-to-acknowledge, for example, can expose gaps in the remediation process, such as cognitive strife in reporting incidents, said Matt Davis, intuition engineer, Blameless. Understanding these technical and human factors is the first step to making the resolution process swifter.
Of course, the above metrics rely on incident data, which may not always be a top priority. According to Davis, encouraging a culture that declares incidents—even minor ones such as a configuration change—can improve knowledge sharing within a team. “If you declare an incident, you could enact more systemic change,” added Arnott.
Limitations of Mean-Time Metrics
These mean-time metrics do have some limitations. “A number is only one part of the story,” said Davis. As a result, teams might struggle with deciding precisely what to detect. MTTR can be a helpful metric, but it’s the context that matters, he explained. Therefore, tracking multiple metrics can help provide a more sophisticated, nuanced picture. This involves looking beyond averages to consider outlier events, added Englund.
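The point about outliers can be shown with a few lines of arithmetic. In the sketch below, the recovery times are made-up numbers: four incidents recovered in under an hour and one took ten hours. Comparing the mean with the median and the worst case is one simple way, among others, to look beyond the average.

```python
from statistics import mean, median

# Hypothetical recovery times in minutes; the last value is a single long outage.
recovery_minutes = [30, 35, 40, 45, 600]

average = mean(recovery_minutes)    # pulled upward by the one outlier
typical = median(recovery_minutes)  # closer to what most incidents looked like
worst = max(recovery_minutes)       # the event the average hides

print(f"mean: {average} min, median: {typical} min, worst: {worst} min")
```

Here the mean is 150 minutes even though no incident took anywhere near that long: four were far shorter and one was far longer. Reporting the median and the worst case next to the mean keeps that outlier visible.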
There are also semantic nuances between the MTTX metrics defined above. “There’s a lot of ambiguity around these words,” said Davis. As a result, organizations might compute each figure differently. Some of these figures use time markers that may not be technically possible to track consistently, especially since each incident is unique. Demarcating the precise moment an incident began might require guesstimation. You might know when a service is fully restored, but the lasting customer perception is harder to gauge.
Another potential downside is that mean-time metrics could easily be manipulated or misinterpreted, whether deliberately or inadvertently. Operations leads might selectively recall specific windows when showing MTTX metrics to higher-ups, leaving out other statistics that paint a different picture.
Treat MTTX As A Guidepost
Improving incident response is becoming mission-critical to maintaining fully functional systems, as things like outages, downtime, slow speeds and zero-day vulnerabilities can negatively impact user experience. Sometimes, these issues must be addressed immediately to maintain SLAs.
But tracking reliability averages isn’t all that simple, and they will likely mean something different to each organization. “It’s about embracing complexity and asking the right questions,” said Davis.
In summary, mean-time reliability metrics provide helpful insight into the ongoing state of incident response. Yet, such metrics shouldn’t be imposed as a strict target — instead, they should be viewed as an informative guidepost. “Metrics can help you find what is discussion-worthy, but it’s not a discussion itself,” said Arnott.