5 Mean-Time Reliability Metrics To Follow

By: Bill Doerrfeld on July 7, 2022

Most folks working in DevOps or SRE roles are familiar with metrics like mean-time-to-recovery (MTTR). Keeping track of the average time a team takes to respond to incidents is crucial to identifying bottlenecks in the support process. It’s also something executives like to show higher-ups when sharing a snapshot of overall platform performance. However, focusing on a single metric risks missing the bigger picture.

For example, how long did it take to discover the incident? How long did it take from discovery until action was taken? What was the timeframe between filing a ticket and updating all clients that had a bug? When can you say you’ve completely resolved an issue? As you can see, there are many potential metrics that could inform software reliability and the platform engineering process. “One number is never going to tell you a complete story,” said Emily Arnott, community relations manager, Blameless.

I recently met with the Blameless team for a closer look into mean-time reliability metrics. Below, we’ll explore the nuances behind five different types of MTTX metrics and consider the business value of keeping tabs on each type.

1. Mean-Time-To-Detect

Mean-time-to-detect is a measurement of how long it takes, on average, to detect that an incident is present. This often happens automatically within a monitoring system. A monitoring tool might send out an alert when latency crosses a certain threshold, for example. But detection could also come from other sources, such as a customer complaint.

For example, consider Log4j. Perhaps a runtime vulnerability scanning tool noticed that one of your components was impacted by a novel CVE and automatically sent a notification to the appropriate team channel.
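To make the detection step concrete, here is a minimal sketch of the kind of threshold-based check described above. The 500 ms threshold, the read_p99_latency_ms callable and the polling interval are hypothetical placeholders, not any particular tool’s API; in practice the alert would come from your monitoring backend rather than a hand-rolled loop.

    import time
    from datetime import datetime, timezone
    from typing import Callable

    LATENCY_THRESHOLD_MS = 500  # hypothetical alerting threshold


    def watch_for_incident(read_p99_latency_ms: Callable[[], float],
                           poll_seconds: float = 30.0) -> datetime:
        """Poll a latency reading and return a detection timestamp once it crosses the threshold.

        `read_p99_latency_ms` stands in for whatever query your monitoring backend exposes.
        """
        while True:
            if read_p99_latency_ms() > LATENCY_THRESHOLD_MS:
                # Mean-time-to-detect runs from when the incident actually began to this
                # moment, so this detection timestamp is what gets recorded for the incident.
                return datetime.now(timezone.utc)
            time.sleep(poll_seconds)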

2. Mean-Time-To-Acknowledge

Now, just because an incident is detected doesn’t mean it’s immediately acknowledged. Mean-time-to-acknowledge, then, is a measurement of how long it takes a human being to register the incident and begin acting on it.

In our CVE example, this would be the time between vulnerability detection and initial response. For example, the on-call incident manager received a vulnerability notification, read the exposure details and then filed a JIRA ticket and contacted the relevant team members.

Many factors might extend mean-time-to-acknowledge. Perhaps someone isn’t logged into Slack, or a hardware issue stalls the notification. Alert fatigue may also get in the way, or there may be a reluctance to file a ticket and introduce more toil. Depending on the severity of the incident, someone simply might not believe the alert is worth responding to.

3. Mean-Time-To-Recover

Now, the following metrics are a little more open to interpretation, but we’ll do our best to define each. Mean-time-to-recover is the average time it takes to introduce a temporary fix after discovering an incident.

For example, if a particular region is experiencing outages, engineers might temporarily divert traffic to a more stable server. This is not a permanent fix, but the system recovers and operations are generally unaffected. This helps maintain the status quo while a more permanent solution is ideated, tested and applied.

4. Mean-Time-To-Repair

Mean-time-to-repair (or restore) is the mean time it takes to issue a permanent repair to a system after discovering an incident. For a system to be considered fully repaired, it must not just be working, but working robustly.

For example, let’s say the incident in question involves performance issues. Perhaps a patch is issued to the core branch to remove bulky code that’s causing slow load times for clients. The repair introduces a more permanent solution, but it still might not be the complete resolution to the problem.

5. Mean-Time-To-Resolve

Mean-time-to-resolve can be thought of as the average time from when an incident occurs to when the issue is completely resolved. Not only is the core codebase patched, but all clients reliant upon the software have been updated as well. Lessons are learned and mitigation plans are set to respond to similar incidents or vulnerabilities in the future.

Mean-time-to-resolve is about resolving the incident entirely, said Jake Englund, senior site reliability engineer, Blameless. This includes addressing the underlying fundamental contributors, completing all logs that remain on the back burner and following up with a retrospective.
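Taken together, the five definitions above boil down to averages over incident timestamps. The sketch below shows one way they might be computed, assuming each incident record carries occurred_at, detected_at, acknowledged_at, recovered_at, repaired_at and resolved_at timestamps; those field names, and the choice of start and end markers for each metric, are illustrative assumptions drawn from the definitions in this article rather than a standard schema.

    from dataclasses import dataclass
    from datetime import datetime
    from statistics import mean


    @dataclass
    class Incident:
        occurred_at: datetime      # when the incident actually began
        detected_at: datetime      # when monitoring (or a customer) surfaced it
        acknowledged_at: datetime  # when a human began acting on it
        recovered_at: datetime     # when a temporary fix restored service
        repaired_at: datetime      # when a permanent fix landed
        resolved_at: datetime      # when clients were updated and follow-up completed


    def mttx_hours(incidents: list[Incident]) -> dict[str, float]:
        """Compute the five mean-time metrics, in hours, over a list of incidents."""
        def avg(start_attr: str, end_attr: str) -> float:
            deltas = [
                (getattr(i, end_attr) - getattr(i, start_attr)).total_seconds() / 3600
                for i in incidents
            ]
            return mean(deltas)

        return {
            "MTTD": avg("occurred_at", "detected_at"),
            "MTTA": avg("detected_at", "acknowledged_at"),
            "MTT-Recover": avg("detected_at", "recovered_at"),
            "MTT-Repair": avg("detected_at", "repaired_at"),
            "MTT-Resolve": avg("occurred_at", "resolved_at"),
        }

Because, as discussed below, organizations compute each figure differently, the start and end markers here would need to be adjusted to match however your team demarcates its incidents.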

Using Mean-Time Metrics

Mean-time metrics can provide a quantitative picture of incident response performance, which can be valuable for overall business operations. Mean-time-to-acknowledge, for example, can expose gaps in the remediation process, such as cognitive strife in reporting incidents, said Matt Davis, intuition engineer, Blameless. Understanding these technical and human factors is the first step to making the resolution process swifter.

Of course, the above metrics rely on incident data, which may not always be a top priority. According to Davis, encouraging a culture that declares incidents—even minor ones such as a configuration change—can improve knowledge sharing within a team. “If you declare an incident, you could enact more systemic change,” added Arnott.

Limitations of Mean-Time Metrics

These mean-time metrics do have some limitations. “A number is only one part of the story,” said Davis. As a result, teams might struggle with deciding precisely what to detect. MTTR can be a helpful metric, but it’s the context that matters, he explained. Therefore, tracking multiple metrics can help provide a more sophisticated, nuanced picture. This involves looking beyond averages to consider outlier events, added Englund.
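One way to look beyond bare averages, as Englund suggests, is to report percentiles alongside the mean so that long-tail incidents stay visible. A minimal sketch with made-up time-to-resolve durations (in hours):

    from statistics import mean, quantiles

    # Hypothetical time-to-resolve durations (hours), including one long-tail
    # outlier that a bare average would blur together with the typical cases.
    resolve_hours = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 48.0]

    deciles = quantiles(resolve_hours, n=10)
    print(f"mean:   {mean(resolve_hours):.1f} h")   # ~9.2 h, skewed by the outlier
    print(f"median: {deciles[4]:.1f} h")            # 3.0 h, the typical incident
    print(f"p90:    {deciles[8]:.1f} h")            # highlights the long tail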

There are also semantic nuances between the MTTX metrics defined above. “There’s a lot of ambiguity around these words,” said Davis. As a result, organizations might compute each figure differently. Some of these figures use time markers that may not be technically possible to track consistently, especially since each incident is unique. Demarcating the precise moment an incident began might require guesstimation. You might know when a service is fully restored, but the lasting customer perception is harder to gauge.

Another potential downside is that mean-time metrics can easily be manipulated or misinterpreted, whether deliberately or inadvertently. Operations leads might selectively recall specific windows when showing MTTX metrics to higher-ups, leaving out other statistics that paint a different picture.

Treat MTTX As A Guidepost

Improving incident response is becoming mission-critical for maintaining fully functional systems, as outages, downtime, slow performance and zero-day vulnerabilities can all negatively impact user experience. Sometimes, these issues must be addressed immediately to maintain SLAs.

But tracking reliability averages isn’t all that simple, and these averages will likely mean something different to each organization. “It’s about embracing complexity and asking the right questions,” said Davis.

In summary, mean-time reliability metrics provide helpful insight into the ongoing state of incident response. Yet, such metrics shouldn’t be imposed as a strict target — instead, they should be viewed as an informative guidepost. “Metrics can help you find what is discussion-worthy, but it’s not a discussion itself,” said Arnott.
