DevOps.com


Reducing Incident Resolution Time

By: David Shackelford on March 31, 2015

From raw incident count to response time, there are several key metrics that top Operations teams track to measure and improve their performance. One of the most popular is mean time to resolution (MTTR): the time between a failure and recovery from that failure, which is directly linked to your uptime. While MTTR may be the gold standard when it comes to operational readiness, it’s important for teams to look at the bigger picture to effectively decrease incident resolution time.


Putting MTTR into perspective
Your overall downtime is a function of the number of outages as well as the length of each. Dan Slimmon does a great job discussing these two factors and how you may want to think about prioritizing them. Depending on your situation, it may be more important to minimize noisy alerts that resolve quickly, even though removing those short incidents from the average may actually increase your MTTR. But if you’ve identified MTTR as an area for improvement, here are some strategies that may help.
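To make the trade-off concrete, here is a minimal sketch of the arithmetic. The incident durations are made-up numbers for illustration, not data from any real team, and the helper names are hypothetical.

```python
def total_downtime(durations):
    """Total downtime in minutes: the sum of every incident's duration."""
    return sum(durations)


def mttr(durations):
    """Mean time to resolution in minutes: total downtime / incident count."""
    return sum(durations) / len(durations)


# Hypothetical month: eight noisy 2-minute alerts plus two 60-minute outages.
before = [2] * 8 + [60, 60]

# The same month after the noisy alerts have been eliminated.
after = [60, 60]

print(total_downtime(before), mttr(before))  # 136 13.6
print(total_downtime(after), mttr(after))    # 120 60.0
```

Total downtime drops from 136 to 120 minutes, yet MTTR climbs from 13.6 to 60 minutes, because only the long incidents remain in the average. This is why MTTR is best read alongside incident count rather than on its own.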


Working faster won’t solve the problem
It would be nice if we could fix outages faster simply by working faster, but we all know this isn’t true. To make sustainable, measurable improvements to your MTTR, you need to do a deep investigation into what happens during an outage. True, there will always be variability in your resolution time due to the complexity of incidents. But taking a look at your processes is a good place to start: often the key to shaving minutes lies in how your people and systems work together.

Check out your RESPONSE time
Some call it MTTA (mean time to acknowledge); others call it MTTR (the same acronym as above, but here meaning mean time to respond). Whatever the name, the clock starts ticking as soon as an incident is triggered, and with adjustments to your notification processes, you may be able to achieve some quick wins.
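As a rough illustration, both numbers can be computed from the same incident log. This sketch assumes each incident records three timestamps under the hypothetical keys `triggered`, `acknowledged` and `resolved`; the sample incidents are invented.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap in minutes between two timestamps across all incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


# Invented sample data: two incidents with trigger, acknowledge and resolve times.
incidents = [
    {
        "triggered": datetime(2015, 3, 1, 2, 0),
        "acknowledged": datetime(2015, 3, 1, 2, 4),
        "resolved": datetime(2015, 3, 1, 2, 34),
    },
    {
        "triggered": datetime(2015, 3, 9, 14, 0),
        "acknowledged": datetime(2015, 3, 9, 14, 10),
        "resolved": datetime(2015, 3, 9, 15, 0),
    },
]

# Response time: trigger to acknowledgement.
mtta = mean_minutes(incidents, "triggered", "acknowledged")
# Resolution time: trigger to recovery.
mttr = mean_minutes(incidents, "triggered", "resolved")

print(mtta, mttr)  # 7.0 47.0
```

Tracking the two side by side shows how much of your resolution time is spent simply waiting for someone to pick up the alert.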

If your response time is on the longer side, you may want to look at how the team is being alerted. Do alerts reliably reach the right person? If the first person notified does not respond, can the alerts automatically be escalated, and how much time do you really need to wait before moving on? Setting the right expectations and goals around response time can help ensure that all team members are responding to their alerts as quickly as possible.
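One way to reason about escalation timing is to write the policy down as data. The sketch below is an assumption-laden illustration, not any particular product's API: the `EscalationLevel` class, the `responder_at` function, the responder names and the wait times are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str     # who gets notified at this level
    wait_minutes: int  # how long to wait for an acknowledgement before escalating


def responder_at(policy, minutes_since_trigger):
    """Return who should be notified for a still-unacknowledged incident."""
    elapsed = 0
    for level in policy:
        elapsed += level.wait_minutes
        if minutes_since_trigger < elapsed:
            return level.responder
    # Every level has timed out: keep paging the last level
    # rather than silently dropping the alert.
    return policy[-1].responder


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary on-call", 5),
    EscalationLevel("secondary on-call", 10),
    EscalationLevel("engineering manager", 15),
]

print(responder_at(policy, 3))   # primary on-call
print(responder_at(policy, 5))   # secondary on-call
print(responder_at(policy, 20))  # engineering manager
```

Writing the waits out like this makes the worst case visible: with these numbers, an incident nobody acknowledges takes 15 minutes to reach the third level, which may or may not match the response-time goal you have set.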

Establish a process for outages
An outage is a stressful time, and it’s not when you want to be figuring out how you respond to incidents. Establish a process (even if it’s not perfect at first) so everyone knows what to do. When you’re designing your process for responding to incidents, make sure you have the following elements in place:

  1. Establish a communication protocol – if the incident is something more than one person needs to work on, make sure everyone understands where they need to be. Conference calls or Google Hangouts are a good idea, or a room in HipChat or Slack.
  2. Establish a leader – this is the person who will be directing the work of the team in resolving the outage. They will be taking notes and giving orders. If the rest of the team disagrees, the leader can be voted out, but another leader should be established immediately.
  3. Take great notes – about everything that’s happening during the outage. These notes will be a helpful reference when you look back during the post mortem. At PagerDuty, some of our call leaders like using a paper notebook beside their laptop as a visual reminder that they should be recording everything.
  4. Practice makes perfect – if you’re not having frequent outages, practice your incident response plan monthly to make sure the team is well-versed. Also, don’t forget to train new hires on the process.

To learn more, check out this talk about incident management from Blake Gentry, a former lead software engineer at Heroku.

Find and fix the problem
Finding out what is actually going wrong is often the lion’s share of your resolution time. It’s critical to have instrumentation and analytics for each of your services, and to make sure that information helps you identify what’s going wrong. For problems that are somewhat common and well understood, you may be able to implement automated fixes.
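An automated fix can be as simple as a lookup from a known alert type to a remediation routine, with everything else going to a human. The sketch below assumes a hypothetical alert format (a dict with `type`, `service` and `host` keys), and the two handlers are stubs that only describe the action they would take.

```python
def restart_worker(alert):
    """Stub remediation: a real handler would restart the stalled service."""
    return f"restarted {alert['service']}"


def clear_temp_files(alert):
    """Stub remediation: a real handler would free disk space on the host."""
    return f"cleared temp files on {alert['host']}"


# Only well-understood, recurring problems get an automated fix.
RUNBOOK = {
    "worker_stalled": restart_worker,
    "disk_full": clear_temp_files,
}


def handle(alert):
    """Apply a known fix if one exists; otherwise hand the alert to a person."""
    fix = RUNBOOK.get(alert["type"])
    if fix is None:
        return ("page_human", None)
    return ("auto_fixed", fix(alert))


print(handle({"type": "disk_full", "host": "db-1"}))
# ('auto_fixed', 'cleared temp files on db-1')
print(handle({"type": "unknown_latency_spike", "service": "api"}))
# ('page_human', None)
```

Keeping the runbook small and explicit means anything unfamiliar still reaches the on-call engineer, so automation shortens the common cases without hiding the unusual ones.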
