SRE’s Guide to Pragmatic Incident Response

By: Bobby Ross on September 7, 2021

In my past experience as an SRE, I learned some valuable lessons about how to respond to and learn from incidents. If you want the TL;DR, I’ll summarize them here:

  • Declare and run retros for the small incidents. It’s less stressful, and action items become much more actionable.
  • Decrease the time it takes to analyze an incident. You’ll remember more and learn more from it.
  • Alert on pain felt by people – not machines. The only reason we declare incidents at all is because of the people on the other side of those machines.

Now, let’s dive into each of these lessons a little deeper, and explore how they can help you build a better system for pragmatic incident response.

1. Focus on the Small Incidents First

The bad habit of ignoring small issues often leads to bigger issues. You should run retrospectives for small incidents (slowdowns, minor bugs, etc.) because they often have the most actionable takeaways, instead of shooting the moon and creating a “rearchitect async pipeline” Jira ticket that never happens. Focusing on low-stakes incidents and retrospectives is a great introduction to behavior change across your organization.

Let’s look at an obvious example of how a focus on small incidents can have an outsized impact. Case in point—my apartment. I live in an old candy factory retrofitted into apartment units. We have an elevator (thank God), but some of the buttons don’t light up when you press them. The LED display doesn’t match up with the numbers on the floor buttons. Yes, the elevator goes up and down, but overall, you can tell that things are wrong with it.

One day, I came home and noticed an “Out of Order” notice on the elevator door. Was I surprised? Not at all. Of course an elevator with mislabeled buttons and broken LEDs would stop working!

This isn’t dissimilar from an important software lesson—ignoring small issues often leads to much bigger issues. What starts as an inconvenience can lead to a complete breakdown. This is why you should focus on the small incidents first. I hear a lot of companies say, “We need to fix incident response! Every time we have an incident, it’s just chaos!”

They are talking about unexpected downtime during a high-severity incident, and they want to improve how they respond to it.

I say that’s the wrong type of incident to use to fix your incident response system and process. Focus on the small incidents first. Running retros for small incidents can help you build strong incident response models, because they have the most actionable takeaways and they are the best way to change behavior through repetition and practice.

If you have a high-stakes, 12-hour incident and you run a one-hour retrospective, you’re not going to get the results you want. You need to start small. Run retros for bugs that were introduced, or for a bad data migration that didn’t really impact anything but took up a couple of hours of your day.

Heidi Waterhouse captured this idea really well in this piece on reliability. Every airplane you’ve ever flown on has many tiny problems, Waterhouse said, “… like a sticky luggage latch or a broken seat or a frayed seatbelt. None of these problems alone are cause to ground the plane. However, if enough small problems compound, the plane may no longer meet the requirements for passenger airworthiness and the airline will ground it. A plane with many malfunctioning call buttons may also be poorly maintained in other ways, like faulty checking for turbine blade microfractures or landing gear behavior.”

I couldn’t agree more. Extrapolating to software: The small things are typically indicators of bigger issues and could cause catastrophes down the road.

2. Track Mean Time to Retro (MTTR)

It’s important to think about what you measure in your organization. You should be measuring how you’re improving, and the most important metric here is mean time to retro (MTTR). Everyone should be tracking it. It’s a great statistic for improving incident response on your team because it measures the delay between an incident and its retrospective.

The easiest way to have a bad incident retro is to wait two weeks. It’s better to get into a room quickly and hash out what happened than wait a long time until you’ve got everything perfectly prepared.

Tracking MTTR can help you hold prompt and consistent retrospectives after incidents. Set a timer and make an SLO or SLA for yourself that says, “This is how long we take for retros.”

Retro timing will vary depending on the severity of the incident itself. If it’s a SEV1, clear everyone’s schedules, because you need to hold the retro within 24 hours of the incident. For SEV3 incidents, you have much more leniency.
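
If you want a rough starting point, here’s a minimal sketch of tracking mean time to retro against per-severity targets. The record layout, timestamps and targets are illustrative assumptions, not any particular tool’s data model.

    from datetime import datetime, timedelta

    # Assumed per-severity retro targets (the "SLO for retros" described above).
    RETRO_TARGET = {"SEV1": timedelta(hours=24), "SEV3": timedelta(days=7)}

    # Example records: (severity, incident resolved, retro held). In practice,
    # pull these from wherever you declare incidents and schedule retros.
    incidents = [
        ("SEV1", datetime(2021, 8, 2, 14, 0), datetime(2021, 8, 3, 10, 0)),
        ("SEV3", datetime(2021, 8, 9, 9, 30), datetime(2021, 8, 20, 9, 30)),
    ]

    def mean_time_to_retro(records):
        """Average delay between an incident being resolved and its retrospective."""
        delays = [retro - resolved for _, resolved, retro in records]
        return sum(delays, timedelta()) / len(delays)

    def late_retros(records):
        """Incidents whose retro happened later than the per-severity target."""
        return [(sev, retro - resolved) for sev, resolved, retro in records
                if retro - resolved > RETRO_TARGET[sev]]

    print("Mean time to retro:", mean_time_to_retro(incidents))
    print("Retros past target:", late_retros(incidents))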

I also like tracking the ratio of retros to declared incidents. This is a metric that should go up over time. You can break that number down by severity, as well. If your retro ratio for SEV1 incidents is lower than for SEV3 incidents, that might be okay at first (remember, start small), but you want those ratios to eventually converge.
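
The retro ratio is easy to compute from the same kind of incident records; again, the record shape here is an assumption for illustration.

    from collections import Counter

    # Example records; "had_retro" would come from your incident tracker.
    declared = [
        {"severity": "SEV1", "had_retro": True},
        {"severity": "SEV3", "had_retro": True},
        {"severity": "SEV3", "had_retro": False},
    ]

    def retro_ratio(records):
        """Overall and per-severity ratio of retros held to incidents declared."""
        totals = Counter(r["severity"] for r in records)
        retroed = Counter(r["severity"] for r in records if r["had_retro"])
        overall = sum(retroed.values()) / len(records)
        per_severity = {sev: retroed[sev] / n for sev, n in totals.items()}
        return overall, per_severity

    overall, per_severity = retro_ratio(declared)
    print(f"Overall retro ratio: {overall:.0%}")  # should trend toward 100%
    print({sev: f"{ratio:.0%}" for sev, ratio in per_severity.items()})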

3. Alert on Degraded Experience with the Service, and Not Much Else

The severity of incidents is directly linked to customer pain. We would not declare SEV1s if there weren’t a lot of people feeling a lot of pain.

Alerting on computer vitals is an easy way to create alert fatigue and burnout. As your company starts to scale, you are going to use more CPU and more memory; tying alerts directly to those vitals just generates noise.

If I run, my heart will beat faster; it’s just doing its job. Paging people at 2:00 a.m. because disk capacity is at 80%, even though you won’t run out of space until next month, is a good way to lose great teammates. I have worked with people who left their previous companies strictly because they got paged too many times for stuff that didn’t matter.

This is why you need to alert on a degraded experience with the service and not much else. A CPU burning hot at 90% is not necessarily a bad thing; you need context to decide. Create SLOs that are tied to customer experience and alert on those. For the most part, people experiencing problems with the service is the only thing you should alert on.

One of the best ways I’ve seen to think about this came from SoundCloud’s developers, who explained that you should alert on symptoms, not causes. My fast heartbeat is not necessarily a problem. But if my elevated heart rate leads to lightheadedness and I fall, that’s a problem, and I need to be able to alert on something like that. Paging people at 2:00 a.m. because disk capacity is at 80% and you won’t run out until next month is not good. But paging people because you know that disk capacity problems cascade into other, systemic problems is. You can apply the same thought to other potential causes of an outage. Paging alerts that wake you up in the night should only be based on symptoms.
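
To make that concrete, here is a minimal sketch of symptom-based paging: the pager fires on a customer-facing SLO (error rate over a rolling window of requests), while machine vitals stay on dashboards. The threshold, window size and function names are illustrative assumptions.

    from collections import deque

    SLO_ERROR_RATE = 0.01  # assumed SLO: at most 1% of requests may fail
    WINDOW = 1000          # evaluate over the last 1,000 requests

    recent_failures = deque(maxlen=WINDOW)

    def record_request(succeeded: bool) -> None:
        """Track whether each request succeeded from the customer's point of view."""
        recent_failures.append(0 if succeeded else 1)

    def should_page() -> bool:
        """Page only on the symptom: a user-visible error rate breaching the SLO.
        CPU at 90% or a disk at 80% are causes; graph them, but don't wake anyone."""
        if len(recent_failures) < WINDOW:
            return False  # not enough traffic yet to judge
        return sum(recent_failures) / len(recent_failures) > SLO_ERROR_RATE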
