In my past experience as an SRE, I learned some valuable lessons about how to respond to and learn from incidents. If you want the TL;DR, I’ll summarize them here:
- Declare and run retros for the small incidents. It’s less stressful, and action items become much more actionable.
- Decrease the time it takes to analyze an incident. You’ll remember more, and you’ll learn more from it.
- Alert on pain felt by people – not machines. The only reason we declare incidents at all is because of the people on the other side of those machines.
Now, let’s dive into each of these lessons a little deeper, and explore how they can help you build a better system for pragmatic incident response.
1. Focus on the Small Incidents First
The bad habit of ignoring small issues often leads to bigger ones. You should run retrospectives for small incidents (slowdowns, minor bugs, etc.) because these often have the most actionable takeaways, instead of shooting for the moon and creating a “rearchitect async pipeline” Jira ticket that never happens. Focusing on low-stakes incidents and retrospectives is a great introduction to behavior change across your organization.
Let’s look at an obvious example of how a focus on small incidents can have an outsized impact. Case in point—my apartment. I live in an old candy factory retrofitted into apartment units. We have an elevator (thank God), but some of the buttons don’t light up when you press them. The LED display doesn’t match up with the numbers on the floor buttons. Yes, the elevator goes up and down, but overall, you can tell that things are wrong with it.
One day, I came home and noticed an “Out of Order” notice on the elevator door. Was I surprised? Not at all. Of course an elevator with mislabeled buttons and broken LEDs would stop working!
There’s an important software lesson here: ignoring small issues often leads to much bigger issues. What starts as an inconvenience can end in a complete breakdown. This is why you should focus on the small incidents first. I hear a lot of companies say, “We need to fix incident response! Every time we have an incident, it’s just chaos!”
They’re talking about unexpected downtime during a high-severity incident, and they want to improve how they respond to it.
I say that’s the wrong type of incident to use when fixing your incident response system and process. Focus on the small incidents first. Running retros for small incidents helps you build strong incident response models, because those incidents have the most actionable takeaways and they are the best way to change behavior through repetition and practice.
If you have a high-stakes, 12-hour incident and you run a one-hour retrospective, you’re not going to get the results you want. You need to start small. Run retros for bugs that were introduced, or for a bad data migration that didn’t really impact anything but took up a couple of hours of your day.
Heidi Waterhouse captured this idea really well in this piece on reliability. Every airplane you’ve ever flown on has many tiny problems, Waterhouse said, “… like a sticky luggage latch or a broken seat or a frayed seatbelt. None of these problems alone are cause to ground the plane. However, if enough small problems compound, the plane may no longer meet the requirements for passenger airworthiness and the airline will ground it. A plane with many malfunctioning call buttons may also be poorly maintained in other ways, like faulty checking for turbine blade microfractures or landing gear behavior.”
I couldn’t agree more. Extrapolating to software: The small things are typically indicators of bigger issues and could cause catastrophes down the road.
2. Track Mean Time to Retro (MTTR)
It’s important to think about what you measure in your organization. You should be measuring how you’re improving, and the most important metric here is mean time to retro (MTTR). Everyone should be tracking MTTR. It’s a great statistic for improving incident response on your team because it helps you understand the delay between an incident and its retrospective.
The easiest way to have a bad incident retro is to wait two weeks. It’s better to get into a room quickly and hash out what happened than wait a long time until you’ve got everything perfectly prepared.
Tracking MTTR can help you hold prompt and consistent retrospectives after incidents. Set a timer and make an SLO or SLA for yourself that says, “This is how long we take for retros.”
Retro timing will vary depending on the severity of the incident itself. If it’s a SEV1, clear everyone’s schedules, because you need to hold the retro within 24 hours of the incident. For SEV3 incidents, you have much more leniency.
I also like tracking the ratio of retros to declared incidents. This is a metric that should go up over time: you should see your ratio of retrospectives to incidents increasing. You can break that number down by severity, as well. If your retro ratio for SEV1s is lower than for SEV3s, that might be okay at first (remember, start small), but you want those numbers to eventually become equal.
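If you want to see what tracking these numbers might look like, here’s a minimal sketch in Python. The `Incident` record, its field names, and the SEV2/SEV3 deadlines are assumptions of mine (the 24-hour SEV1 deadline is the rule of thumb above); pull the equivalent fields from whatever incident tracker you use.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional

@dataclass
class Incident:
    severity: str                 # e.g. "SEV1", "SEV2", "SEV3"
    declared_at: datetime         # when the incident was declared
    retro_at: Optional[datetime]  # when the retro was held, or None if it hasn't been

def mean_time_to_retro(incidents):
    """Average delay, in hours, between declaring an incident and holding its retro."""
    delays = [
        (i.retro_at - i.declared_at).total_seconds() / 3600
        for i in incidents
        if i.retro_at is not None
    ]
    return mean(delays) if delays else None

def retro_ratio_by_severity(incidents):
    """Fraction of declared incidents that actually got a retro, broken down by severity."""
    declared, retroed = defaultdict(int), defaultdict(int)
    for i in incidents:
        declared[i.severity] += 1
        if i.retro_at is not None:
            retroed[i.severity] += 1
    return {sev: retroed[sev] / declared[sev] for sev in declared}

# Retro deadlines per severity, in hours. The 24-hour SEV1 deadline is the rule of
# thumb above; the SEV2 and SEV3 numbers are placeholders to tune for your team.
RETRO_DEADLINE_HOURS = {"SEV1": 24, "SEV2": 72, "SEV3": 168}

def overdue_retros(incidents, now):
    """Incidents whose retro is missing or was held after the per-severity deadline."""
    late = []
    for i in incidents:
        deadline = i.declared_at + timedelta(hours=RETRO_DEADLINE_HOURS[i.severity])
        missing = i.retro_at is None and now > deadline
        held_late = i.retro_at is not None and i.retro_at > deadline
        if missing or held_late:
            late.append(i)
    return late
```

Both numbers are easy to drop onto a dashboard; the goal is to watch MTTR trend down and the retro ratio trend up.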
3. Alert on Degraded Experience with the Service, and Not Much Else
The severity of incidents is directly linked to customer pain. We would not declare SEV1s if there weren’t a lot of people feeling a lot of pain.
Alerting on computer vitals is an easy way to create alert fatigue and burnout. As your company starts to scale, you are going to use more CPU; you’re going to use more memory. Tying alerts to computer vitals is not a good strategy.
If I run, my heart will beat faster—it’s just doing its job. Paging people at 2:00 a.m. because disk capacity is at 80% even though you won’t run out of space until next month is a good way to lose great teammates. I have worked with people who left their previous companies strictly because they got paged too many times for stuff that didn’t matter.
This is why you need to alert on a degraded experience with the service and not much else. A CPU burning hot at 90% is not necessarily a bad thing, but you need context to decide. Create SLOs that are tied to customer experience and alert on those. People experiencing problems with the service is the only thing you should alert on—for the most part.
One of the best ways I’ve seen to think about this came from SoundCloud developers, who explained that you should alert on symptoms, not causes. My fast heartbeat is not necessarily a problem. But if my elevated heart rate leads to lightheadedness and I fall—that’s a problem. So I need to be able to alert on something like that. Paging people at 2:00 a.m. because disk capacity is at 80% and you won’t run out until next month is not good. But paging people because you know that disk capacity problems cascade into other, systemic problems is. You can apply the same thinking to other potential causes of an outage. Paging alerts that wake you up in the night should be based only on symptoms.
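To make the symptom-versus-cause distinction concrete, here’s a rough sketch of that paging decision in Python. The 99.5% availability SLO, the function name, and the request counts are illustrative assumptions, not a prescription for any particular monitoring stack.

```python
# Sketch: page on symptoms (what users feel), not causes (machine vitals).
AVAILABILITY_SLO = 0.995  # illustrative target: 99.5% of requests succeed

def should_page(successful_requests: int, total_requests: int, cpu_utilization: float) -> bool:
    """Decide whether to wake someone up for this measurement window.

    cpu_utilization is accepted only to make the point: it never triggers a
    page on its own. A CPU at 90% while users are within SLO is just doing its job.
    """
    if total_requests == 0:
        return False  # nothing user-facing happened, so nobody gets paged
    availability = successful_requests / total_requests
    return availability < AVAILABILITY_SLO  # users are feeling pain: page

# 90% CPU, but 99.8% of requests succeeded -> no page.
print(should_page(successful_requests=99_800, total_requests=100_000, cpu_utilization=0.90))  # False

# 40% CPU, but only 99.0% of requests succeeded -> page.
print(should_page(successful_requests=99_000, total_requests=100_000, cpu_utilization=0.40))  # True
```

The same shape works for latency or any other SLO tied to what customers actually experience.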