Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be ‘zero.’ After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing.
Reducing the number of actual incidents as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number-one enemy. What matters more than the number of incidents you experience is how effectively you respond to each one.
Plus, there’s value in incidents. They are a learning opportunity. If your business never experienced them, it would arguably be facing more risk, not less.
We know: These ideas may sound a little counterintuitive. You might even accuse us of being “pro-incident”—which we sort of are. Allow us to explain.
The Silver Lining
In many respects, incidents are inherently bad. When one occurs, it means something broke. That’s bad. It may also mean that users were disrupted, operations halted or money was lost. Those things are even worse.
On the other hand, incidents aren’t all bad. They actually benefit SRE teams for several reasons:
- Learning opportunities: Incidents are opportunities to figure out what went wrong and prevent it from recurring. They can also help teams learn how to react more quickly or efficiently the next time something fails.
- Getting ahead of bigger issues: Sometimes, working through one incident means you can avoid another that’s even worse. Perhaps one server fails, for example, and your response reveals that the failure was due to a larger issue that would eventually have caused a worse outage if left unaddressed. Thanks to the incident, you detect the larger issue before it triggers a more massive failure.
- Reinforcing team culture: Nothing breeds camaraderie or a spirit of collaboration like working alongside other engineers to respond to a crisis in the middle of the night. Although being in this setting may not be anyone’s first choice, it does often have a positive impact on your team’s culture and esprit de corps.
- Demonstrating value: Assuming you handle them well, incidents are an opportunity for SREs to prove how valuable they are to the organization. If incidents never happened, it’s a safe bet that some bosses would start to wonder why they need SREs in the first place. (It would be a flawed train of thought, of course, because SREs would deserve credit for preventing incidents, but it’s a thought that may float around some C-level brains nonetheless.)
We could go on, but the point is clear: Although incidents cause problems in some respects, they actually create value in others.
Focus on Response, Not Avoidance
This is not to say that you should welcome incidents with open arms. Obviously, any decent SRE should focus first and foremost on being proactive and preventing incidents from happening whenever possible. They should use chaos engineering to identify problems that could be lurking unseen in production environments. They should leverage infrastructure as code (IaC) to minimize risks. And so on.
That said, what ultimately matters more than incident frequency is the effectiveness of incident response. It’s better to experience ten incidents that you resolve in under an hour each than one incident that takes mission-critical systems offline for a week.
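To put rough numbers on that comparison, here is a minimal sketch in Python. It assumes each of the ten short incidents lasts a full hour (the upper bound stated above) and that total downtime is the only cost being compared. The variable and function names are purely illustrative.

```python
from datetime import timedelta


def total_downtime(durations):
    """Add up a list of incident durations."""
    return sum(durations, timedelta())


# Assumption: every "under an hour" incident takes the full hour (worst case).
frequent_but_short = [timedelta(hours=1)] * 10

# One incident that keeps mission-critical systems offline for a week.
rare_but_long = [timedelta(weeks=1)]

short_total = total_downtime(frequent_but_short)
long_total = total_downtime(rare_but_long)

# Dividing one timedelta by another yields a plain float.
ratio = long_total / short_total

print(f"Ten short incidents: {short_total.total_seconds() / 3600:.0f} hours")
print(f"One long incident:   {long_total.total_seconds() / 3600:.0f} hours")
print(f"The single long incident costs {ratio:.1f}x more downtime")
```

Even in the worst case for the ten short incidents, the single week-long outage accounts for far more downtime. A real comparison would also weigh factors this sketch ignores, such as which systems were affected and when.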
So, in addition to investing in tools and processes that mitigate the risk of incidents, SRE teams should place equal emphasis on ensuring that they can react quickly and effectively when an incident happens. This means being able to share information efficiently, defining clear roles, knowing what to prioritize when working through complex incidents and having clear plans in place that spell out how you’ll handle a problem as soon as you detect it. Without these abilities, you risk letting incidents that should be small turn into major outages.
‘Zero Incidents’ Is Not Realistic
It’s important to recognize, too, that while it can be fun to imagine a world where zero incidents occur, the reality is that such a world will never exist. If it could, we wouldn’t keep seeing new records set each year for the number of security incidents that businesses collectively suffer, for example.
Nor would we see headlines about major outages at huge enterprises like Facebook or AWS on a recurring basis. If those companies, which have world-class reliability teams and virtually endless resources at their disposal, can’t reduce incidents to zero, neither can anyone else.
Conclusion
The bottom line: There is no such thing as total incident prevention, no matter how hard you try. And even if there were, that wouldn’t actually be a good thing, for the reasons explained above.
So, by all means, undertake reasonable proactive efforts to prevent as many incidents as you can from happening. But don’t let investment in prevention cause under-investment in response. Being prepared to handle incidents when they happen—which they inevitably will—is what matters most.