DevOps.com

Why More Incidents Are Better

By: Andre King on July 15, 2022

Ask most SREs how many incidents they’d have to respond to in a perfect world, and their answer would probably be ‘zero.’ After all, making software and infrastructure so reliable that incidents never occur is the dream that SREs are theoretically chasing.

Reducing the number of actual incidents as much as possible is a noble goal. However, it’s important to recognize that incidents aren’t an SRE’s number-one enemy. What matters more than the number of incidents you experience is how effectively you respond to each one.

Plus, there’s value in incidents. They are a learning opportunity. If your business never experienced them, it would arguably be facing more risk, not less.

We know: These ideas may sound a little counterintuitive. You might even accuse us of being “pro-incident”—which we sort of are. Allow us to explain.

The Silver Lining

In many respects, incidents are inherently bad. When one occurs, it means something broke. That’s bad. It may also mean that users were disrupted, operations were halted or money was lost. Those things are even worse.

On the other hand, incidents aren’t all bad. They actually benefit SRE teams for several reasons:

  • Learning opportunities: Incidents are opportunities to figure out what went wrong and prevent it from recurring. They can also help teams learn how to react more quickly or efficiently the next time something fails.
  • Getting ahead of bigger issues: Sometimes, working through one incident means you can avoid another that’s even worse. Perhaps one server fails, for example, and your response reveals that the failure was due to a larger issue that would eventually have caused a worse outage if left unaddressed. Thanks to the incident, you detect the larger issue before it triggers a more massive failure.
  • Reinforcing team culture: Nothing breeds camaraderie or a spirit of collaboration like working alongside other engineers to respond to a crisis in the middle of the night. Although being in this setting may not be anyone’s first choice, it often has a positive impact on your team’s culture and esprit de corps.
  • Demonstrating value: Assuming you handle them well, incidents are an opportunity for SREs to prove how valuable they are to the organization. If incidents never happened, it’s a safe bet that some bosses would start to wonder why they need SREs in the first place. (It would be a flawed train of thought, of course, because SREs would deserve credit for preventing incidents, but it’s a thought that may float around some C-level brains nonetheless.)

We could go on, but the point is clear: Although incidents cause problems in some respects, they actually create value in others.

Focus on Response, Not Avoidance

This is not to say that you should welcome incidents with open arms. Obviously, any decent SRE should focus first and foremost on being proactive and preventing incidents from happening whenever possible. They should use chaos engineering to identify problems that could be lurking unseen in production environments. They should leverage infrastructure-as-code (IaC) to minimize risks. And so on.

That said, what ultimately matters more than incident frequency is the effectiveness of incident response. It’s better to experience ten incidents that you resolve in under an hour each than one incident that takes mission-critical systems offline for a week.
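
To make that comparison concrete, here is a minimal sketch of the arithmetic. It is an illustration under stated assumptions, not a standard SRE formula: the helper name `total_downtime_hours` is hypothetical, each of the ten short incidents is assumed to take a full hour (the worst case of "under an hour"), and the long incident is assumed to last seven full days.

```python
def total_downtime_hours(incident_durations_hours):
    """Sum the downtime, in hours, across a list of incident durations."""
    return sum(incident_durations_hours)


# Scenario A (assumed worst case): ten incidents, one hour each.
many_short = [1.0] * 10

# Scenario B (assumed): one incident that keeps systems offline for 7 days.
one_long = [7 * 24.0]

short_total = total_downtime_hours(many_short)
long_total = total_downtime_hours(one_long)

print(f"Ten one-hour incidents: {short_total:.0f} hours of downtime")
print(f"One week-long incident: {long_total:.0f} hours of downtime")
print(f"Ratio: {long_total / short_total:.1f}x")
```

Under these assumptions, the single week-long outage accounts for 168 hours of downtime against 10 hours for the ten short incidents combined, even though it counts as only one incident.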

So, in addition to investing in tools and processes that mitigate the risk of incidents, SRE teams should place equal emphasis on ensuring that they can react quickly and effectively when an incident happens. This means being able to share information efficiently, define clear roles and know what to prioritize when working through complex incidents. It also means having clear plans in place that spell out how you’ll handle a problem as soon as you detect it. Without these abilities, you’re at risk of letting incidents that should be small turn into major outages.

‘Zero Incidents’ Is Not Realistic

It’s important to recognize, too, that while it can be fun to imagine a world where zero incidents occur, the reality is that such a world will never exist. If it could, we wouldn’t see new records set each year for the number of security incidents that businesses collectively suffer, for example.

Nor would we see headlines about major outages at huge enterprises like Facebook or AWS on a recurring basis. If those companies, which have world-class reliability teams and virtually endless resources at their disposal, can’t reduce incidents to zero, neither can anyone else.

Conclusion

The bottom line: There is no such thing as total incident prevention, no matter how hard you try. And even if there were, that wouldn’t actually be a good thing, for the reasons explained above.

So, by all means, undertake reasonable proactive efforts to prevent as many incidents as you can from happening. But don’t let investment in prevention cause under-investment in response. Being prepared to handle incidents when they happen—which they inevitably will—is what matters most.

Filed Under: Application Performance Management/Monitoring, Blogs, DevOps Culture, DevOps Practice, Editorial Calendar, SRE Tagged With: application performance management, incident response, site reliability engineering, SRE

© 2023 · Techstrong Group, Inc. All rights reserved.