DevOps.com

How SRE Creates a Blameless Culture

By: Ashar Rizqi on February 27, 2019

People in tech love to praise failure as essential to innovation. But at far too many organizations, when failure happens, it is punished. When an IT system goes down, an individual or a team is immediately ID’d, blamed and shamed. This is counterproductive. Blame damages companies by creating a play-it-safe atmosphere and stifling innovation; even worse, when mistakes are made, they are often hidden.

Fortunately, more and more organizations are recognizing the damage that blame can do and embracing a blameless culture. This is a big shift, yes, but companies can make the move to blameless—and make it stick—if they bring site reliability engineering (SRE) best practices to their technology teams.

Why Blame is Bad

I’ve worked at several fast-growth companies that suffered under a culture of blame. At one, we were scaling quite rapidly. Change was constantly being introduced to the infrastructure, which meant that it broke all the time.

Because we had customers running their core business on our systems, our reliability requirements were very high—so high that for a while we stopped feature development altogether, in fear that new features would bring our systems down.

This was difficult enough, but the culture of blame made it worse. There was no psychological safety net and fingers were pointed constantly. There was an us-versus-them mentality—engineers versus non-engineers, developers versus operations, developers versus developers. Developers dedicated way too much mindshare to protecting themselves and their domain.

In a blame-rich environment, developers who should be delivering concrete results spend too much of their time in meetings trying to justify their work. There is a culture of retribution for mistakes. This leaves little room to learn and grow from failure, because developers who make mistakes are harassed by management or fired.

Your best people—those empowered to make decisions—see their responsibility undermined and go elsewhere. Creativity and innovation slow to a crawl. Meanwhile, competitors run ahead.

Why Blameless is Better

In a blameless culture, everyone feels safe and no one is afraid to make mistakes. It’s a psychologically secure environment, where true DevOps can happen. Developers feel confident enough to express their ideas and take chances. Development folks and operations folks collaborate well, and everyone is aligned on the problem.

The work environment is positive and workers are assumed to be doing their best. When mistakes happen, people aren’t blamed. Instead, any error is viewed as a manifestation of an underlying vulnerability in the systems. Attention is focused on fixing those vulnerabilities and, as a result, systems are constantly improving.

How to Get to Blameless

Let’s say you just had an outage and your database crashed. Don’t start by placing blame on a team or individual. Rather, start by conducting a postmortem. Ask questions and find the answers: Why did the database crash? Because Alex pressed the wrong button. Well, why? Because we didn’t have a system in place to enable automatic checking or a review process. So why did we not have that?

Now you’re getting to blameless and you’re reaping the benefits. You’re having a productive conversation and you’re focused on fixing your systems, not affixing blame. It’s not about Alex pressing the wrong button; it’s about how your organization is not set up to ensure success. It’s about this vital opportunity to learn from a mistake and get better. Don’t fire Alex. Put a solution in place so the problem doesn’t happen again.
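The chain of questions above can be written down as a simple record. What follows is a minimal sketch in Python, not a prescribed format: the `Postmortem` class and its field names are hypothetical and do not come from any particular tool. It stores each answer to "why?" in order and treats the deepest answer as the gap in the system to fix, rather than stopping at the person who pressed the button.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record of a blameless postmortem (illustrative only)."""
    incident: str
    whys: list[str] = field(default_factory=list)          # answers, in the order asked
    action_items: list[str] = field(default_factory=list)  # fixes to systems, not people

    def ask_why(self, answer: str) -> None:
        """Record the answer to the next 'why?' in the chain."""
        self.whys.append(answer)

    def systemic_gap(self) -> str:
        """Return the deepest answer recorded so far, the one to act on."""
        if not self.whys:
            raise ValueError("no 'why' has been answered yet")
        return self.whys[-1]


pm = Postmortem(incident="Database crashed")
pm.ask_why("The wrong button was pressed")
pm.ask_why("No automatic check or review process was in place")
pm.action_items.append("Add a review step before changes reach the database")

print(pm.systemic_gap())
```

Note that the record describes what happened and what was missing, and the action item targets the process. Nothing in it needs to name an individual.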

How SRE Can Help

To get to a truly blameless environment, you need to implement SRE best practices. Why? Because SRE creates ultra-scalable and highly reliable software systems, by building in automation and tooling—moving the human element out of the picture and focusing on the systems.

One of the founding principles of the SRE movement is blamelessness. By implementing an SRE team and establishing SRE practices, your organization can move from the traditional operational model of break-fix to an environment that applies a development approach to IT operations, with the goal of improving reliability via automation, continuous integration and delivery.

For this to happen, you will need an SRE champion to quarterback your SRE effort. You will also need organizational buy-in to blamelessness. It has to be practiced top-down and have the full support of all leaders in the organization.

Tooling is critical, too. Look for tools and dashboards that speed up postmortems and surface the action items that need attention, that identify the most commonly impacted services, products and customers, and that show which contributing factors most often lead to incidents. The right tools will help you resolve incidents faster, by providing a clear overview of metrics related to incident identification and the actions that need to be taken.
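As a rough illustration of the kind of summary such a dashboard might compute, here is a minimal Python sketch. The `Incident` fields, the helper names and the sample data are all invented for this example and are not taken from any real tool. It counts which services are impacted most often and which contributing factors keep recurring.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    services: list[str]              # services impacted by the incident
    contributing_factors: list[str]  # systemic factors found in the postmortem


def top_services(incidents: list[Incident], n: int = 3) -> list[tuple[str, int]]:
    """Most commonly impacted services, with the number of incidents each appears in."""
    counts = Counter(s for inc in incidents for s in inc.services)
    return counts.most_common(n)


def top_factors(incidents: list[Incident], n: int = 3) -> list[tuple[str, int]]:
    """Contributing factors that recur most often across incidents."""
    counts = Counter(f for inc in incidents for f in inc.contributing_factors)
    return counts.most_common(n)


# Made-up sample data, for illustration only.
incidents = [
    Incident(["database", "checkout"], ["no review process"]),
    Incident(["database"], ["no review process", "missing alert"]),
    Incident(["search"], ["missing alert"]),
    Incident(["database"], ["no review process"]),
]

print(top_services(incidents, 1))
print(top_factors(incidents, 1))
```

Counting factors rather than people keeps the summary aligned with the blameless approach: the output points at the review process and the alerting, not at whoever was on call.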

You want to move fast and be innovative. But can you do it reliably? With a blameless culture, you can. You’ll have a safe environment that attracts and retains the best talent. You’ll move quicker, because you’ll be focused on fixing systems that break and finding out why—not the blame game. By delivering a higher degree of reliability, you’ll build greater trust with your customers.

You’ve heard the saying: Failure is not an option. In a blameless environment, failure is always an option—because it means that systems are always improving and innovation is always happening.

— Ashar Rizqi

Filed Under: Blogs, DevOps Culture, DevOps Practice Tagged With: automation, devops, site reliability engineering, SRE

Powered by Techstrong Group, Inc.

© 2023 · Techstrong Group, Inc. All rights reserved.