DevOps.com

Debunking Myths About Reliability

By: Kit Merker on March 31, 2023

“Our service should always be up.” Some myths just won’t die.

Engineering leaders understand reliability engineering well. The bosses who demand unreasonable uptime while offering minimal resources and applying immense feature pressure understand it far less. Business leaders tend to waffle between ignoring reliability with a hand wave and freaking out after an outage. How can they understand the reality of reliability and forget the myths?

Myth One: Our Service Should Always be Up

Reality: Our Service is Engineered to Consistently Exceed Expectations

Some might imagine reliability like a light switch—it is either on or off. But reliability is about consistent repetitions and managing the risk of minor, occasional, recoverable failures. Can we consistently exceed customer expectations?

Understanding customer expectations is challenging because you usually can’t just ask people what they expect from you (unless you have a peculiar customer). So instead, we use observed behavior to define and quantify what happens when our service doesn’t work and how that degrades the customer experience. For example, “People abandon their shopping cart when our checkout experience takes more than 10 seconds to load, which means we lose money.”

We reflect business impact in our reliability goals and then engineer our system and processes to meet this goal. We might change how we do releases, ensure tests pass or avoid changes during times of peak usage. These engineering decisions use reliability as a business metric based on customer expectations.
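
As a rough illustration of turning the checkout observation into a measurable goal, the Python sketch below computes the share of checkout loads that finish within 10 seconds and compares it with a target. The function names, the sample timings and the 95% target are assumptions made up for this example; only the 10-second threshold comes from the text above.

```python
from typing import Sequence


def fast_checkout_ratio(load_times_s: Sequence[float], threshold_s: float = 10.0) -> float:
    """Return the fraction of checkout loads that finished within the threshold."""
    if not load_times_s:
        raise ValueError("need at least one measurement")
    fast = sum(1 for t in load_times_s if t <= threshold_s)
    return fast / len(load_times_s)


def meets_goal(load_times_s: Sequence[float], target_ratio: float,
               threshold_s: float = 10.0) -> bool:
    """Return True when enough checkout loads were fast enough."""
    return fast_checkout_ratio(load_times_s, threshold_s) >= target_ratio


# Hypothetical checkout load times in seconds (made-up data, not real measurements).
samples = [2.1, 3.4, 11.8, 4.0, 9.9, 1.2, 15.3, 5.5, 6.7, 2.8]

# Assumed goal: 95% of checkouts load within 10 seconds.
print(f"fast checkouts: {fast_checkout_ratio(samples):.0%}")
print(f"goal met: {meets_goal(samples, target_ratio=0.95)}")
```

In practice the target would come from the business impact described above, such as how much abandoned-cart revenue the organization is willing to risk.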

Myth Two: Innovation is More Valuable Than Reliability

Reality: You Need to Balance Innovation and Reliability Engineering Work

The constant drive for new features over reliability is the most frustrating myth. “We need to release new feature X, or we won’t have customers; we can worry about reliability later.” In reality, most customers care about reliability; they just don’t mention it until it’s a problem.

You can have the most fantastic whizbang feature in the world, but if it doesn’t work–and Murphy’s Law will make sure that it doesn’t work at the worst possible time–no one can use it, no one will be impressed and it will turn your technology into a laughingstock.

Innovation excites customers, but trust comes from reliability, which you must earn through hard work and clever engineering. Depending on your business context, you may require reliability more than ever. If you are facing headwinds, you may need to scale back your ambitions regarding innovative features. Still, you can’t scrimp on reliability or customers will justifiably leave.

Myth Three: Five Nines is Normal and Incremental From Four Nines

Reality: Five Nines is Expensive—10X the Cost of Four Nines

No one, not even massive cloud providers or telcos, can consistently deliver 99.999% across all their services by accident. Reliability at that level (less than six minutes of downtime per year!) is an engineering marvel. A bridge or a dam might look simple after completion, but the engineering required to create a reliable physical infrastructure is immense, as everyone knows. Why is it so hard to understand the complexity, design, engineering and redundancy required to deliver a highly available and performant digital system?

Further, it’s easy to think that 99.999% is just a bit more than 99.99%. After all, it’s “just one more nine.” Remind your less-technical counterparts that each additional nine leaves one-tenth the room for failure and requires roughly ten times more effort!

Why is it so expensive to deliver? Because the failure tolerance (also known as an error budget) is one-tenth the size, so the risk of missing the goal rises sharply. You’ll need more redundancy, careful testing and certification of releases, increased on-call rotations, extra hardware or cloud capacity and extensively tested backup plans to achieve this goal.
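
To make the “one more nine” arithmetic concrete, here is a minimal Python sketch that converts an availability target into a yearly downtime budget. The function name is made up for this example, and the calculation assumes a 365-day year and treats all downtime as counting against the budget.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime allowed by an availability target, in minutes."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


# Compare three, four and five nines.
for target in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(target)
    print(f"{target}% allows {budget:.2f} minutes of downtime per year")
```

Under these assumptions, four nines leaves roughly 52.6 minutes a year and five nines roughly 5.3 minutes, a tenfold cut in the room for failure.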

Worst of all, higher reliability will slow you down. You can’t innovate or deliver updates as fast when you need to ensure absolute uptime.

But what if there was a limit to how much reliability we need?

Myth Four: More Reliability is Always Good

Reality: Reliability Engineering Has Diminishing Returns

There is a point at which being “too reliable” is terrible for business. It’s expensive to build all that redundancy, run all that testing, respond to tiny glitches and all the rest. And most of your users won’t notice. We must avoid the large blowups that put us in the headlines and manage expectations everywhere else. The significant outages that can impact thousands, if not millions, of customers come from a reductionist view that treats reliability as a by-product of conscientious work rather than an engineering problem with well-defined tolerances and thresholds. You earn the trust of your customers by properly engineering reliability into your delivery process.

Consider tardiness at meetings. If you wanted to be 99% on time, you’d need to join a one-hour call within 36 seconds of its start. At 99.9%, you would need to enter a Zoom within 3.6 seconds of its start time, a timescale so small you don’t even notice it. You would have to do this for every meeting you attended, with no excuses: bio breaks, the last meeting running long, someone at the door and so on. None of these things matter when defining and measuring reliability. This metaphor also provides a common-sense way to think about risk and error budgets. Your fellow attendees can’t possibly notice (or care) if you’re 3.6 seconds late, no matter how prestigious or impatient they are.

You could apply this same reasoning to catching a flight, picking up your kids from school, completing an exam, building a woodworking project or any human endeavor. The concept is so intuitive to daily life that even pointing it out seems absurd. But this is the fundamental concept from which reliability engineering stems. To build a reliable system, we must define acceptable failure boundaries. Otherwise, we will spend precious time and resources to eliminate the 3.6 seconds of delay that no one cares about and miss the more significant issues–like being present and engaged in the discussion.
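
The meeting metaphor can be sketched as a tiny error-budget calculation. In the hedged Python example below, the function names and the list of delays are hypothetical; the one-hour meeting and the 99% and 99.9% targets come from the text above.

```python
def lateness_budget_seconds(meeting_seconds: float, on_time_percent: float) -> float:
    """Return the seconds of lateness tolerated for one meeting at a given target."""
    return meeting_seconds * (100 - on_time_percent) / 100


def budget_left(budget_seconds: float, delays_seconds: list[float]) -> float:
    """Return the budget remaining after delays; a negative value means the goal was missed."""
    return budget_seconds - sum(delays_seconds)


ONE_HOUR = 60 * 60  # meeting length in seconds

budget_99 = lateness_budget_seconds(ONE_HOUR, 99)     # 36 seconds
budget_999 = lateness_budget_seconds(ONE_HOUR, 99.9)  # about 3.6 seconds

# Hypothetical delays before joining: someone at the door, last meeting ran long.
delays = [20.0, 10.0]

print(f"99% target, budget left:   {budget_left(budget_99, delays):.1f} s")
print(f"99.9% target, budget left: {budget_left(budget_999, delays):.1f} s")
```

With these made-up delays, the 99% target still has room to spare while the 99.9% target is blown, even though nobody in the meeting would notice the difference.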

Busting Reliability Myths

Understanding reliability is vital for engineers and business people alike. It all comes down to intentionally designing a customer experience, keeping up with expectations and, in some cases, even promises. Right-sizing reliability lets you find the perfect balance between delivering excellent service and efficiently running your organization.

Image Source: Indira Tjokorda via Unsplash 

© 2023 · Techstrong Group, Inc. All rights reserved.