“Our service should always be up.” Some myths just won’t die.
Engineering for reliability is well understood by engineering leaders, less so by bosses demanding unreasonable uptime with minimal resources and immense feature pressure. Business leaders tend to waffle between ignoring reliability with a hand wave and freaking out after an outage. How can they understand the reality of reliability and forget the myths?
Myth One: Our Service Should Always Be Up
Reality: Our Service is Engineered to Consistently Exceed Expectations
Some might imagine reliability as a light switch: it is either on or off. But reliability is about performing consistently, time after time, and managing the risk of minor, occasional, recoverable failures. Can we consistently exceed customer expectations?
Understanding customer expectations is challenging because you can’t simply ask people what they expect from you (unless you have a peculiar customer). Instead, we use observed behavior to define and quantify what happens when our service doesn’t work and how that hurts the customer experience. For example, “People abandon their shopping cart when our checkout experience takes more than 10 seconds to load, which means we lose money.”
We reflect business impact in our reliability goals and then engineer our system and processes to meet this goal. We might change how we do releases, ensure tests pass or avoid changes during times of peak usage. These engineering decisions use reliability as a business metric based on customer expectations.
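As a rough sketch of what turning that shopping-cart observation into a measurable goal could look like, the Python below computes the share of checkout loads that finish within the 10-second threshold and compares it with a goal. The function names, the sample load times and the 95% goal are all hypothetical, chosen only for illustration.

```python
def checkout_sli(load_times_seconds, threshold_seconds=10.0):
    """Return the fraction of checkout loads at or under the threshold."""
    if not load_times_seconds:
        raise ValueError("need at least one observation")
    good = sum(1 for t in load_times_seconds if t <= threshold_seconds)
    return good / len(load_times_seconds)


def meets_goal(load_times_seconds, goal=0.99, threshold_seconds=10.0):
    """Check the measured fraction against a (hypothetical) reliability goal."""
    return checkout_sli(load_times_seconds, threshold_seconds) >= goal


# Made-up sample: ten checkout loads, one of them slower than 10 seconds.
observed = [1.2, 3.4, 0.9, 12.5, 2.2, 4.8, 1.1, 2.0, 3.3, 1.7]
print(f"Checkouts within threshold: {checkout_sli(observed):.0%}")
print("Goal of 95% met:", meets_goal(observed, goal=0.95))
```

The point of the sketch is that the goal is expressed in terms of what customers experience (a slow checkout), not in terms of whether a server happens to be running.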
Myth Two: Innovation is More Valuable Than Reliability
Reality: You Need to Balance Innovation and Reliability Engineering Work
The constant drive for new features over reliability is the most frustrating myth. “We need to release new feature X, or we won’t have customers; we can worry about reliability later.” In reality, most customers care about reliability; they just don’t mention it until it’s a problem.
You can have the most fantastic whizbang feature in the world, but if it doesn’t work–and Murphy’s Law will make sure that it doesn’t work at the worst possible time–no one can use it, no one will be impressed and it will turn your technology into a laughingstock.
Innovation excites customers, but trust comes from reliability, which you must earn through hard work and clever engineering. Depending on your business context, you may require reliability more than ever. If you are facing headwinds, you may need to scale back your ambitions regarding innovative features. Still, you can’t scrimp on reliability or customers will justifiably leave.
Myth Three: Five Nines is Normal and Incremental From Four Nines
Reality: Five Nines is Expensive—10X the Cost of Four Nines
No one–not even massive cloud providers or telcos–can consistently deliver at 99.999% across all their services by accident. Reliability at that level–less than six minutes of downtime per year!–is an engineering marvel. A bridge or a dam might look simple after completion, but the engineering required to create a reliable physical infrastructure is immense, as everyone knows. Why is it so hard to understand the complexity, design, engineering and redundancy required to deliver a highly available and performant digital system?
Further, it’s easy to think that 99.999% is just a bit more than 99.99%. After all, it’s “just one more nine.” Remind your less-technical counterparts that, as a rule of thumb, each additional nine requires about ten times more effort!
Why is it so expensive to deliver? Because the failure tolerance (also known as an error budget) is one-tenth the size of the budget at four nines, while the risk of missing the goal increases exponentially. You’ll need more redundancy, careful testing and certification of releases, increased on-call rotations, extra hardware or cloud capacity and extensively tested backup plans to achieve this goal.
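To make the shrinking error budget concrete, here is a minimal Python sketch that converts a number of nines into allowed downtime per year. It assumes a 365-day year and treats every minute of the year as equally important; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines):
    """Allowed downtime per year for an availability target with this many nines.

    Two nines means 99%, three means 99.9%, and so on. Each extra nine
    divides the budget by ten.
    """
    return MINUTES_PER_YEAR / 10 ** nines


for nines in range(2, 6):
    availability = 100 - 100 / 10 ** nines
    budget = downtime_budget_minutes(nines)
    print(f"{availability:.3f}% available: {budget:8.2f} minutes of downtime per year")
```

At five nines the budget comes to about 5.26 minutes per year, which matches the "less than six minutes" figure, while four nines allows roughly 52.6 minutes.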
Worst of all, higher reliability will slow you down. You can’t innovate or deliver updates as fast when you need to ensure absolute uptime.
But what if there was a limit to how much reliability we need?
Myth Four: More Reliability is Always Good
Reality: Reliability Engineering Has Diminishing Returns
There is a point at which being “too reliable” is terrible for business. All that redundancy, testing and responding to tiny glitches is expensive, and most of your users won’t notice the difference. We must avoid the large blowups that put us in the headlines and manage expectations everywhere else. The significant outages that can impact thousands, if not millions, of customers come from a reductionist view of reliability as a by-product of conscientious work rather than as an engineering problem with well-defined tolerances and thresholds. You earn the trust of your customers by properly engineering reliability into your delivery process.
Consider tardiness at meetings. If you wanted to be 99% on time, you’d need to join a one-hour call within 36 seconds of its start. At 99.9%, you would need to enter a Zoom within 3.6 seconds of its start time, a timescale so small you don’t even notice it. You would have to do this for every meeting you attended, no excuses–bio breaks, last meeting ran long, someone at the door, etc. None of these things matter when defining and measuring reliability. This metaphor also provides a common sense way to think about risk and error budgets. The other attendees can’t possibly notice (or care) if you’re 3.6 seconds late, no matter how prestigious or impatient they are.
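The meeting arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function names are invented here, and "on time" is read the way the example reads it, as arriving within the stated fraction of the meeting's length.

```python
def lateness_budget_seconds(meeting_minutes, on_time_percent):
    """Seconds of lateness allowed for a meeting at a given on-time target."""
    return meeting_minutes * 60 * (100 - on_time_percent) / 100


def is_on_time(seconds_late, meeting_minutes, on_time_percent):
    """True if the arrival falls inside the lateness budget."""
    return seconds_late <= lateness_budget_seconds(meeting_minutes, on_time_percent)


for target in (99, 99.9):
    budget = lateness_budget_seconds(60, target)
    print(f"{target}% on time: up to {budget:.1f} seconds late")

# Arriving 20 seconds late is fine at 99% but a miss at 99.9%.
print(is_on_time(20, 60, 99))
print(is_on_time(20, 60, 99.9))
```

The same 20-second delay passes one target and fails the other, which is the whole argument: the threshold you choose decides what counts as a failure, so it should reflect what people actually notice.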
You could apply this same reasoning to catching a flight, picking up your kids from school, completing an exam, building a woodworking project or any human endeavor. The concept is so intuitive to daily life that even pointing it out seems absurd. But this is the fundamental concept from which reliability engineering stems. To build a reliable system, we must define acceptable failure boundaries. Otherwise, we will spend precious time and resources to eliminate the 3.6 seconds of delay that no one cares about and miss the more significant issues–like being present and engaged in the discussion.
Busting Reliability Myths
Understanding reliability is vital for engineers and business people alike. It all comes down to intentionally designing a customer experience, keeping up with expectations and, in some cases, even promises. Right-sizing reliability lets you find the perfect balance between delivering excellent service and efficiently running your organization.
Image Source: Indira Tjokorda via Unsplash