In today’s always-on, ever-connected world, we all expect 100% availability.
What gets in the way of this? The devil is in the details. Over time, everything breaks: failing disks, nodes, containers, networks and DNS servers, along with configuration mistakes, can all lead to major outages.
Amazon’s 13 minutes of downtime in August 2021 translated to almost $5 million in lost revenue, and Amazon Web Services’ three outages in December 2021 cost millions more. The October 2021 Facebook outage, which also took down Messenger, Instagram and WhatsApp, was widely reported to have cost an estimated $100 million in lost revenue and was traced to a network configuration error.
There have been many outages of varying scope and duration; some are attributed to technology breaking, others traced back to people making mistakes. In fact, a 2017 Amazon outage was attributed to “human error,” or more specifically, one employee’s typo. The employee intended to take a small number of servers offline while debugging a billing system issue, but the mistyped command removed far more servers than intended and caused a major outage.
So, what can your company do to reduce the risk, duration and impact of a potential outage? For one, companies can’t pretend that humans will stop making mistakes if they simply try harder. Do you have dozens of people manually keying in hundreds of commands every day? If so, a mistake is inevitable. Instead of pretending otherwise, companies should investigate how and why one small blunder on a command line can do widespread damage, and put guardrails and redundancies in place to protect against and minimize these types of incidents.
As the enterprise continues its digital transformation toward a multi-cloud, hybrid and distributed world, here are five guidelines to consider—guidelines that can help you preempt and prevent your next big outage.
1. Always Consider Your Blast Radius
Blast radius measures how one error, typo or malfunction can impact other parts of a system.
What is the chain of components or services that is impacted by one error?
What are the hidden dependencies?
How many customers would be affected and for how long?
More is connected than you think; discover how it’s all related in advance to eliminate surprises.
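If you already maintain even a rough service dependency map, answering these questions can be mechanical rather than guesswork. Below is a minimal sketch in Python, using a hand-maintained map with hypothetical service names, that walks reverse dependencies to list everything a single failure would reach:

```python
from collections import deque

# Hypothetical, hand-maintained map: each service lists what it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth", "billing-db"],
    "inventory": ["inventory-db"],
    "auth": ["user-db"],
}

def blast_radius(failed_service: str) -> set[str]:
    """Return every service that transitively depends on the failed one."""
    # Invert the map so we can walk from a failure to everything it impacts.
    dependents: dict[str, list[str]] = {}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    impacted: set[str] = set()
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, []):
            if svc not in impacted:
                impacted.add(svc)
                queue.append(svc)
    return impacted

# A billing-db failure reaches payments and, through it, checkout.
print(blast_radius("billing-db"))  # {'payments', 'checkout'}
```

Even a toy model like this surfaces hidden dependencies quickly; the hard part is keeping the map honest, which is why many teams generate it from service discovery or tracing data rather than by hand.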
If too many systems in a network are mutually dependent and interconnected, the result is a cascading failure: when one breaks, the others fail too. We have seen this in past Facebook outages, where a single erroneous command sparked a domino effect that shut down the backbone connecting all of Facebook’s data centers globally.
How can you prevent this scenario? Plan for the worst possible case. When it comes to mission-critical systems, take an intentional approach to failover and backup: What if A, B and C all happen simultaneously? Also, think about your cost-versus-reliability trade-offs. Maybe you can’t afford two full sets of servers, but you can afford a full set of archived code.
2. Smooth Out Rough Edges
Consider that core infrastructure and code get used (and tested) far more than edge code and infrastructure. In short, the core is hardened while the edge is more vulnerable. To counter this risk, be more proactive at the edge: identify vulnerabilities and do explicit testing there.
The 2014 Google disk-erase/satellite outage is a cautionary tale involving the edge. In Google’s case, the edge network offered a key advantage: a better, lower-latency connection for the customer, because data is served from a source closer to the user rather than from the core network. Google ran a CDN/edge network for TCP connection termination, caching for various apps, DNS resolution and more. Teams routinely needed to upgrade a satellite rack and run decommission automation, and one day that automation failed partway through with an error. To debug the failure, they re-ran the decommission process manually. That manual re-execution decommissioned thousands of satellite machines and, within minutes, the disks of all satellite machines globally were erased. The result: Customers experienced increased latency, and behind the scenes it took two days of all-hands-on-deck work to reinstall the machines.
3. Think Escalator, Not Elevator
A good system has defense in depth. It usually takes three or four things going wrong together to cause a truly serious failure, so think about failure cascades and, bluntly put, limit dependencies. Measure the longest dependency failure path and mean time to recovery (MTTR). When you do this, think escalator, not elevator. What do I mean? When an elevator fails, you are left with a shaft; an escalator that fails still works as a staircase.
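As an illustrative sketch (the dependency map and MTTR figures below are hypothetical), the longest dependency failure path, and the worst-case serial recovery time it implies, can be estimated from the same kind of dependency map:

```python
from functools import lru_cache

# Hypothetical dependency map and per-service mean time to recovery (minutes).
DEPENDS_ON = {
    "frontend": ["api"],
    "api": ["auth", "orders-db"],
    "auth": ["user-db"],
}
MTTR_MINUTES = {"frontend": 5, "api": 10, "auth": 15, "orders-db": 30, "user-db": 20}

@lru_cache(maxsize=None)
def longest_failure_path(service: str) -> tuple[int, int]:
    """Return (depth, worst-case serial recovery minutes) starting at `service`."""
    deps = DEPENDS_ON.get(service, [])
    if not deps:
        return 1, MTTR_MINUTES[service]
    # Pick the deepest downstream chain; recovery may have to proceed
    # dependency-first, one layer at a time.
    depth, minutes = max(longest_failure_path(dep) for dep in deps)
    return depth + 1, minutes + MTTR_MINUTES[service]

print(longest_failure_path("frontend"))  # (4, 50): frontend -> api -> auth -> user-db
```

If that worst-case number is unacceptable, the answer is usually to shorten the chain or make a layer degrade gracefully, the escalator rather than the elevator.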
In 2016, severe weather in Sydney, Australia resulted in an AWS EC2/EBS outage. First, the utility provider suffered a loss of power at a regional substation, which caused a total loss of utility power to multiple AWS facilities. In one of those facilities, AWS’s power redundancy didn’t work as designed and power was lost to a significant number of instances in that availability zone. To make matters worse, the affected instances lost both their primary and secondary power because several power delivery lineups failed to transfer load to their generators. A latent bug in AWS instance management software then made recovery of the remaining instances slower than expected.
The lesson? This correlated power delivery lineup failure illustrated that failures are not independent. You must design for the largest correlated failure in combination with the background noise of ongoing independent failure. Recovery code needs to be scale-tested against the largest expected failure.
4. Stay Ahead of Digital Certificate Rotations
Don’t underestimate the importance of digital certificates and their expiration dates. A few years ago, LinkedIn allowed one of its SSL certificates to expire, which knocked out LinkedIn sites in the U.S., UK and Canada. In 2017, Equifax missed a breach for 76 days because of an expired certificate. One industry study puts the average annual cost of unplanned outages caused by certificate expirations at $11.1 million.
For example, you could have a load balancer sitting in front of several servers. The load balancer could have an up-to-date certificate while the certificates on the servers behind it are about to expire. It is important to check for expiring certificates in multiple ways. Ideally, you check in two ways: probe the live HTTPS endpoints and also run scripts that inspect the certificate files themselves.
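Here is a minimal sketch of both checks, assuming placeholder hostnames and file paths and the third-party `cryptography` package for parsing the on-disk certificate:

```python
import datetime
import socket
import ssl

from cryptography import x509  # third-party: pip install cryptography

WARN_DAYS = 30

def days_left_endpoint(host: str, port: int = 443) -> int:
    """Probe a live HTTPS endpoint and return days until the certificate it serves expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    expires = datetime.datetime.utcfromtimestamp(ssl.cert_time_to_seconds(not_after))
    return (expires - datetime.datetime.utcnow()).days

def days_left_file(path: str) -> int:
    """Inspect a PEM certificate file on disk, regardless of what the load balancer serves."""
    with open(path, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    return (cert.not_valid_after - datetime.datetime.utcnow()).days

# Placeholder host and path: run both checks so an up-to-date certificate on the
# load balancer can't hide an expiring certificate on a backend server.
checks = {
    "edge endpoint": days_left_endpoint("example.com"),
    "backend cert file": days_left_file("/etc/ssl/backend/server.pem"),
}
for name, days in checks.items():
    print(f"{name}: {days} days left ({'OK' if days > WARN_DAYS else 'RENEW SOON'})")
```

A scheduled job can run checks like these daily and alert well before the renewal window closes.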
Expired certificates (TLS, for example) can lead to encryption and authentication risks as well as lost connectivity and larger outages involving websites and applications. To avoid this, you should continuously check, renew and deploy certificates, and this is best achieved via automation. Think about automatically rotating certificates before they expire.
5. Tackle Toil to Empower Your Teams
What’s the biggest headache for site reliability engineers (SREs) and those who manage production operations? Toil.
We can define toil as manual, tedious tasks that engineers perform within production environments. Basically, toil wastes time and slows down operations. As a result, most SREs and engineering teams strive to minimize toil within their workflows.
In production operations, you should work to limit toil. Make sure your team is regularly investing time in proactive work; the reactive work shouldn’t take most of their time. Toil reduction gives developers, DevOps engineers and SREs more time to focus on projects that create net-new value for the business (new application features). Google stated that it has an “advertised goal” of keeping toil “below 50% of each SRE’s time.” Basically, no engineer should spend more than half of their time on manual, repetitive work. But the ultimate goal should be to get toil as close to zero as possible. Automating incident remediation can help SREs spend less time on toil and less time on-call.
Also, when you implement fixes, especially people and process fixes, ask yourself whether each one is a systemic fix or one that relies on best intentions. Look for tools that safely empower your team (with guardrails). As you identify major vulnerabilities and dependencies (as discussed above), automate away the most common issues while planning for the “corner cases” (the 1% of 1%) that may lead to major outages, a.k.a. the worst-case scenarios.
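As a simple illustration of such a guardrail (the threshold, names and helper below are hypothetical), a destructive bulk operation can be made to fail loudly when its scope looks abnormally large, instead of relying on an operator to notice:

```python
class BlastRadiusExceeded(Exception):
    """Raised when an operation would touch more targets than policy allows."""

def decommission(host: str) -> None:
    print(f"decommissioning {host}")  # stand-in for the real, destructive operation

def guarded_decommission(targets: list[str], fleet_size: int,
                         max_fraction: float = 0.05, override: bool = False) -> None:
    """Decommission hosts, refusing abnormally large batches unless explicitly overridden."""
    if not override and len(targets) > fleet_size * max_fraction:
        raise BlastRadiusExceeded(
            f"Refusing to decommission {len(targets)} of {fleet_size} hosts "
            f"(limit {max_fraction:.0%}); re-run with an explicit override if this is intended."
        )
    for host in targets:
        decommission(host)

# A typo or a bad filter that selects most of the fleet now fails loudly
# instead of quietly taking everything down.
guarded_decommission(["edge-01", "edge-02"], fleet_size=1000)
```

The point is not this particular check; it is that the safe behavior is enforced by the tool rather than by the operator’s memory.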
As you consider these guidelines, make sure you focus on putting actual mechanisms in place—not one-off, “hopeful” directives. Do not rely on people to remember best practices from a training session—these are not mechanisms. Instead, design processes that can be systematically implemented. When you do a post-mortem, look for concrete programs you can put into place that will prevent the same mistake from happening again. Automation can help you achieve this level of confidence, management and quality control—and it may just keep you one step ahead of a potential future outage.