Canaries were once sent into coal mines as an early warning sign against danger—for me, it was my Roomba failing to automatically search out dog hair and clean the floor under my sons’ dining room chairs. Cloud-enabled apps also were acting “weird”. Some were down, while others were just slow—although, in my view, slow may as well be down; who has the tolerance to wait for an application to load?
What caused this headache felt across the internet? An AWS disruption.
In the days since the December 7 outage, I’ve had a number of conversations with CIOs, colleagues and other industry analysts, and many of our shared concerns remain unanswered. Here are a few quick thoughts on the latest us-east-1 outage.
It’s important to remember that AWS requires instances/apps be deployed across two or more regions for redundancy and resiliency. An SLA is not only a way for service consumers to gauge the value they’re getting for their money; it’s also the contractual agreement between a cloud service provider (CSP) and a customer. It outlines the uptime, availability and performance standards to which a cloud service will adhere. It may also specify that the service provider’s helpdesk will respond to an outage within a set amount of time. Amazon has very clear SLAs. If a customer receives less than 99.99% uptime in a given month, AWS will issue credits to that customer’s account.
However, there are three things to keep in mind:
First, a 99.99% uptime agreement still means you are willing to tolerate services being down for almost four and a half minutes per month (about 53 minutes per year). Second, you must architect your AWS instances and applications in a way that meets the requirements of the AWS SLAs. Third, in AWS’s view, the outage wasn’t their fault.
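The downtime figures follow directly from the uptime percentage. As a rough sketch, assuming a 30-day month and a 365-day year (the function name `allowed_downtime_minutes` is my own and not part of any AWS tooling), the arithmetic looks like this:

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime a given uptime percentage permits over a period."""
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * (1 - uptime_percent / 100)


# Assumption: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.99, 30)
yearly = allowed_downtime_minutes(99.99, 365)

print(f"{monthly:.2f} min/month, {yearly:.2f} min/year")
# prints "4.32 min/month, 52.56 min/year"
```

The monthly figure shifts slightly with the length of the month (a 31-day month allows about 4.46 minutes), which is why "almost four and a half minutes" is a fair rule of thumb.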
The cloud allows organizations to regularly make budget versus resiliency decisions. However, the recent AWS outage didn’t just impact born-on-the-web startups. Mature businesses like Disney, McDonald’s and Amazon itself were impacted by the outage. These technology-first companies aren’t sacrificing resiliency to save budget. They understand that they are technology companies, they have large IT budgets that reflect this priority, and they were still blindsided.
The recurring question is: why were all these organizations crippled by a single Amazon region going down? Was it poor architecture? Did these AWS customers not fully understand AWS’s SLAs? Did they choose budget over resiliency?
Adam Selipsky’s keynote and industry analyst Q&A at re:Invent 2021 made it clear that AWS is completely focused on the enterprise. Amazon understands the needs of the DevOps community, but they are still working on developing their approach to the enterprise. Google Cloud recognized the same shortcomings in its own organization and brought Thomas Kurian from Oracle to build up the company’s enterprise and IT industry persona.
Everyone in the technology industry understands that outages and slow performance happen, but they expect service providers to provide best practices and avoid downtime. In response to Selipsky’s re:Invent keynote, Linda Jojo, executive vice president of technology and chief digital officer for United Airlines, said that the airline settled on a “one cloud” approach. Surely United Airlines can’t tolerate regular downtime.
December 7, 2021 won’t live in infamy for AWS; we have seen plenty of other outages.
Application resiliency is still an emerging area of focus. In the past, the focus was on platform and data center uptime; however, in many cases, developers have no control over the infrastructure. Therefore, teams need to focus on the resiliency of an application and assume that the underlying IaaS (cloud infrastructure) will fail for one reason or another.
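To make the "assume the infrastructure will fail" idea concrete, here is a minimal sketch of application-level failover across regions. Everything in it is illustrative: `call_with_failover`, `AllRegionsFailed` and `fake_request` are hypothetical names, the region list is an example, and the simulated outage stands in for a real network call. It is not an AWS API or a recommended production pattern, only one way the principle could look in code.

```python
from typing import Callable, Dict, Sequence


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed."""


def call_with_failover(regions: Sequence[str], request: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return request(region)
        except Exception as exc:  # treat any failure as "this region is unavailable"
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a real service call; simulates one region being down."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"served from {region}"


result = call_with_failover(["us-east-1", "us-west-2"], fake_request)
print(result)  # prints "served from us-west-2"
```

The point of the sketch is the design choice rather than the code itself: the application, not the platform, decides what happens when a region disappears. A real implementation would also need timeouts, health checks and data replicated to the fallback region.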
In addition, Amazon needs to provide more opinionated offerings. Disaster recovery and failover should be built into every level of AWS. Software companies shouldn’t be forced to sign up for multi-region deployments, but AWS should at least be prescriptive, with best practices that recommend them.